Word Embedding Research Framework for Psychological Science.
PsychWordVec
An integrative toolbox of word embedding research that provides:
- A collection of pre-trained static word vectors in the .RData compressed format;
- A series of functions to process, analyze, and visualize word vectors;
- A range of tests to examine conceptual associations, including the Word Embedding Association Test (Caliskan et al., 2017) and the Relative Norm Distance (Garg et al., 2018), with permutation test of significance;
- A set of training methods to locally train (static) word vectors from text corpora, including Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017);
- A group of functions to download pre-trained language models (e.g., GPT, BERT) and extract contextualized (dynamic) word vectors (based on the R package text).
⚠️ All users should update the package to version ≥ 0.3.2. Old versions may have slow processing speed and other problems.
Author
Han-Wu-Shuang (Bruce) Bao 包寒吴霜
Citation
- Bao, H.-W.-S. (2022). PsychWordVec: Word embedding research framework for psychological science. https://CRAN.R-project.org/package=PsychWordVec
  - Note: This is the original citation format. Please refer to the information printed when you run `library(PsychWordVec)` for the APA-7 format of your installed version.
- Bao, H.-W.-S., Wang, Z.-X., Cheng, X., Su, Z., Yang, Y., Zhang, G.-Y., Wang, B., & Cai, H. (2023). Using word embeddings to investigate human psychology: Methods and applications. Advances in Psychological Science, 31(6), 887–904.
  [包寒吴霜, 王梓西, 程曦, 苏展, 杨盈, 张光耀, 王博, 蔡华俭. (2023). 基于词嵌入技术的心理学研究:方法及应用. 心理科学进展, 31(6), 887–904.]
Installation
## Method 1: Install from CRAN
install.packages("PsychWordVec")
## Method 2: Install from GitHub
install.packages("devtools")
devtools::install_github("psychbruce/PsychWordVec", force=TRUE)
Types of Data for PsychWordVec
| | embed | wordvec |
|---|---|---|
| Basic class | matrix | data.table |
| Row size | vocabulary size | vocabulary size |
| Column size | dimension size | 2 (variables: `word`, `vec`) |
| Advantage | faster (with matrix operation) | easier to inspect and manage |
| Function to get | `as_embed()` | `as_wordvec()` |
| Function to load | `load_embed()` | `load_wordvec()` |
Note: Word embedding refers to a natural language processing technique that embeds word semantics into a low-dimensional embedding matrix, with each word (actually token) quantified as a numeric vector representing its (uninterpretable) semantic features. Users are advised to import word vector data as the `embed` class using the function `load_embed()`, which automatically normalizes all word vectors to unit length 1 (see the `normalize()` function) and speeds up most functions in `PsychWordVec`.
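The two layouts and the unit-length normalization described in the note can be illustrated in base R without loading the package. This is a minimal sketch under stated assumptions: the toy words and values are made up, and the helper `normalize_rows()` is hypothetical. It is not the package's `normalize()` or `as_wordvec()` implementation.

```r
# Toy embedding matrix: rows are words, columns are dimensions
# (hypothetical values, for illustration only).
emb <- matrix(
  c(3, 4, 0,
    1, 2, 2,
    0, 0, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cat", "dog", "car"), paste0("dim", 1:3))
)

# Hypothetical helper: rescale every row vector to unit length (L2 norm = 1).
normalize_rows <- function(m) {
  m / sqrt(rowSums(m^2))
}

emb_unit <- normalize_rows(emb)
row_norms <- sqrt(rowSums(emb_unit^2))

# A two-column layout (word, vec), analogous in spirit to the wordvec idea:
# one row per word, with the whole vector stored in a list column.
wv <- data.frame(word = rownames(emb_unit))
wv$vec <- lapply(seq_len(nrow(emb_unit)), function(i) unname(emb_unit[i, ]))

# For unit-length vectors, cosine similarity reduces to a dot product,
# so all pairwise similarities come from a single matrix product.
sim <- emb_unit %*% t(emb_unit)
```

The last step suggests why a normalized matrix layout tends to be faster: similarity computations become plain matrix operations, whereas the two-column layout is easier to print and filter.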
Functions in PsychWordVec
- Word Embeddings Data Management and Transformation
  - `as_embed()`: from `wordvec` (data.table) to `embed` (matrix)
  - `as_wordvec()`: from `embed` (matrix) to `wordvec` (data.table)
  - `load_embed()`: load word embeddings data as `embed` (matrix)
  - `load_wordvec()`: load word embeddings data as `wordvec` (data.table)
  - `data_transform()`: transform plain text word vectors to `wordvec` or `embed`
- Word Vectors Extraction, Linear Operation, and Visualization
  - `subset()`: extract a subset of `wordvec` and `embed`
  - `normalize()`: normalize all word vectors to the unit length 1
  - `get_wordvec()`: extract word vectors
  - `sum_wordvec()`: calculate the sum vector of multiple words
  - `plot_wordvec()`: visualize word vectors
  - `plot_wordvec_tSNE()`: 2D or 3D visualization with t-SNE
  - `orth_procrustes()`: Orthogonal Procrustes matrix alignment
- Word Semantic Similarity Analysis, Network Analysis, and Association Test
  - `cosine_similarity()`: `cos_sim()` or `cos_dist()`
  - `pair_similarity()`: compute a similarity matrix of word pairs
  - `plot_similarity()`: visualize similarities of word pairs
  - `tab_similarity()`: tabulate similarities of word pairs
  - `most_similar()`: find the Top-N most similar words
  - `plot_network()`: visualize a (partial correlation) network graph of words
  - `test_WEAT()`: WEAT and SC-WEAT with permutation test of significance
  - `test_RND()`: RND with permutation test of significance
- Dictionary Automatic Expansion and Reliability Analysis
  - `dict_expand()`: expand a dictionary from the most similar words
  - `dict_reliability()`: reliability analysis and PCA of a dictionary
- Local Training of Static Word Embeddings (Word2Vec, GloVe, and FastText)
  - `tokenize()`: tokenize raw text
  - `train_wordvec()`: train static word embeddings
- Pre-trained Language Models (PLM) and Contextualized Word Embeddings
  - `text_init()`: set up a Python environment for PLM
  - `text_model_download()`: download PLMs from Hugging Face to local ".cache" folder
  - `text_model_remove()`: remove PLMs from local ".cache" folder
  - `text_to_vec()`: extract contextualized token and text embeddings
  - `text_unmask()`: (deprecated; please use FMAT) fill in the blank mask(s) in a query
See the documentation (help pages) for their usage and details.
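To make the association tests listed above more concrete, the following base R sketch computes cosine similarity and a WEAT-style effect size with a permutation test, following the definitions in Caliskan et al. (2017). It is only an illustration under stated assumptions: the helper names `cos_sim_toy()`, `assoc_toy()`, and `weat_toy()` are hypothetical, the 2-dimensional vectors are invented, and the permutation test uses random resampling with a one-sided p value. The package's `test_WEAT()` may differ in its arguments, defaults, and output.

```r
# Cosine similarity between two numeric vectors.
cos_sim_toy <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Association of one word vector w with attribute sets A and B
# (each row of A and B is a word vector):
# mean cosine similarity to A minus mean cosine similarity to B.
assoc_toy <- function(w, A, B) {
  mean(apply(A, 1, cos_sim_toy, b = w)) - mean(apply(B, 1, cos_sim_toy, b = w))
}

# WEAT-style effect size and a random permutation test.
weat_toy <- function(X, Y, A, B, n_perm = 1000, seed = 1) {
  s_x <- apply(X, 1, assoc_toy, A = A, B = B)  # associations of target set X
  s_y <- apply(Y, 1, assoc_toy, A = A, B = B)  # associations of target set Y
  s_all <- c(s_x, s_y)
  effect <- (mean(s_x) - mean(s_y)) / sd(s_all)
  stat <- sum(s_x) - sum(s_y)
  # Reassign target words to two groups of the original sizes at random
  # and recompute the test statistic each time.
  set.seed(seed)
  perm_stats <- replicate(n_perm, {
    idx <- sample(length(s_all), length(s_x))
    sum(s_all[idx]) - sum(s_all[-idx])
  })
  # One-sided p value: share of permuted statistics at least as large.
  list(effect_size = effect, statistic = stat,
       p_value = mean(perm_stats >= stat))
}

# Invented toy vectors: X lies close to A, Y lies close to B.
A <- rbind(c(1.0, 0.0), c(0.9, 0.1))
B <- rbind(c(0.0, 1.0), c(0.1, 0.9))
X <- rbind(c(1.0, 0.2), c(0.8, 0.1), c(0.9, 0.3))
Y <- rbind(c(0.2, 1.0), c(0.1, 0.8), c(0.3, 0.9))

res <- weat_toy(X, Y, A, B)
print(res$effect_size)
print(res$p_value)
```

With only three words per target set there are just 20 distinct equal-size splits, so the p value from such a toy example is coarse; real analyses would use larger word lists and actual pre-trained vectors.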