Overview
embedplyr enables common operations with word and text embeddings within a ‘tidyverse’ and/or ‘quanteda’ workflow, as demonstrated in Data Science for Psychology: Natural Language.
- load_embeddings() loads pretrained GloVe, word2vec, ConceptNet Numberbatch, and fastText word embedding models from Internet sources or from your working directory
- embed_tokens() returns the embedding for each token in a set of texts
- embed_docs() generates text embeddings for a set of documents
- get_sims() calculates row-wise similarity metrics between a set of embeddings and a given reference
- average_embedding() calculates the (weighted) average of multiple embeddings
- reduce_dimensionality() reduces the dimensionality of embeddings
- normalize() and normalize_rows() normalize embeddings to the unit hypersphere
- and more…
Installation
You can install the development version of embedplyr from GitHub with:
remotes::install_github("rimonim/embedplyr")
Functionality
embedplyr is designed to facilitate the use of word and text embeddings in common data manipulation and text analysis workflows, without introducing new syntax or unfamiliar data structures.
embedplyr is model agnostic; it can be used to work with embeddings from decontextualized models like GloVe and word2vec, or from contextualized models like BERT or others made available through the ‘text’ package.
Loading Pretrained Embeddings
embedplyr won’t help you train new embedding models, but it can load embeddings from a file or download them from the Internet. This is especially useful for pretrained word embedding models like GloVe, word2vec, and fastText. Dozens of these models can be conveniently downloaded from online sources with load_embeddings().
library(embedplyr)
glove_twitter_25d <- load_embeddings("glove.twitter.27B.25d")
The outcome is an embeddings object. An embeddings object is just a numeric matrix with fast hash table indexing by rownames (generally tokens). This means that it can be easily coerced to a dataframe or tibble, while also allowing special embeddings-specific methods and functions, such as predict.embeddings() and find_nearest():
moral_embeddings <- predict(glove_twitter_25d, c("good", "bad"))
moral_embeddings
#> # 25-dimensional embeddings with 2 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> good -0.54 0.60 -0.15 -0.02 -0.14 0.60 2.19 0.21 -0.52 -0.23 ...
#> bad 0.41 0.02 0.06 -0.01 0.27 0.71 1.64 -0.11 -0.26 0.11 ...
find_nearest(glove_twitter_25d, "dog", 5L, method = "cosine")
#> # 25-dimensional embeddings with 5 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> dog -1.24 -0.36 0.57 0.37 0.60 -0.19 1.27 -0.37 0.09 0.40 ...
#> cat -0.96 -0.61 0.67 0.35 0.41 -0.21 1.38 0.13 0.32 0.66 ...
#> dogs -0.63 -0.11 0.22 0.27 0.28 0.13 1.44 -1.18 -0.26 0.60 ...
#> horse -0.76 -0.63 0.43 0.04 0.25 -0.18 1.08 -0.94 0.30 0.07 ...
#> monkey -0.96 -0.38 0.49 0.66 0.21 -0.09 1.28 -0.11 0.27 0.42 ...
Whereas indexing a regular matrix by rownames gets slower as the number of rows increases, embedplyr’s hash table indexing means that token embeddings can be retrieved in milliseconds, even from models with millions of rows.
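As a rough illustration (a sketch; timings will vary by machine), you can time a lookup with predict(). Retrieving a handful of tokens from the full Twitter GloVe vocabulary should take only a fraction of a second:
# time the retrieval of a few token embeddings from a large model
system.time(predict(glove_twitter_25d, c("dog", "cat", "pizza")))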
Similarity Metrics
Functions for similarity and distance metrics are as simple as possible; each one takes in vectors and outputs a scalar.
vec1 <- c(1, 5, 2)
vec2 <- c(4, 2, 2)
vec3 <- c(-1, -2, -13)
dot_prod(vec1, vec2) # dot product
#> [1] 18
cos_sim(vec1, vec2) # cosine similarity
#> [1] 0.6708204
euc_dist(vec1, vec2) # Euclidean distance
#> [1] 4.242641
anchored_sim(vec1, pos = vec2, neg = vec3) # projection to an anchored vector
#> [1] 0.9887218
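For reference, cosine similarity is just the dot product of the two vectors divided by the product of their magnitudes. A quick base-R check (a sketch, not an embedplyr function) reproduces the value above:
# cosine similarity by hand: dot product over the product of the magnitudes
dot_prod(vec1, vec2) / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
#> [1] 0.6708204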
Example Tidy Workflow
Given a tidy dataframe of texts, embed_docs() will generate embeddings by averaging the embeddings of words in each text (for more information on why this works well, see Data Science for Psychology, Chapter 18). By default, embed_docs() uses a simple unweighted mean, but other averaging methods are available.
library(dplyr)
valence_df <- tribble(
~id, ~text,
"positive", "happy awesome cool nice",
"neutral", "ok fine sure whatever",
"negative", "sad bad horrible angry"
)
valence_embeddings_df <- valence_df |>
embed_docs("text", glove_twitter_25d, id_col = "id", .keep_all = TRUE)
valence_embeddings_df
#> # A tibble: 3 × 27
#> id text dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 positive happy a… -0.584 -0.0810 -0.00361 -0.381 0.0786 0.646 1.66 0.543
#> 2 neutral ok fine… -0.0293 0.169 -0.226 -0.175 -0.389 -0.0313 1.22 0.222
#> 3 negative sad bad… 0.296 -0.244 0.150 0.0809 0.155 0.728 1.51 0.122
#> # ℹ 17 more variables: dim_9 <dbl>, dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>,
#> # dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> # dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>,
#> # dim_23 <dbl>, dim_24 <dbl>, dim_25 <dbl>
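Because the default is an unweighted mean, each row above is simply the column mean of the embeddings of the tokens in that document. A quick sanity check (a sketch using base R and the predict() method shown earlier):
# the "positive" document embedding is the unweighted mean of its token embeddings
pos_tokens <- predict(glove_twitter_25d, c("happy", "awesome", "cool", "nice"))
colMeans(pos_tokens) # values should match the "positive" row above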
embed_docs() can also be used to generate other types of embeddings. For example, we can use the ‘text’ package to generate embeddings using any model available from Huggingface transformers.
# a function that takes a character vector and outputs a data frame of embeddings
sbert_embeddings <- function(texts) {
text::textEmbed(
texts,
model = "sentence-transformers/all-MiniLM-L12-v2", # model name
layers = -2, # second to last layer (default)
tokens_select = "[CLS]", # use only [CLS] token
dim_name = FALSE,
keep_token_embeddings = FALSE
)$texts[[1]]
}
valence_sbert_df <- valence_df |>
embed_docs("text", sbert_embeddings, id_col = "id", .keep_all = TRUE)
To quantify how good and how intense the texts are, we can compare them to the embeddings for “good” and “intense” using get_sims(). Note that this step requires only a dataframe, tibble, or embeddings object with numeric columns; the embeddings can come from any source.
good_vec <- predict(glove_twitter_25d, "good")
intense_vec <- predict(glove_twitter_25d, "intense")
valence_quantified <- valence_embeddings_df |>
get_sims(
dim_1:dim_25,
list(
good = good_vec,
intense = intense_vec
)
)
valence_quantified
#> # A tibble: 3 × 4
#> id text good intense
#> <chr> <chr> <dbl> <dbl>
#> 1 positive happy awesome cool nice 0.958 0.585
#> 2 neutral ok fine sure whatever 0.909 0.535
#> 3 negative sad bad horrible angry 0.848 0.747
Example Quanteda Workflow
library(quanteda)
# corpus
valence_corp <- corpus(valence_df, docid_field = "id")
valence_corp
#> Corpus consisting of 3 documents.
#> positive :
#> "happy awesome cool nice"
#>
#> neutral :
#> "ok fine sure whatever"
#>
#> negative :
#> "sad bad horrible angry"
# dfm
valence_dfm <- valence_corp |>
tokens() |>
dfm()
# compute embeddings
valence_embeddings_df <- valence_dfm |>
textstat_embedding(glove_twitter_25d)
valence_embeddings_df
#> # A tibble: 3 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 positive -0.584 -0.0810 -0.00361 -0.381 0.0786 0.646 1.66 0.543 -0.830
#> 2 neutral -0.0293 0.169 -0.226 -0.175 -0.389 -0.0313 1.22 0.222 -0.394
#> 3 negative 0.296 -0.244 0.150 0.0809 0.155 0.728 1.51 0.122 -0.588
#> # ℹ 16 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>,
#> # dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#> # dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>,
#> # dim_24 <dbl>, dim_25 <dbl>
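Because the output is an ordinary tibble with the same dim_1 through dim_25 columns, it can feed into the same downstream steps as the tidy workflow. For example (a sketch reusing good_vec and intense_vec from above):
# quantify the quanteda-derived embeddings with the same reference vectors
valence_dfm_quantified <- valence_embeddings_df |>
  get_sims(dim_1:dim_25, list(good = good_vec, intense = intense_vec))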
Other Functions
Reduce Dimensionality
It is sometimes useful to reduce the dimensionality of embeddings. This is done with reduce_dimensionality(), which by default performs PCA without column normalization.
valence_df_2d <- valence_embeddings_df |>
reduce_dimensionality(dim_1:dim_25, 2)
valence_df_2d
#> # A tibble: 3 × 3
#> doc_id PC1 PC2
#> * <chr> <dbl> <dbl>
#> 1 positive -1.47 0.494
#> 2 neutral 0.121 -1.13
#> 3 negative 1.35 0.640
reduce_dimensionality() can also be used to apply the same rotation to other embeddings that were not used to find the principal components.
new_embeddings <- predict(glove_twitter_25d, c("new", "strange"))
# get rotation with `output_rotation = TRUE`
valence_rotation_2d <- valence_embeddings_df |>
reduce_dimensionality(dim_1:dim_25, 2, output_rotation = TRUE)
# apply the same rotation to new embeddings
new_with_valence_rotation <- new_embeddings |>
reduce_dimensionality(custom_rotation = valence_rotation_2d)
new_with_valence_rotation
#> # 2-dimensional embeddings with 2 rows
#> PC1 PC2
#> new -2.38 0.24
#> strange 0.09 1.18
Normalize (Scale Embeddings to the Unit Hypersphere)
normalize() and normalize_rows() scale embeddings such that their magnitude is 1 while their direction is unchanged.
normalize(good_vec)
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6
#> -0.090587846 0.100363800 -0.024215926 -0.003896062 -0.022930449 0.100135678
#> dim_7 dim_8 dim_9 dim_10 dim_11 dim_12
#> 0.364995604 0.034641280 -0.085813930 -0.038466074 -0.133854478 0.094747331
#> dim_13 dim_14 dim_15 dim_16 dim_17 dim_18
#> -0.836459360 0.044137493 0.079744546 -0.099664447 0.093466849 -0.181581983
#> dim_19 dim_20 dim_21 dim_22 dim_23 dim_24
#> -0.087563977 0.020824065 -0.037671809 0.040843874 -0.076207818 0.154222299
#> dim_25
#> 0.003684091
normalize(moral_embeddings)
#> # 25-dimensional embeddings with 2 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> good -0.09 0.10 -0.02 -0.00 -0.02 0.10 0.36 0.03 -0.09 -0.04 ...
#> bad 0.08 0.00 0.01 -0.00 0.05 0.13 0.31 -0.02 -0.05 0.02 ...
valence_embeddings_df |> normalize_rows(dim_1:dim_25)
#> # A tibble: 3 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 posit… -0.118 -0.0163 -7.26e-4 -0.0767 0.0158 0.130 0.334 0.109 -0.167
#> 2 neutr… -0.00633 0.0365 -4.87e-2 -0.0377 -0.0839 -0.00675 0.262 0.0479 -0.0850
#> 3 negat… 0.0666 -0.0549 3.38e-2 0.0182 0.0347 0.164 0.339 0.0274 -0.132
#> # ℹ 16 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>,
#> # dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#> # dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>,
#> # dim_24 <dbl>, dim_25 <dbl>
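As a quick sanity check (a base-R sketch), a normalized embedding should have a Euclidean norm of 1:
# the magnitude of the normalized vector is 1 (up to floating-point error)
sqrt(sum(normalize(good_vec)^2))
#> [1] 1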