Overview
embedplyr enables common operations with word and text embeddings within a ‘tidyverse’ and/or ‘quanteda’ workflow, as demonstrated in Data Science for Psychology: Natural Language.
- load_embeddings() loads pretrained GloVe, word2vec, ConceptNet Numberbatch, and fastText word embedding models from Internet sources or from your working directory
- embed_tokens() returns the embedding for each token in a set of texts
- embed_docs() generates text embeddings for a set of documents
- get_sims() calculates row-wise similarity metrics between a set of embeddings and a given reference
- average_embedding() calculates the (weighted) average of multiple embeddings
- reduce_dimensionality() reduces the dimensionality of embeddings
- normalize() and normalize_rows() normalize embeddings to the unit hypersphere
- and more…
Installation
You can install the development version of embedplyr from GitHub with:
remotes::install_github("rimonim/embedplyr")
Functionality
embedplyr is designed to facilitate the use of word and text embeddings in common data manipulation and text analysis workflows, without introducing new syntax or unfamiliar data structures.
embedplyr is model agnostic; it can be used to work with embeddings from decontextualized models like GloVe and word2vec, or from contextualized models like BERT or others made available through the ‘text’ package.
Loading Pretrained Embeddings
embedplyr won’t help you train new embedding models, but it can load embeddings from a file or download them from the Internet. This is especially useful for pretrained word embedding models like GloVe, word2vec, and fastText. Dozens of these models can be conveniently downloaded from online sources with load_embeddings().
library(embedplyr)
glove_twitter_25d <- load_embeddings("glove.twitter.27B.25d")
The outcome is an embeddings object. An embeddings object is just a numeric matrix with fast hash table indexing by rownames (generally tokens). This means that it can be easily coerced to a dataframe or tibble, while also allowing special embeddings-specific methods and functions, such as predict.embeddings() and find_nearest():
moral_embeddings <- predict(glove_twitter_25d, c("good", "bad"))
moral_embeddings
#> # 25-dimensional embeddings with 2 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> good -0.54 0.60 -0.15 -0.02 -0.14 0.60 2.19 0.21 -0.52 -0.23 ...
#> bad 0.41 0.02 0.06 -0.01 0.27 0.71 1.64 -0.11 -0.26 0.11 ...
find_nearest(glove_twitter_25d, "dog", 5L, method = "cosine")
#> # 25-dimensional embeddings with 5 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> dog -1.24 -0.36 0.57 0.37 0.60 -0.19 1.27 -0.37 0.09 0.40 ...
#> cat -0.96 -0.61 0.67 0.35 0.41 -0.21 1.38 0.13 0.32 0.66 ...
#> dogs -0.63 -0.11 0.22 0.27 0.28 0.13 1.44 -1.18 -0.26 0.60 ...
#> horse -0.76 -0.63 0.43 0.04 0.25 -0.18 1.08 -0.94 0.30 0.07 ...
#> monkey -0.96 -0.38 0.49 0.66 0.21 -0.09 1.28 -0.11 0.27 0.42 ...
Whereas indexing a regular matrix by rownames gets slower as the number of rows increases, embedplyr’s hash table indexing means that token embeddings can be retrieved in milliseconds, even from models with millions of rows.
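As a rough illustration (a sketch; timings will vary by machine), you can time a lookup with predict(). Retrieving a handful of tokens from the full Twitter GloVe vocabulary should take only a fraction of a second:
# time the retrieval of a few token embeddings from a large model
system.time(predict(glove_twitter_25d, c("dog", "cat", "pizza")))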
Similarity Metrics
Functions for similarity and distance metrics are as simple as possible; each one takes in vectors and outputs a scalar.
vec1 <- c(1, 5, 2)
vec2 <- c(4, 2, 2)
vec3 <- c(-1, -2, -13)
dot_prod(vec1, vec2) # dot product
#> [1] 18
cos_sim(vec1, vec2) # cosine similarity
#> [1] 0.6708204
euc_dist(vec1, vec2) # Euclidean distance
#> [1] 4.242641
anchored_sim(vec1, pos = vec2, neg = vec3) # projection to an anchored vector
#> [1] 0.9887218
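For reference, cosine similarity is just the dot product of the two vectors divided by the product of their magnitudes. A quick base-R check (a sketch, not an embedplyr function) reproduces the value above:
# cosine similarity by hand: dot product over the product of the magnitudes
dot_prod(vec1, vec2) / (sqrt(sum(vec1^2)) * sqrt(sum(vec2^2)))
#> [1] 0.6708204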
Example Tidy Workflow
Given a tidy dataframe of texts, embed_docs() will generate embeddings by averaging the embeddings of words in each text (for more information on why this works well, see Data Science for Psychology, Chapter 18). By default, embed_docs() uses a simple unweighted mean, but other averaging methods are available.
library(dplyr)
valence_df <- tribble(
~id, ~text,
"positive", "happy awesome cool nice",
"neutral", "ok fine sure whatever",
"negative", "sad bad horrible angry"
)
valence_embeddings_df <- valence_df |>
embed_docs("text", glove_twitter_25d, id_col = "id", .keep_all = TRUE)
valence_embeddings_df
#> # A tibble: 3 × 27
#> id text dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 positive happy a… -0.584 -0.0810 -0.00361 -0.381 0.0786 0.646 1.66 0.543
#> 2 neutral ok fine… -0.0293 0.169 -0.226 -0.175 -0.389 -0.0313 1.22 0.222
#> 3 negative sad bad… 0.296 -0.244 0.150 0.0809 0.155 0.728 1.51 0.122
#> # ℹ 17 more variables: dim_9 <dbl>, dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>,
#> # dim_13 <dbl>, dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>,
#> # dim_18 <dbl>, dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>,
#> # dim_23 <dbl>, dim_24 <dbl>, dim_25 <dbl>
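Because the default is an unweighted mean, each row above is simply the column mean of the embeddings of the tokens in that document. A quick sanity check (a sketch using base R and the predict() method shown earlier):
# the "positive" document embedding is the unweighted mean of its token embeddings
pos_tokens <- predict(glove_twitter_25d, c("happy", "awesome", "cool", "nice"))
colMeans(pos_tokens) # values should match the "positive" row above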
embed_docs() can also be used to generate other types of embeddings. For example, we can use the ‘text’ package to generate embeddings using any model available from Huggingface transformers.
# a function that takes a character vector and outputs a data frame of embeddings
sbert_embeddings <- function(texts) {
text::textEmbed(
texts,
model = "sentence-transformers/all-MiniLM-L12-v2", # model name
layers = -2, # second to last layer (default)
tokens_select = "[CLS]", # use only [CLS] token
dim_name = FALSE,
keep_token_embeddings = FALSE
)$texts[[1]]
}
valence_sbert_df <- valence_df |>
embed_docs("text", sbert_embeddings, id_col = "id", .keep_all = TRUE)
To quantify how good and how intense the texts are, we can compare them to the embeddings for “good” and “intense” using get_sims(). Note that this step requires only a dataframe, tibble, or embeddings object with numeric columns; the embeddings can come from any source.
good_vec <- predict(glove_twitter_25d, "good")
intense_vec <- predict(glove_twitter_25d, "intense")
valence_quantified <- valence_embeddings_df |>
get_sims(
dim_1:dim_25,
list(
good = good_vec,
intense = intense_vec
)
)
valence_quantified
#> # A tibble: 3 × 4
#> id text good intense
#> <chr> <chr> <dbl> <dbl>
#> 1 positive happy awesome cool nice 0.958 0.585
#> 2 neutral ok fine sure whatever 0.909 0.535
#> 3 negative sad bad horrible angry 0.848 0.747
Example Quanteda Workflow
library(quanteda)
# corpus
valence_corp <- corpus(valence_df, docid_field = "id")
valence_corp
#> Corpus consisting of 3 documents.
#> positive :
#> "happy awesome cool nice"
#>
#> neutral :
#> "ok fine sure whatever"
#>
#> negative :
#> "sad bad horrible angry"
# dfm
valence_dfm <- valence_corp |>
tokens() |>
dfm()
# compute embeddings
valence_embeddings_df <- valence_dfm |>
textstat_embedding(glove_twitter_25d)
valence_embeddings_df
#> # A tibble: 3 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 positive -0.584 -0.0810 -0.00361 -0.381 0.0786 0.646 1.66 0.543 -0.830
#> 2 neutral -0.0293 0.169 -0.226 -0.175 -0.389 -0.0313 1.22 0.222 -0.394
#> 3 negative 0.296 -0.244 0.150 0.0809 0.155 0.728 1.51 0.122 -0.588
#> # ℹ 16 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>,
#> # dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#> # dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>,
#> # dim_24 <dbl>, dim_25 <dbl>
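Because the output is an ordinary tibble with the same dim_1 through dim_25 columns, it can feed into the same downstream steps as the tidy workflow. For example (a sketch reusing good_vec and intense_vec from above):
# quantify the quanteda-derived embeddings with the same reference vectors
valence_dfm_quantified <- valence_embeddings_df |>
  get_sims(dim_1:dim_25, list(good = good_vec, intense = intense_vec))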
Other Functions
Reduce Dimensionality
It is sometimes useful to reduce the dimensionality of embeddings. This is done with reduce_dimensionality(), which by default performs PCA without column normalization.
valence_df_2d <- valence_embeddings_df |>
reduce_dimensionality(dim_1:dim_25, 2)
valence_df_2d
#> # A tibble: 3 × 3
#> doc_id PC1 PC2
#> * <chr> <dbl> <dbl>
#> 1 positive -1.47 0.494
#> 2 neutral 0.121 -1.13
#> 3 negative 1.35 0.640
reduce_dimensionality() can also be used to apply the same rotation to other embeddings that were not used to find the principal components.
new_embeddings <- predict(glove_twitter_25d, c("new", "strange"))
# get rotation with `output_rotation = TRUE`
valence_rotation_2d <- valence_embeddings_df |>
reduce_dimensionality(dim_1:dim_25, 2, output_rotation = TRUE)
# apply the same rotation to new embeddings
new_with_valence_rotation <- new_embeddings |>
reduce_dimensionality(custom_rotation = valence_rotation_2d)
new_with_valence_rotation
#> # 2-dimensional embeddings with 2 rows
#> PC1 PC2
#> new -2.38 0.24
#> strange 0.09 1.18
Normalize (Scale Embeddings to the Unit Hypersphere)
normalize() and normalize_rows() scale embeddings such that their magnitude is 1 while their direction is unchanged.
normalize(good_vec)
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6
#> -0.090587846 0.100363800 -0.024215926 -0.003896062 -0.022930449 0.100135678
#> dim_7 dim_8 dim_9 dim_10 dim_11 dim_12
#> 0.364995604 0.034641280 -0.085813930 -0.038466074 -0.133854478 0.094747331
#> dim_13 dim_14 dim_15 dim_16 dim_17 dim_18
#> -0.836459360 0.044137493 0.079744546 -0.099664447 0.093466849 -0.181581983
#> dim_19 dim_20 dim_21 dim_22 dim_23 dim_24
#> -0.087563977 0.020824065 -0.037671809 0.040843874 -0.076207818 0.154222299
#> dim_25
#> 0.003684091
normalize(moral_embeddings)
#> # 25-dimensional embeddings with 2 rows
#> dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim..
#> good -0.09 0.10 -0.02 -0.00 -0.02 0.10 0.36 0.03 -0.09 -0.04 ...
#> bad 0.08 0.00 0.01 -0.00 0.05 0.13 0.31 -0.02 -0.05 0.02 ...
valence_embeddings_df |> normalize_rows(dim_1:dim_25)
#> # A tibble: 3 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 posit… -0.118 -0.0163 -7.26e-4 -0.0767 0.0158 0.130 0.334 0.109 -0.167
#> 2 neutr… -0.00633 0.0365 -4.87e-2 -0.0377 -0.0839 -0.00675 0.262 0.0479 -0.0850
#> 3 negat… 0.0666 -0.0549 3.38e-2 0.0182 0.0347 0.164 0.339 0.0274 -0.132
#> # ℹ 16 more variables: dim_10 <dbl>, dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>,
#> # dim_14 <dbl>, dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>,
#> # dim_19 <dbl>, dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>,
#> # dim_24 <dbl>, dim_25 <dbl>
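As a quick sanity check (a base-R sketch), a normalized embedding should have a Euclidean norm of 1:
# the magnitude of the normalized vector is 1 (up to floating-point error)
sqrt(sum(normalize(good_vec)^2))
#> [1] 1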