
Get Text Embeddings by Averaging Word Embeddings
textstat_embedding() takes a 'quanteda' dfm. embed_docs() is a more
versatile function that acts directly on either a character vector or
a column of texts in a dataframe.
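To make the idea of averaging word embeddings concrete, here is a minimal base R sketch. It is not the package's implementation: the `toy_model` matrix and the `average_words()` helper are hypothetical, the numbers are invented, and tokenization is a plain whitespace split rather than quanteda::tokens().

```r
# Toy 3-dimensional word embeddings; the values are made up for illustration.
toy_model <- rbind(
  this    = c(1, 0, 2),
  says    = c(0, 2, 0),
  one     = c(3, 1, 1),
  thing   = c(0, 1, 1),
  and     = c(1, 1, 1),
  another = c(2, 0, 4)
)
colnames(toy_model) <- paste0("dim_", 1:3)

# Hypothetical helper: lowercase, split on whitespace, then average the
# embedding rows of the tokens found in the model.
average_words <- function(text, model) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  tokens <- tokens[tokens %in% rownames(model)]  # drop out-of-vocabulary tokens
  colMeans(model[tokens, , drop = FALSE])
}

texts <- c("this says one thing", "and this says another")

# One row per document, one column per embedding dimension
doc_embeddings <- t(sapply(texts, average_words, model = toy_model))
rownames(doc_embeddings) <- paste0("text", seq_along(texts))
doc_embeddings
```

Each row of `doc_embeddings` is the unweighted mean of the word vectors in that document, which corresponds conceptually to method = "mean" with w = NULL.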
Usage
embed_docs(x, ...)
# Default S3 method
embed_docs(
  x,
  model,
  w = NULL,
  method = "mean",
  ...,
  tolower = TRUE,
  output_embeddings = FALSE
)
# S3 method for class 'data.frame'
embed_docs(
  x,
  text_col,
  model,
  id_col = NULL,
  w = NULL,
  method = "mean",
  ...,
  .keep_all = FALSE,
  tolower = TRUE,
  output_embeddings = FALSE
)
textstat_embedding(
  dfm,
  model,
  w = NULL,
  method = "mean",
  output_embeddings = FALSE
)
Arguments
- x
a character vector, a data frame, or data frame extension (e.g. a tibble)
- ...
additional parameters to pass to quanteda::tokens() or to the user-specified modeling function
- model
an embeddings object. For embed_docs(), model can alternatively be a function that takes a character vector and outputs a dataframe with a row for each element of the input.
- w
optional weighting for embeddings in model if model is an embeddings object. See average_embedding().
- method
method to use for averaging. See average_embedding(). Note that method = "median" does not use matrix operations and may therefore be slow for datasets with many documents.
- tolower
logical. Convert all text to lowercase? If model is an embeddings object, this value is passed to quanteda::dfm().
- output_embeddings
FALSE (the default) returns a tibble. TRUE returns an embeddings object. See 'Value' for details.
- text_col
string. a column of texts for which to compute embeddings
- id_col
optional string. column of unique document ids
- .keep_all
logical. Keep all columns from input? Ignored if output_embeddings = TRUE.
- dfm
a quanteda dfm
Value
If output_embeddings = FALSE, a tibble with columns doc_id, and dim_1, dim_2,
etc. or similar. If .keep_all = TRUE, the new columns will appear after
existing ones.
If output_embeddings = TRUE, an embeddings object with document ids as rownames.
Examples
texts <- c("this says one thing", "and this says another")
texts_embeddings <- embed_docs(texts, glove_twitter_25d)
texts_embeddings
#> # A tibble: 2 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 text1 0.114 0.167 0.180 -0.144 -0.0492 -0.465 1.77 -0.161 -0.414 -0.0989
#> 2 text2 -0.289 0.297 0.0145 -0.108 -0.248 -0.495 1.58 -0.234 -0.0946 -0.177
#> # ℹ 15 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> # dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>,
#> # dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> # dim_25 <dbl>
# quanteda workflow
library(quanteda)
#> Package version: 4.3.1
#> Unicode version: 15.1
#> ICU version: 74.2
#> Parallel computing: disabled
#> See https://quanteda.io for tutorials and examples.
texts_dfm <- dfm(tokens(texts))
texts_embeddings <- textstat_embedding(texts_dfm, glove_twitter_25d)
texts_embeddings
#> # A tibble: 2 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 text1 0.114 0.167 0.180 -0.144 -0.0492 -0.465 1.77 -0.161 -0.414 -0.0989
#> 2 text2 -0.289 0.297 0.0145 -0.108 -0.248 -0.495 1.58 -0.234 -0.0946 -0.177
#> # ℹ 15 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> # dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>,
#> # dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> # dim_25 <dbl>
# dplyr workflow
texts_df <- data.frame(text = texts)
texts_embeddings <- texts_df |> embed_docs("text", glove_twitter_25d)
texts_embeddings
#> # A tibble: 2 × 26
#> doc_id dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 text1 0.114 0.167 0.180 -0.144 -0.0492 -0.465 1.77 -0.161 -0.414 -0.0989
#> 2 text2 -0.289 0.297 0.0145 -0.108 -0.248 -0.495 1.58 -0.234 -0.0946 -0.177
#> # ℹ 15 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> # dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>,
#> # dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> # dim_25 <dbl>
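The data frame method's id_col, .keep_all, and output_embeddings arguments are not exercised above. The following sketch shows how they might be combined, using only the arguments documented on this page. It assumes the package and glove_twitter_25d are already loaded, as in the examples above; the `id` and `source` columns are invented for illustration, and no output is shown because it has not been run here.

```r
# Illustrative input: `id` and `source` are made-up columns
texts_df <- data.frame(
  id = c("a", "b"),
  source = c("memo", "email"),
  text = c("this says one thing", "and this says another")
)

# Take document ids from the `id` column and keep the input columns;
# per 'Value', the new embedding columns should appear after existing ones
kept <- embed_docs(
  texts_df, "text", glove_twitter_25d,
  id_col = "id",
  .keep_all = TRUE
)

# Return an embeddings object instead of a tibble;
# .keep_all is documented as ignored in this case
emb <- embed_docs(
  texts_df, "text", glove_twitter_25d,
  id_col = "id",
  output_embeddings = TRUE
)
```

Choosing output_embeddings = TRUE trades the tidy tibble layout for an embeddings object with document ids as rownames, which may be more convenient when the result is passed on to other embedding functions.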