Row-wise Similarity and Distance Metrics
get_sims.Rd
get_sims(df, col1:col2, list(sim = vec2)) is essentially equivalent to mutate(rowwise(df), sim = cos_sim(c_across(col1:col2), vec2)). Includes methods for dataframes (in the style of dplyr), embeddings objects, and matrices.
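The row-wise dplyr pipeline that the description compares against can be sketched as follows. This is only an illustration of the equivalence, not the package's implementation: the cos_sim() defined here is a hypothetical stand-in written so the sketch runs without the package, and df and vec2 are toy values.

```r
library(dplyr)

# Hypothetical stand-in for the package's cos_sim(); an assumption made
# so that this sketch is self-contained.
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Toy data: two 2-dimensional embeddings, one per row
df <- tibble(id = c("a", "b"), col1 = c(1, 0), col2 = c(0, 1))
vec2 <- c(1, 0)

# Compute the similarity of each row's embedding to vec2, one row at a time
res <- df |>
  rowwise() |>
  mutate(sim = cos_sim(c_across(col1:col2), vec2)) |>
  ungroup()

res
```

Row "a" points in the same direction as vec2 and row "b" is orthogonal to it, so the sim column holds 1 and 0 respectively.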
Usage
get_sims(x, ...)
# S3 method for class 'embeddings'
get_sims(
x,
y,
method = c("cosine", "cosine_squished", "euclidean", "minkowski", "dot_prod",
"anchored"),
...
)
# S3 method for class 'data.frame'
get_sims(
x,
cols,
y,
method = c("cosine", "cosine_squished", "euclidean", "minkowski", "dot_prod",
"anchored"),
...,
.keep_all = "except.embeddings"
)
Arguments
- x
an embeddings object, matrix, or dataframe with one embedding per row
- ...
additional parameters to be passed to method function
- y
a named list of vectors with the same dimensionality as embeddings in x. Each item will result in a column in the output, showing the similarity of each embedding in x to the vector specified in y. When method = "anchored", each item of y should be a list with named vectors pos and neg.
- method
either the name of a method to compute similarity or distance, or a function that takes two vectors, x and y, and outputs a scalar, similar to those listed in Similarity and Distance Metrics
- cols
tidyselect - columns that contain numeric embedding values
- .keep_all
If TRUE, all columns from input are retained in output. If FALSE, only similarity metrics will be included. If "except.embeddings" (the default), all columns except those used to compute the similarity will be retained.
Details
Available Methods
When method is the name of one of the following supported methods, computations are done with matrix operations and are therefore much faster than row-by-row evaluation.
- cosine: cosine similarity
- cosine_squished: cosine similarity, rescaled to range from 0 to 1
- euclidean: Euclidean distance
- minkowski: Minkowski distance; requires parameter p. When p = 1 (the default), this is the Manhattan distance. When p = 2, it is the Euclidean distance. When p = Inf, it is the Chebyshev distance.
- dot_prod: Dot product
- anchored: x is projected onto the range between two anchor points, such that vectors aligned with pos are given a score of 1 and those aligned with neg are given a score of 0. For more on anchored vectors, see Data Science for Psychology: Natural Language, Chapter 20.
When method is a custom function, operations are performed for each row and may be slow for large inputs.
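The difference between the two paths can be illustrated in base R. The sketch below is an assumption about the general technique, not the package's code: it computes cosine similarity once with a per-row function call and once with a single matrix-vector product, using toy random data.

```r
set.seed(1)
emb <- matrix(rnorm(5 * 4), nrow = 5)  # 5 toy embeddings with 4 dimensions
v <- rnorm(4)                          # toy comparison vector

# Row-wise: a scalar function is called once per row
cos_rowwise <- apply(emb, 1, function(row) {
  sum(row * v) / (sqrt(sum(row^2)) * sqrt(sum(v^2)))
})

# Matrix operations: one matrix-vector product plus the row norms
cos_matrix <- as.vector(emb %*% v) / (sqrt(rowSums(emb^2)) * sqrt(sum(v^2)))

# Both approaches should agree up to floating-point tolerance
all.equal(cos_rowwise, cos_matrix)
```

The matrix form avoids one R function call per row, which is where the row-wise path tends to lose time on large inputs.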
Value
A tibble with the column doc_id and one column for each similarity metric. If .keep_all = TRUE or .keep_all = "except.embeddings", the new columns will appear after existing ones.
Examples
valence_embeddings <- predict(glove_twitter_25d, c("good", "bad"))
happy_vec <- predict(glove_twitter_25d, "happy")
sad_vec <- predict(glove_twitter_25d, "sad")
# Similarity of each embedding to happy_vec, using the default method
valence_embeddings |>
  get_sims(list(happy = happy_vec))
#> # A tibble: 2 × 2
#> doc_id happy
#> <chr> <dbl>
#> 1 good 0.883
#> 2 bad 0.707
# Anchored similarity: each item of y is a list with pos and neg vectors
valence_embeddings |>
  get_sims(
    list(happy = list(pos = happy_vec, neg = sad_vec)),
    anchored_sim
  )
#> # A tibble: 2 × 2
#> doc_id happy
#> <chr> <dbl>
#> 1 good 0.601
#> 2 bad 0.106
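# Custom method: any function of two vectors that returns a scalar
# (here, the Manhattan distance)
valence_embeddings |>
  get_sims(
    list(happy = happy_vec),
    method = function(x, y) sum(abs(x - y))
  )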
#> # A tibble: 2 × 2
#> doc_id happy
#> <chr> <dbl>
#> 1 good 9.70
#> 2 bad 17.0
# Data frame method: select the embedding columns with tidyselect
valence_df <- tibble::as_tibble(valence_embeddings, rownames = "token")
valence_df |> get_sims(
  dim_1:dim_25,
  list(happy = happy_vec, sad = sad_vec),
  .keep_all = TRUE
)
#> # A tibble: 2 × 28
#> token dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 good -0.544 0.603 -0.145 -0.0234 -0.138 0.601 2.19 0.208 -0.515 -0.231
#> 2 bad 0.414 0.0223 0.0565 -0.0105 0.274 0.713 1.64 -0.112 -0.262 0.108
#> # ℹ 17 more variables: dim_11 <dbl>, dim_12 <dbl>, dim_13 <dbl>, dim_14 <dbl>,
#> # dim_15 <dbl>, dim_16 <dbl>, dim_17 <dbl>, dim_18 <dbl>, dim_19 <dbl>,
#> # dim_20 <dbl>, dim_21 <dbl>, dim_22 <dbl>, dim_23 <dbl>, dim_24 <dbl>,
#> # dim_25 <dbl>, happy <dbl>, sad <dbl>