Compare Two Embedding Models
Given two alternative embeddings of a set of tokens or documents, total_dist()
computes a global metric of the distance between the alternatives (by default
the Wasserstein distance).
Usage
total_dist(
  x,
  y,
  matching = NULL,
  method = c("euclidean", "minkowski", "cosine", "cosine_squished", "dot_prod"),
  average = FALSE,
  ...
)

average_sim(x, y, matching = NULL, method = "cosine", average = TRUE, ...)
Arguments
- x
an embeddings object
- y
an embeddings object. If matching = NULL, y must contain at least a few rownames matching those of x.
- matching
(optional) a named character vector specifying a one-to-one matching between rownames of x (names) and rownames of y (values)
- method
either the name of a method to compute similarity or distance, or a function that takes two vectors, x and y, and outputs a scalar, similar to those listed in Similarity and Distance Metrics
- average
logical. Should the rowwise distances be averaged as opposed to summed?
- ...
additional parameters to be passed to the method function
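
To make the shape of the matching argument concrete, here is a minimal base R sketch. It does not call the package: plain matrices with rownames stand in for embeddings objects, and the token names and values are made up for illustration.

```r
# Hypothetical toy data: plain matrices stand in for embeddings objects.
x <- matrix(c(1, 0,
              0, 1,
              1, 1), ncol = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog", "fish"), NULL))
y <- matrix(c(0, 2,
              2, 0), ncol = 2, byrow = TRUE,
            dimnames = list(c("chien", "chat"), NULL))

# Named character vector: names are rownames of x, values are rownames of y.
matching <- c(cat = "chat", dog = "chien")

# Align the two sets of rows according to the one-to-one matching.
x_matched <- x[names(matching), , drop = FALSE]
y_matched <- y[unname(matching), , drop = FALSE]
```

Rows of x that have no entry in the vector ("fish" here) are simply left out of the aligned pair in this sketch.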
Details
total_dist() computes the distance or similarity between the embeddings in x and their direct counterparts in y. average = TRUE returns the mean of these values, while average = FALSE (the default for total_dist()) returns the sum.
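
The difference between summing and averaging can be sketched in base R. This is an illustration under assumptions, not the package's implementation: plain matrices stand in for embeddings objects, rows are paired by shared rownames, and the distance is Euclidean.

```r
# Hypothetical toy data: plain matrices stand in for embeddings objects.
x <- matrix(c(0, 0,
              1, 1), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), NULL))
y <- matrix(c(3, 4,
              1, 1), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), NULL))

# Assumption: with no explicit matching, rows are paired by shared rownames.
shared <- intersect(rownames(x), rownames(y))
diffs <- x[shared, , drop = FALSE] - y[shared, , drop = FALSE]

# Rowwise Euclidean distance between each embedding and its counterpart.
row_dists <- sqrt(rowSums(diffs^2))

total <- sum(row_dists)     # analogous to average = FALSE
average <- mean(row_dists)  # analogous to average = TRUE
```

Here the rowwise distances are 5 and 0, so the sum is 5 and the mean is 2.5.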
For more information on available methods, see get_sims().
Setting method = "euclidean" and average = FALSE (the defaults for total_dist()) yields the Wasserstein distance when the embeddings are treated as equally weighted point masses.
average_sim() is identical to total_dist(), but with different defaults: method = "cosine" and average = TRUE.
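
As a rough base R sketch of what the average_sim() defaults correspond to, the following computes the mean rowwise cosine similarity. The helper cosine_sim() is hypothetical (written here for illustration, not taken from the package), and plain matrices again stand in for embeddings objects. It also shows the kind of two-vector, scalar-valued function that the method argument describes.

```r
# Hypothetical helper: takes two vectors and returns a scalar.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Hypothetical toy data with shared rownames.
x <- matrix(c(1, 0,
              0, 2), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), NULL))
y <- matrix(c(2, 0,
              3, 0), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), NULL))

# Cosine similarity between each row of x and its counterpart in y.
row_sims <- vapply(
  rownames(x),
  function(r) cosine_sim(x[r, ], y[r, ]),
  numeric(1)
)

avg_sim <- mean(row_sims)  # analogous to method = "cosine", average = TRUE
```

Row "a" points in the same direction in both matrices (similarity 1) and row "b" is orthogonal (similarity 0), so the average is 0.5.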