Average Embeddings
average_embedding.Rd
Calculate the (weighted) average of multiple embeddings.
Arguments
- x
an embeddings object or list of embeddings objects
- w
optional weighting for rows in x. This can be an unnamed numeric vector with one item per row of x, a named numeric vector of any length with names that match the row names of x, "trillion_word" (125,000 English word frequencies from Peter Norvig's compilation, derived from the Google Web Trillion Word Corpus), or "trillion_word_sif" for smooth inverse frequencies (SIF) calculated using the same list.
- method
method to use for averaging. "mean" (the default) is the standard arithmetic mean. "median" is the geometric median (also called spatial median or L1-median), computed using Gmedian::Gmedian() or, if weights are provided, Gmedian::Weiszfeld(). "sum" is the (weighted) sum.
- ...
additional arguments to be passed to the averaging function
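The three averaging methods can be illustrated with a small base R sketch. The toy matrix, the weights, and the helper weiszfeld() are hypothetical and written only for illustration; the helper is a simplified stand-in for the idea behind Gmedian::Weiszfeld(), not the package's actual implementation.

```r
# Toy embeddings: 3 tokens (rows) by 2 dimensions (columns); values are made up.
emb <- matrix(c(1, 0,
                0, 1,
                1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("happy", "joy", "smile"), c("dim_1", "dim_2")))
w <- c(2, 1, 1)  # hypothetical per-row weights

# "mean": weighted arithmetic mean of the rows
weighted_mean <- colSums(emb * w) / sum(w)

# "sum": weighted sum of the rows
weighted_sum <- colSums(emb * w)

# "median": geometric median via a plain Weiszfeld iteration
# (illustrative helper, assumed here; not the Gmedian code)
weiszfeld <- function(x, w, iter = 100, eps = 1e-8) {
  m <- colSums(x * w) / sum(w)            # start from the weighted mean
  for (i in seq_len(iter)) {
    d <- sqrt(rowSums(sweep(x, 2, m)^2))  # distance of each row to the estimate
    d <- pmax(d, eps)                     # guard against division by zero
    m <- colSums(x * (w / d)) / sum(w / d)
  }
  m
}
weighted_median <- weiszfeld(emb, w)
```

Each result is a named numeric vector with one entry per embedding dimension. The geometric median is less sensitive to outlying rows than the mean, which may matter when a dictionary contains a few atypical tokens.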
Details
For w = "trillion_word" or w = "trillion_word_sif", tokens that do not appear in the word frequency list are treated as if they appeared as often as the least frequent word in the list. If w is a named vector, rows that do not match any items in the vector will be assigned the minimum value of that vector.
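The fallback for named weight vectors can be sketched in base R as follows. The token names, weights, and frequencies are made up, and the SIF formula a / (a + p) with a = 1e-3 is an assumption based on the common definition of smooth inverse frequency; the constant used internally by the package may differ.

```r
# Hypothetical row names of an embeddings object and a partial named weight vector
tokens <- c("happy", "joy", "smile", "enjoy")
w_named <- c(happy = 0.5, joy = 0.2, smile = 0.3)

# Look up each row by name; unmatched rows come back as NA
w_rows <- unname(w_named[tokens])
# Unmatched rows receive the minimum value of the supplied vector
w_rows[is.na(w_rows)] <- min(w_named)

# SIF weights from relative frequencies p, using a / (a + p)
# (a = 1e-3 is an assumed smoothing constant)
p <- c(happy = 1e-4, joy = 2e-5, smile = 5e-5)
a <- 1e-3
sif <- a / (a + p)
```

Here "enjoy" has no entry in w_named, so it is assigned 0.2, the smallest supplied weight. Under SIF weighting, rarer words receive larger weights, so a token treated as the least frequent word ends up with the largest weight.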
Value
A named numeric vector. If x is a list, the function will be called recursively and output a list of the same length.
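The list behaviour is roughly analogous to mapping an averaging function over each element. A minimal base R sketch, using toy matrices as stand-ins for embeddings objects and colMeans() in place of average_embedding():

```r
# Two toy embedding matrices standing in for a list of embeddings objects
emb_list <- list(
  a = matrix(1:4, nrow = 2,
             dimnames = list(c("x", "y"), c("dim_1", "dim_2"))),
  b = matrix(5:10, nrow = 3,
             dimnames = list(c("p", "q", "r"), c("dim_1", "dim_2")))
)

# One named numeric vector per list element; list length and names are kept
avg_list <- lapply(emb_list, colMeans)
```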
Examples
# Embed a small dictionary of happiness-related words
happy_dict <- c("happy", "joy", "smile", "enjoy")
happy_dict_embeddings <- predict(glove_twitter_25d, happy_dict)
# Unweighted average of the dictionary embeddings
happy_dict_vec <- average_embedding(happy_dict_embeddings)
# Average weighted by word frequency
happy_dict_vec_weighted <- average_embedding(happy_dict_embeddings, w = "trillion_word")
# Average each element of a list of embeddings
happy_dict_list <- find_nearest(glove_twitter_25d, happy_dict, each = TRUE)
happy_dict_vec_list <- average_embedding(happy_dict_list)