Skip to contents

An embeddings object is a numeric matrix with fast indexing by rownames (generally tokens).

Usage

embeddings(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)

as.embeddings(x, ...)

# Default S3 method
as.embeddings(x, ..., rowname_repair = TRUE, rebuild_token_index = TRUE)

# S3 method for class 'data.frame'
as.embeddings(x, id_col = NULL, ..., rowname_repair = TRUE)

is.embeddings(x, ...)

Arguments

data

an optional data vector (including a list or expression vector). Non-atomic classed R objects are coerced by as.vector and all attributes discarded.

nrow

the desired number of rows.

ncol

the desired number of columns.

byrow

logical. If FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.

dimnames

a dimnames attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively. An empty list is treated as NULL, and a list of length one as row names. The list can be named, and the list names will be used as names for the dimensions.

x

A data frame to be converted into embeddings.

...

Additional arguments passed to or from other methods.

rowname_repair

logical. If TRUE (the default), check that unique rownames are provided, and name rows "doc_1", "doc_2", etc. if not.

rebuild_token_index

logical. If TRUE, the hash table index will be rebuilt even when x is an embeddings object.

id_col

Optional name of a column to take row names from.

Details

Fast row indexing is implemented using hash tables in native R environments. The "token_index" attribute of an embeddings object stores the environment that maps rownames to their corresponding indices.

If dimnames is not supplied, embeddings will automatically name rows doc_1, doc_2, etc., and columns dim_1, dim_2, etc.

Examples

random_mat <- matrix(
  sample(1:10, 20, replace = TRUE),
  nrow = 2,
  dimnames = list(c("happy", "sad"))
  )
random_embeddings <- as.embeddings(random_mat)
is.embeddings(random_embeddings[,2:5])
#> [1] TRUE

tibble::as_tibble(random_embeddings, rownames = "token")
#> # A tibble: 2 × 11
#>   token dim_1 dim_2 dim_3 dim_4 dim_5 dim_6 dim_7 dim_8 dim_9 dim_10
#>   <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>  <int>
#> 1 happy     7     6     6     5     8     7     5     6     2      9
#> 2 sad       5     4     9     5     2     5     2     4     3      6