Loads pretrained word embeddings. If the specified model has already been downloaded, it is read from file with read_embeddings(). If not, the model is retrieved from online sources and, by default, saved.

Usage

load_embeddings(
  model,
  dir = NULL,
  words = NULL,
  save = TRUE,
  format = "original"
)

Arguments

model

the name of a supported model

dir

directory in which the model is stored or, if save = TRUE, should be saved. The default is the working directory, getwd(). The directory can be set more permanently with options(embeddings.model.path = dir).

words

optional list of words for which to retrieve embeddings.

save

logical. Should the model be saved to dir if it does not already exist there?

format

the format in which the model should be saved if it does not already exist there. "original" (the default) saves the file as is. Other options are "csv" and "rds".
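
For example, a session might set a persistent model directory once and then load models without repeating dir on every call. A minimal sketch (the path is illustrative):

# Set the model directory once per session (illustrative path)
options(embeddings.model.path = "~/embedding_models")

# Subsequent calls read from or save to that directory automatically
glove <- load_embeddings("glove.6B.50d")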

Details

The following are supported models for download. Note that some models are very large. If you know in advance which word embeddings you will need (e.g. the set of unique tokens in your corpus), consider specifying them with the words parameter to save memory and processing time.
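
For instance, restricting a very large model to a small corpus vocabulary might look like the following sketch (the token vector is hypothetical):

# Hypothetical set of unique tokens from a corpus
corpus_tokens <- c("word", "embedding", "vector", "semantics")

# Retrieve embeddings only for those tokens
glove <- load_embeddings("glove.840B.300d", words = corpus_tokens)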

GloVe

  • glove.42B.300d: Common Crawl (42B tokens, 1.9M vocab, uncased, 300d). Downloaded from https://huggingface.co/stanfordnlp/glove. This file is a zip archive and must temporarily be downloaded in its entirety even when words is specified.

  • glove.840B.300d: Common Crawl (840B tokens, 2.2M vocab, cased, 300d). Downloaded from https://huggingface.co/stanfordnlp/glove. This file is a zip archive and must temporarily be downloaded in its entirety even when words is specified.

  • glove.6B.50d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.6B.100d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 100d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.6B.200d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 200d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.6B.300d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.twitter.27B.25d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.twitter.27B.50d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 50d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.twitter.27B.100d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 100d). Downloaded from https://github.com/piskvorky/gensim-data

  • glove.twitter.27B.200d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d). Downloaded from https://github.com/piskvorky/gensim-data

word2vec

Note that reading word2vec bin files may be slower than other formats. If read time is a concern, consider setting format = "csv" or format = "rds".

  • GoogleNews.vectors.negative300: Trained with skip-gram on Google News (~100B tokens, 3M vocab, cased, 300d). Downloaded from https://github.com/piskvorky/gensim-data
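
As a sketch of the workaround above, saving the Google News model as rds on first download makes later reads faster:

# Convert the word2vec bin file to .rds when first saving it
w2v <- load_embeddings("GoogleNews.vectors.negative300", format = "rds")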

ConceptNet Numberbatch

Multilingual word embeddings trained using an ensemble that combines data from word2vec, GloVe, OpenSubtitles, and the ConceptNet common sense knowledge database. Tokens are prefixed with language codes. For example, the English word "token" is labeled "/c/en/token". Downloaded from https://github.com/commonsense/conceptnet-numberbatch

  • numberbatch.19.08: Multilingual (9.2M vocab, uncased, 300d)
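
Because tokens are language-prefixed, any words argument needs the same prefixes. A minimal sketch (the word choices are illustrative):

# Request entries by their prefixed labels: English "token", French "jeton"
nb <- load_embeddings(
  "numberbatch.19.08",
  words = c("/c/en/token", "/c/fr/jeton")
)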

fastText

300-dimensional word vectors for 157 languages, trained with CBOW on Common Crawl and Wikipedia; a short loading sketch follows the list below. Downloaded from https://fasttext.cc/docs/en/crawl-vectors.html

  • cc.af.300: Afrikaans

  • cc.sq.300: Albanian

  • cc.als.300: Alemannic

  • cc.am.300: Amharic

  • cc.ar.300: Arabic

  • cc.an.300: Aragonese

  • cc.hy.300: Armenian

  • cc.as.300: Assamese

  • cc.ast.300: Asturian

  • cc.az.300: Azerbaijani

  • cc.ba.300: Bashkir

  • cc.eu.300: Basque

  • cc.bar.300: Bavarian

  • cc.be.300: Belarusian

  • cc.bn.300: Bengali

  • cc.bh.300: Bihari

  • cc.bpy.300: Bishnupriya Manipuri

  • cc.bs.300: Bosnian

  • cc.br.300: Breton

  • cc.bg.300: Bulgarian

  • cc.my.300: Burmese

  • cc.ca.300: Catalan

  • cc.ceb.300: Cebuano

  • cc.bcl.300: Central Bicolano

  • cc.ce.300: Chechen

  • cc.zh.300: Chinese

  • cc.cv.300: Chuvash

  • cc.co.300: Corsican

  • cc.hr.300: Croatian

  • cc.cs.300: Czech

  • cc.da.300: Danish

  • cc.dv.300: Divehi

  • cc.nl.300: Dutch

  • cc.pa.300: Eastern Punjabi

  • cc.arz.300: Egyptian Arabic

  • cc.eml.300: Emilian-Romagnol

  • cc.en.300: English

  • cc.myv.300: Erzya

  • cc.eo.300: Esperanto

  • cc.et.300: Estonian

  • cc.hif.300: Fiji Hindi

  • cc.fi.300: Finnish

  • cc.fr.300: French

  • cc.gl.300: Galician

  • cc.ka.300: Georgian

  • cc.de.300: German

  • cc.gom.300: Goan Konkani

  • cc.el.300: Greek

  • cc.gu.300: Gujarati

  • cc.ht.300: Haitian

  • cc.he.300: Hebrew

  • cc.mrj.300: Hill Mari

  • cc.hi.300: Hindi

  • cc.hu.300: Hungarian

  • cc.is.300: Icelandic

  • cc.io.300: Ido

  • cc.ilo.300: Ilokano

  • cc.id.300: Indonesian

  • cc.ia.300: Interlingua

  • cc.ga.300: Irish

  • cc.it.300: Italian

  • cc.ja.300: Japanese

  • cc.jv.300: Javanese

  • cc.kn.300: Kannada

  • cc.pam.300: Kapampangan

  • cc.kk.300: Kazakh

  • cc.km.300: Khmer

  • cc.ky.300: Kirghiz

  • cc.ko.300: Korean

  • cc.ku.300: Kurdish (Kurmanji)

  • cc.ckb.300: Kurdish (Sorani)

  • cc.la.300: Latin

  • cc.lv.300: Latvian

  • cc.li.300: Limburgish

  • cc.lt.300: Lithuanian

  • cc.lmo.300: Lombard

  • cc.nds.300: Low Saxon

  • cc.lb.300: Luxembourgish

  • cc.mk.300: Macedonian

  • cc.mai.300: Maithili

  • cc.mg.300: Malagasy

  • cc.ms.300: Malay

  • cc.ml.300: Malayalam

  • cc.mt.300: Maltese

  • cc.gv.300: Manx

  • cc.mr.300: Marathi

  • cc.mzn.300: Mazandarani

  • cc.mhr.300: Meadow Mari

  • cc.min.300: Minangkabau

  • cc.xmf.300: Mingrelian

  • cc.mwl.300: Mirandese

  • cc.mn.300: Mongolian

  • cc.nah.300: Nahuatl

  • cc.nap.300: Neapolitan

  • cc.ne.300: Nepali

  • cc.new.300: Newar

  • cc.frr.300: North Frisian

  • cc.nso.300: Northern Sotho

  • cc.no.300: Norwegian (Bokmål)

  • cc.nn.300: Norwegian (Nynorsk)

  • cc.oc.300: Occitan

  • cc.or.300: Oriya

  • cc.os.300: Ossetian

  • cc.pfl.300: Palatinate German

  • cc.ps.300: Pashto

  • cc.fa.300: Persian

  • cc.pms.300: Piedmontese

  • cc.pl.300: Polish

  • cc.pt.300: Portuguese

  • cc.qu.300: Quechua

  • cc.ro.300: Romanian

  • cc.rm.300: Romansh

  • cc.ru.300: Russian

  • cc.sah.300: Sakha

  • cc.sa.300: Sanskrit

  • cc.sc.300: Sardinian

  • cc.sco.300: Scots

  • cc.gd.300: Scottish Gaelic

  • cc.sr.300: Serbian

  • cc.sh.300: Serbo-Croatian

  • cc.scn.300: Sicilian

  • cc.sd.300: Sindhi

  • cc.si.300: Sinhalese

  • cc.sk.300: Slovak

  • cc.sl.300: Slovenian

  • cc.so.300: Somali

  • cc.azb.300: Southern Azerbaijani

  • cc.es.300: Spanish

  • cc.su.300: Sundanese

  • cc.sw.300: Swahili

  • cc.sv.300: Swedish

  • cc.tl.300: Tagalog

  • cc.tg.300: Tajik

  • cc.ta.300: Tamil

  • cc.tt.300: Tatar

  • cc.te.300: Telugu

  • cc.th.300: Thai

  • cc.bo.300: Tibetan

  • cc.tr.300: Turkish

  • cc.tk.300: Turkmen

  • cc.uk.300: Ukrainian

  • cc.hsb.300: Upper Sorbian

  • cc.ur.300: Urdu

  • cc.ug.300: Uyghur

  • cc.uz.300: Uzbek

  • cc.vec.300: Venetian

  • cc.vi.300: Vietnamese

  • cc.vo.300: Volapük

  • cc.wa.300: Walloon

  • cc.war.300: Waray

  • cc.cy.300: Welsh

  • cc.vls.300: West Flemish

  • cc.fy.300: West Frisian

  • cc.pnb.300: Western Punjabi

  • cc.yi.300: Yiddish

  • cc.yo.300: Yoruba

  • cc.diq.300: Zazaki

  • cc.zea.300: Zeelandic
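
As noted above, loading one of these language models follows the same pattern as the other models. A sketch (the model and words are illustrative):

# French vectors, restricted to two words of interest
fr <- load_embeddings("cc.fr.300", words = c("bonjour", "merci"))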

Value

An embeddings object (a numeric matrix with tokens as rownames).
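
Since the result is an ordinary numeric matrix, base R operations apply directly. A sketch, assuming the download succeeds and both words are in the vocabulary:

glove <- load_embeddings("glove.6B.50d", words = c("king", "queen"))

glove["king", 1:5]                       # first five dimensions of one embedding
cor(glove["king", ], glove["queen", ])   # correlation as a rough similarity measure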

References

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching Word Vectors with Subword Information. arXiv preprint. https://arxiv.org/abs/1607.04606

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR. https://arxiv.org/pdf/1301.3781

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of EMNLP 2014. https://nlp.stanford.edu/projects/glove/

Speer, R., Chin, J., and Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of AAAI 2017. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972