Load Pretrained GloVe, word2vec, and fastText Embeddings
Loads pretrained word embeddings. If the specified model has already been downloaded, it is read from file with read_embeddings(). If not, the model is retrieved from online sources and, by default, saved.
Arguments
- model: the name of a supported model.
- dir: directory in which the model is or should be saved when save = TRUE. The default is the working directory, getwd(). The directory can be set more permanently using options(embeddings.model.path = dir).
- words: optional list of words for which to retrieve embeddings.
- save: logical. Should the model be saved to dir if it does not already exist there?
- format: the format in which the model should be saved if it does not already exist. "original" (the default) saves the file as is. Other options are "csv" or "rds".
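For example, a typical session might set the model directory once and then load a model by name. This is a minimal sketch; the directory path is illustrative.

# Set a persistent model directory for this session (path is illustrative)
options(embeddings.model.path = "~/embedding_models")

# Downloads on first use; afterwards the model is read from the saved file
glove <- load_embeddings("glove.6B.50d")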
Details
The following models are supported for download. Note that some models are very large. If you know in advance which word embeddings you will need (e.g. the set of unique tokens in your corpus), consider specifying this with the words parameter to save memory and processing time.
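For instance, restricting a large model to a known vocabulary might look like the sketch below; the token set is illustrative.

# Hypothetical vocabulary from your own preprocessing
corpus_tokens <- c("word", "embedding", "vector", "semantics")

# Retrieve embeddings only for those tokens to save memory and time
glove <- load_embeddings("glove.6B.300d", words = corpus_tokens)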
GloVe
- glove.42B.300d: Common Crawl (42B tokens, 1.9M vocab, uncased, 300d). Downloaded from https://huggingface.co/stanfordnlp/glove. This file is a zip archive and must temporarily be downloaded in its entirety even when words is specified.
- glove.840B.300d: Common Crawl (840B tokens, 2.2M vocab, cased, 300d). Downloaded from https://huggingface.co/stanfordnlp/glove. This file is a zip archive and must temporarily be downloaded in its entirety even when words is specified.
- glove.6B.50d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 50d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.6B.100d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 100d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.6B.200d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 200d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.6B.300d: Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab, uncased, 300d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.twitter.27B.25d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.twitter.27B.50d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 50d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.twitter.27B.100d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 100d). Downloaded from https://github.com/piskvorky/gensim-data
- glove.twitter.27B.200d: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 200d). Downloaded from https://github.com/piskvorky/gensim-data
word2vec
Note that reading word2vec bin files may be slower than reading other formats. If read time is a concern, consider setting format = "csv" or format = "rds".
- GoogleNews.vectors.negative300: Trained with skip-gram on Google News (~100B tokens, 3M vocab, cased, 300d). Downloaded from https://github.com/piskvorky/gensim-data
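A first download can be cached in a faster-loading format, as in this sketch:

# Save as .rds on first download so later calls avoid slow bin-file parsing
w2v <- load_embeddings("GoogleNews.vectors.negative300", format = "rds")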
ConceptNet Numberbatch
Multilingual word embeddings trained using an ensemble that combines data from word2vec, GloVe, OpenSubtitles, and the ConceptNet common sense knowledge database. Tokens are prefixed with language codes. For example, the English word "token" is labeled "/c/en/token". Downloaded from https://github.com/commonsense/conceptnet-numberbatch
- numberbatch.19.08: Multilingual (9.2M vocab, uncased, 300d)
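Because tokens carry language prefixes, a words argument should presumably use the prefixed form, as in this illustrative sketch:

# Retrieve the English and German entries; tokens include the "/c/<lang>/" prefix
nb <- load_embeddings("numberbatch.19.08",
                      words = c("/c/en/token", "/c/de/wort"))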
fastText
300-dimensional word vectors for 157 languages, trained with CBOW on Common Crawl and Wikipedia. Downloaded from https://fasttext.cc/docs/en/crawl-vectors.html. See the usage sketch after the language list below.
- cc.af.300: Afrikaans
- cc.sq.300: Albanian
- cc.als.300: Alemannic
- cc.am.300: Amharic
- cc.ar.300: Arabic
- cc.an.300: Aragonese
- cc.hy.300: Armenian
- cc.as.300: Assamese
- cc.ast.300: Asturian
- cc.az.300: Azerbaijani
- cc.ba.300: Bashkir
- cc.eu.300: Basque
- cc.bar.300: Bavarian
- cc.be.300: Belarusian
- cc.bn.300: Bengali
- cc.bh.300: Bihari
- cc.bpy.300: Bishnupriya Manipuri
- cc.bs.300: Bosnian
- cc.br.300: Breton
- cc.bg.300: Bulgarian
- cc.my.300: Burmese
- cc.ca.300: Catalan
- cc.ceb.300: Cebuano
- cc.bcl.300: Central Bicolano
- cc.ce.300: Chechen
- cc.zh.300: Chinese
- cc.cv.300: Chuvash
- cc.co.300: Corsican
- cc.hr.300: Croatian
- cc.cs.300: Czech
- cc.da.300: Danish
- cc.dv.300: Divehi
- cc.nl.300: Dutch
- cc.pa.300: Eastern Punjabi
- cc.arz.300: Egyptian Arabic
- cc.eml.300: Emilian-Romagnol
- cc.en.300: English
- cc.myv.300: Erzya
- cc.eo.300: Esperanto
- cc.et.300: Estonian
- cc.hif.300: Fiji Hindi
- cc.fi.300: Finnish
- cc.fr.300: French
- cc.gl.300: Galician
- cc.ka.300: Georgian
- cc.de.300: German
- cc.gom.300: Goan Konkani
- cc.el.300: Greek
- cc.gu.300: Gujarati
- cc.ht.300: Haitian
- cc.he.300: Hebrew
- cc.mrj.300: Hill Mari
- cc.hi.300: Hindi
- cc.hu.300: Hungarian
- cc.is.300: Icelandic
- cc.io.300: Ido
- cc.ilo.300: Ilokano
- cc.id.300: Indonesian
- cc.ia.300: Interlingua
- cc.ga.300: Irish
- cc.it.300: Italian
- cc.ja.300: Japanese
- cc.jv.300: Javanese
- cc.kn.300: Kannada
- cc.pam.300: Kapampangan
- cc.kk.300: Kazakh
- cc.km.300: Khmer
- cc.ky.300: Kirghiz
- cc.ko.300: Korean
- cc.ku.300: Kurdish (Kurmanji)
- cc.ckb.300: Kurdish (Sorani)
- cc.la.300: Latin
- cc.lv.300: Latvian
- cc.li.300: Limburgish
- cc.lt.300: Lithuanian
- cc.lmo.300: Lombard
- cc.nds.300: Low Saxon
- cc.lb.300: Luxembourgish
- cc.mk.300: Macedonian
- cc.mai.300: Maithili
- cc.mg.300: Malagasy
- cc.ms.300: Malay
- cc.ml.300: Malayalam
- cc.mt.300: Maltese
- cc.gv.300: Manx
- cc.mr.300: Marathi
- cc.mzn.300: Mazandarani
- cc.mhr.300: Meadow Mari
- cc.min.300: Minangkabau
- cc.xmf.300: Mingrelian
- cc.mwl.300: Mirandese
- cc.mn.300: Mongolian
- cc.nah.300: Nahuatl
- cc.nap.300: Neapolitan
- cc.ne.300: Nepali
- cc.new.300: Newar
- cc.frr.300: North Frisian
- cc.nso.300: Northern Sotho
- cc.no.300: Norwegian (Bokmål)
- cc.nn.300: Norwegian (Nynorsk)
- cc.oc.300: Occitan
- cc.or.300: Oriya
- cc.os.300: Ossetian
- cc.pfl.300: Palatinate German
- cc.ps.300: Pashto
- cc.fa.300: Persian
- cc.pms.300: Piedmontese
- cc.pl.300: Polish
- cc.pt.300: Portuguese
- cc.qu.300: Quechua
- cc.ro.300: Romanian
- cc.rm.300: Romansh
- cc.ru.300: Russian
- cc.sah.300: Sakha
- cc.sa.300: Sanskrit
- cc.sc.300: Sardinian
- cc.sco.300: Scots
- cc.gd.300: Scottish Gaelic
- cc.sr.300: Serbian
- cc.sh.300: Serbo-Croatian
- cc.scn.300: Sicilian
- cc.sd.300: Sindhi
- cc.si.300: Sinhalese
- cc.sk.300: Slovak
- cc.sl.300: Slovenian
- cc.so.300: Somali
- cc.azb.300: Southern Azerbaijani
- cc.es.300: Spanish
- cc.su.300: Sundanese
- cc.sw.300: Swahili
- cc.sv.300: Swedish
- cc.tl.300: Tagalog
- cc.tg.300: Tajik
- cc.ta.300: Tamil
- cc.tt.300: Tatar
- cc.te.300: Telugu
- cc.th.300: Thai
- cc.bo.300: Tibetan
- cc.tr.300: Turkish
- cc.tk.300: Turkmen
- cc.uk.300: Ukrainian
- cc.hsb.300: Upper Sorbian
- cc.ur.300: Urdu
- cc.ug.300: Uyghur
- cc.uz.300: Uzbek
- cc.vec.300: Venetian
- cc.vi.300: Vietnamese
- cc.vo.300: Volapük
- cc.wa.300: Walloon
- cc.war.300: Waray
- cc.cy.300: Welsh
- cc.vls.300: West Flemish
- cc.fy.300: West Frisian
- cc.pnb.300: Western Punjabi
- cc.yi.300: Yiddish
- cc.yo.300: Yoruba
- cc.diq.300: Zazaki
- cc.zea.300: Zeelandic
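For example, one language's vectors can be loaded for a small vocabulary; the model choice and tokens here are illustrative.

# German fastText vectors for a few tokens (illustrative)
de_vecs <- load_embeddings("cc.de.300", words = c("Haus", "Baum", "Wasser"))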
References
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching Word Vectors with Subword Information. arXiv preprint. https://arxiv.org/abs/1607.04606
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR. https://arxiv.org/pdf/1301.3781
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/
Speer, R., Chin, J., and Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In Proceedings of AAAI 2017. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972