downloader – Downloader API for gensim

This module provides an API for downloading datasets/models, getting information about them, and loading them.

See RaRe-Technologies/gensim-data repo for more information about models/datasets/how-to-add-new/etc.

Get information about available models/datasets:

>>> import gensim.downloader as api
>>>
>>> api.info()  # return dict with info about available models/datasets
>>> api.info("text8")  # return dict with info about "text8" dataset

Model example:

>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25")  # load glove vectors
>>> model.most_similar("cat")  # show words that are similar to the word 'cat'

Dataset example:

>>> import gensim.downloader as api
>>> from gensim.models import Word2Vec
>>>
>>> dataset = api.load("text8")  # load dataset as iterable
>>> model = Word2Vec(dataset)  # train w2v model

This API is also available via the CLI:

python -m gensim.downloader --info <dataname> # same as api.info(dataname)
python -m gensim.downloader --info name # same as api.info(name_only=True)
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)

You may specify the local subdirectory for saving gensim data using the GENSIM_DATA_DIR environment variable. For example:

$ export GENSIM_DATA_DIR=/tmp/gensim-data
$ python -m gensim.downloader --download <dataname>

By default, this subdirectory is ~/gensim-data.
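The lookup described above can be sketched in plain Python. This is an illustrative approximation, not gensim's actual source: the helper name resolve_data_dir and its environ parameter are hypothetical, and only the GENSIM_DATA_DIR variable and the ~/gensim-data default come from the text.

```python
import os


def resolve_data_dir(environ=None):
    """Return the directory where gensim data would be stored.

    Hypothetical helper: it mirrors the documented behaviour (use
    GENSIM_DATA_DIR if set, otherwise ~/gensim-data) but is not part of gensim.
    """
    if environ is None:
        environ = os.environ
    default_dir = os.path.expanduser("~/gensim-data")
    return environ.get("GENSIM_DATA_DIR", default_dir)


# An explicit mapping is passed so the example does not depend on the real environment.
print(resolve_data_dir({"GENSIM_DATA_DIR": "/tmp/gensim-data"}))  # /tmp/gensim-data
print(resolve_data_dir({}))  # the expanded form of ~/gensim-data
```

If the variable is set from within Python rather than the shell, it is probably safest to set it before importing gensim.downloader, on the assumption that the location is read once at import time.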

gensim.downloader.BASE_DIR = '/Users/vaclavdvorak/gensim-data'

The default location to store downloaded data.

You may override this with the GENSIM_DATA_DIR environment variable.

gensim.downloader.info(name=None, show_only_latest=True, name_only=False)

Provide the information related to model/dataset.

Parameters
  • name (str, optional) – Name of the model/dataset. If not set, information about all available data is returned.

  • show_only_latest (bool, optional) – If the storage contains several versions of the same dataset/model, this flag hides the outdated versions. Has an effect only if name is None.

  • name_only (bool, optional) – If True, will return only the names of available models and corpora.

Returns

Detailed information about one or all models/datasets. If name is specified, return full information about that specific dataset/model; otherwise, return information about all available datasets/models.

Return type

dict

Raises

Exception – If the name that has been passed is incorrect.

Examples

>>> import gensim.downloader as api
>>> api.info("text8")  # retrieve information about text8 dataset
{u'checksum': u'68799af40b6bda07dfa47a32612e5364',
 u'description': u'Cleaned small sample from wikipedia',
 u'file_name': u'text8.gz',
 u'parts': 1,
 u'source': u'http://mattmahoney.net/dc/text8.zip'}
>>>
>>> api.info()  # retrieve information about all available datasets and models
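To experiment with the returned dict offline, the lookup can be sketched against a hand-written stand-in. The sketch assumes that the full catalog groups its entries under "corpora" and "models" keys; the SAMPLE_CATALOG dict and the describe helper are hypothetical, and the text8 values are copied from the output above.

```python
# Hypothetical stand-in for the dict returned by api.info(); the two
# top-level keys are an assumption about the catalog layout.
SAMPLE_CATALOG = {
    "corpora": {
        "text8": {
            "description": "Cleaned small sample from wikipedia",
            "file_name": "text8.gz",
            "parts": 1,
        },
    },
    "models": {},
}


def describe(catalog, name):
    """Return the entry for `name`, searching corpora first, then models."""
    for section in ("corpora", "models"):
        entries = catalog.get(section, {})
        if name in entries:
            return entries[name]
    # Raised here to mirror the documented behaviour for an incorrect name.
    raise ValueError("Incorrect model/dataset name: %r" % name)


print(describe(SAMPLE_CATALOG, "text8")["file_name"])  # text8.gz
```

With gensim installed and network access available, the real catalog from api.info() could be passed in place of SAMPLE_CATALOG, provided the layout assumption holds.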

gensim.downloader.load(name, return_path=False)

Download (if needed) a dataset/model and load it into memory (unless return_path is set).

Parameters
  • name (str) – Name of the model/dataset.

  • return_path (bool, optional) – If True, return the full path to the file; otherwise, return the loaded model / iterable dataset.

Returns

  • Model – Requested model, if name refers to a model and return_path == False.

  • Dataset (iterable) – Requested dataset, if name refers to a dataset and return_path == False.

  • str – Path to the file with the dataset / model, only when return_path == True.

Raises

Exception – Raised if name is incorrect.

Examples

Model example:

>>> import gensim.downloader as api
>>>
>>> model = api.load("glove-twitter-25")  # load glove vectors
>>> model.most_similar("cat")  # show words that are similar to the word 'cat'

Dataset example:

>>> import gensim.downloader as api
>>>
>>> wiki = api.load("wiki-en")  # load extracted Wikipedia dump, around 6 GB
>>> for article in wiki:  # iterate over all Wikipedia articles
...     pass

Download only example:

>>> import gensim.downloader as api
>>>
>>> print(api.load("wiki-en", return_path=True))  # output: /home/user/gensim-data/wiki-en/wiki-en.gz
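Because load raises an exception for an incorrect name, calling code may want to guard the call. The following is a minimal sketch of one way to do that, not part of gensim: load_or_none and fake_loader are hypothetical names, and fake_loader, with its made-up path, only stands in for gensim.downloader.load so that the example runs without gensim or network access.

```python
def load_or_none(name, loader, return_path=False):
    """Call loader(name, return_path=...) and return None if it raises."""
    try:
        return loader(name, return_path=return_path)
    except Exception as err:  # the documentation only promises a generic Exception
        print("Could not load %r: %s" % (name, err))
        return None


def fake_loader(name, return_path=False):
    """Offline stand-in for gensim.downloader.load (hypothetical)."""
    if name != "text8":
        raise Exception("Incorrect model/dataset name")
    if return_path:
        return "/tmp/gensim-data/text8/text8.gz"  # made-up path for illustration
    return [["stand-in", "dataset"]]


print(load_or_none("text8", fake_loader, return_path=True))
print(load_or_none("no-such-name", fake_loader))  # reports the failure, then prints None
```

With gensim installed, api.load can be passed as the loader argument in place of fake_loader.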