models._fasttext_bin – Facebook’s fastText I/O

Load models from the native binary format released by Facebook.

The main entry point is the load() function. It returns a Model namedtuple containing everything loaded from the binary.

Examples

Load a model from a binary file:

>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See also

FB Implementation.

class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, lr_update_rate, maxn, min_count, minn, model, neg, ntokens, nwords, raw_vocab, t, vectors_ngrams, vocab_size, word_ngrams, ws)

Bases: tuple

Holds data loaded from the Facebook binary.

Parameters
  • dim (int) – The dimensionality of the vectors.

  • ws (int) – The window size.

  • epoch (int) – The number of training epochs.

  • neg (int) – If non-zero, indicates that the model uses negative sampling.

  • loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.

  • model (int) – If equal to 2, indicates that the model uses skip-grams.

  • bucket (int) – The number of buckets.

  • min_count (int) – The threshold below which the model ignores terms.

  • t (float) – The sample threshold.

  • minn (int) – The minimum ngram length.

  • maxn (int) – The maximum ngram length.

  • raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.

  • nwords (int) – The number of words.

  • vocab_size (int) – The size of the vocabulary.

  • vectors_ngrams (numpy.array) – A matrix of the vectors learned by the model, one vector per row. The number of rows equals the number of words plus the number of buckets. The number of columns equals the vector dimensionality.

  • hidden_output (numpy.array) – A matrix that contains the shallow neural network output. This array has the same dimensions as vectors_ngrams. May be None, in which case it is impossible to continue training the model.
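The integer codes and the shape rule described above can be read off a loaded model with a few comparisons. The following is a minimal sketch, not part of gensim: `Header` is a hypothetical stand-in that carries only six of the fields, `summarize` is an illustrative helper, and the values for model, loss and neg are made up.

```python
from collections import namedtuple

# Hypothetical stand-in with only the fields used below; the real Model has more.
Header = namedtuple("Header", ["model", "loss", "neg", "nwords", "bucket", "dim"])


def summarize(m):
    """Translate the integer codes described above into readable settings."""
    return {
        "sg": m.model == 2,        # model == 2: skip-gram
        "hs": m.loss == 1,         # loss == 1: hierarchical softmax
        "negative": m.neg != 0,    # non-zero neg: negative sampling
        # One row per word plus one row per bucket, one column per dimension.
        "expected_ngram_shape": (m.nwords + m.bucket, m.dim),
    }


# 291 words and a (391, 5) matrix, as in the example at the top of this page,
# imply 100 buckets. The model, loss and neg codes here are illustrative only.
header = Header(model=2, loss=1, neg=0, nwords=291, bucket=100, dim=5)
summary = summarize(header)
print(summary["expected_ngram_shape"])  # (391, 5)
```

Because the helper only reads attributes, it should work unchanged on a real Model returned by load(), where `summary["expected_ngram_shape"]` can be compared against `model.vectors_ngrams.shape`.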

__getitem__()

Return self[key].

bucket

Alias for field number 0

count()

Return number of occurrences of value.

dim

Alias for field number 1

epoch

Alias for field number 2

hidden_output

Alias for field number 3

index()

Return first index of value.

Raises ValueError if the value is not present.

loss

Alias for field number 4

lr_update_rate

Alias for field number 5

maxn

Alias for field number 6

min_count

Alias for field number 7

minn

Alias for field number 8

model

Alias for field number 9

neg

Alias for field number 10

ntokens

Alias for field number 11

nwords

Alias for field number 12

raw_vocab

Alias for field number 13

t

Alias for field number 14

vectors_ngrams

Alias for field number 15

vocab_size

Alias for field number 16

word_ngrams

Alias for field number 17

ws

Alias for field number 18
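Since Model is a namedtuple, every field listed above can be reached by name or by its field number. The sketch below illustrates this with `StandIn`, a hypothetical namedtuple that copies the field order from the class signature. It is not the real class, and each field simply stores its own position.

```python
from collections import namedtuple

# Hypothetical stand-in that copies the field order of Model's signature.
FIELDS = (
    "bucket dim epoch hidden_output loss lr_update_rate maxn min_count minn "
    "model neg ntokens nwords raw_vocab t vectors_ngrams vocab_size "
    "word_ngrams ws"
)
StandIn = namedtuple("StandIn", FIELDS)

# Store each field's own position as its value so the aliasing is visible.
m = StandIn(*range(len(StandIn._fields)))

print(m.bucket == m[0])                  # True: bucket is field number 0
print(m.ws == m[18])                     # True: ws is field number 18
print(StandIn._fields.index("nwords"))   # 12

# Tuples are immutable; _replace returns a modified copy instead.
smaller = m._replace(hidden_output=None)
print(smaller.hidden_output is None, m.hidden_output)  # True 3
```

Access by name is usually the safer choice, because it keeps working if the field order ever changes.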

gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)

Load a model from a binary stream.

Parameters
  • fin (file) – The readable binary stream.

  • encoding (str, optional) – The encoding to use for decoding text.

  • full_model (boolean, optional) – If False, skips loading the hidden output matrix. This saves a fair bit of CPU time and RAM, but prevents training continuation.

Returns

The loaded model.

Return type

Model
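When the model is only needed for lookups, full_model=False avoids reading the hidden output matrix. The helper below is a minimal sketch of that use. `load_for_inference` is a hypothetical name, and the sketch assumes that the resulting Model carries hidden_output set to None.

```python
def load_for_inference(path, encoding="utf-8"):
    """Load a .bin file without the hidden output matrix.

    The result is assumed to have hidden_output set to None, so it cannot
    be used to continue training.
    """
    # Imported lazily so this helper can be defined without gensim installed.
    from gensim.models._fasttext_bin import load

    with open(path, "rb") as fin:  # load() expects a readable binary stream
        return load(fin, encoding=encoding, full_model=False)
```

Calling `load_for_inference(datapath('crime-and-punishment.bin'))` with the test file used in the example at the top of this page would be one way to try it out.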

gensim.models._fasttext_bin.save(model, fout, fb_fasttext_parameters, encoding)

Save word embeddings to Facebook’s native fastText .bin format.

Parameters
  • fout (file name or writeable binary stream) – The stream to which the model is saved.

  • model (gensim.models.fasttext.FastText) – The model to save.

  • fb_fasttext_parameters (dictionary) – A dictionary containing the parameters lr_update_rate and word_ngrams. The gensim implementation does not use them, so they have to be provided externally.

  • encoding (str) – The encoding used in the output file.

Notes

Unfortunately, there is no documentation of Facebook’s native fastText .bin format.

This is just a reimplementation of [FastText::saveModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc).

It is based on v0.9.1, more precisely commit da2745fcccb848c7a225a7d558218ee4c64d5333.

Code follows the original C++ code naming.
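Because lr_update_rate and word_ngrams are not tracked by the gensim model, the caller has to assemble them before calling save(). The following is a minimal sketch. `make_fb_parameters` and `save_bin` are hypothetical helpers, and the default values 100 and 1 are an assumption chosen to mirror fastText's own defaults.

```python
def make_fb_parameters(lr_update_rate=100, word_ngrams=1):
    """Build the dictionary of parameters that gensim does not store itself."""
    # Defaults are assumptions mirroring fastText's defaults, not gensim values.
    return {"lr_update_rate": lr_update_rate, "word_ngrams": word_ngrams}


def save_bin(model, path, lr_update_rate=100, word_ngrams=1, encoding="utf-8"):
    """Write a gensim FastText model to path in the .bin format."""
    # Imported lazily so these helpers can be defined without gensim installed.
    from gensim.models._fasttext_bin import save

    params = make_fb_parameters(lr_update_rate, word_ngrams)
    with open(path, "wb") as fout:  # save() accepts a writeable binary stream
        save(model, fout, params, encoding)


print(make_fb_parameters())  # {'lr_update_rate': 100, 'word_ngrams': 1}
```

If the model was originally loaded from a Facebook binary, passing the lr_update_rate and word_ngrams values read from that file keeps the header consistent with the original.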