`models._fasttext_bin` – Facebook’s fastText I/O

Load models from the native binary format released by Facebook.

The main entry point is the load() function. It returns a Model namedtuple containing everything loaded from the binary.

Examples

Load a model from a binary file:

>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']

See also

FB Implementation.

class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, lr_update_rate, maxn, min_count, minn, model, neg, ntokens, nwords, raw_vocab, t, vectors_ngrams, vocab_size, word_ngrams, ws)

Bases: tuple

Holds data loaded from the Facebook binary.

Parameters

dim (int) – The dimensionality of the vectors.
ws (int) – The window size.
epoch (int) – The number of training epochs.
neg (int) – If non-zero, indicates that the model uses negative sampling.
loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.
model (int) – If equal to 2, indicates that the model uses skip-grams.
bucket (int) – The number of buckets.
min_count (int) – The threshold below which the model ignores terms.
t (float) – The sample threshold.
minn (int) – The minimum ngram length.
maxn (int) – The maximum ngram length.
raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.
nwords (int) – The number of words.
vocab_size (int) – The size of the vocabulary.
vectors_ngrams (numpy.array) – A matrix that contains the vectors learned by the model. Each row is one vector. The number of rows equals the number of words plus the number of buckets. The number of columns equals the vector dimensionality.
hidden_output (numpy.array) – A matrix that contains the output of the shallow neural network. It has the same dimensions as vectors_ngrams. May be None, in which case it is impossible to continue training the model.
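
The integer codes above can be turned into readable settings with a few comparisons. The following is a minimal sketch and not part of gensim: `Config` is a hypothetical stand-in that carries only the fields used here, and `describe` and `expected_ngram_shape` are illustrative helper names. The value bucket=100 is inferred from the example output above (391 rows minus 291 words), not read from the file.

```python
from collections import namedtuple

# Hypothetical stand-in for Model with only the fields this sketch reads.
Config = namedtuple("Config", "model loss neg nwords bucket dim")


def describe(m):
    """Map the integer codes to labels, following the field descriptions."""
    architecture = "skip-gram" if m.model == 2 else "not skip-gram"
    if m.loss == 1:
        objective = "hierarchical softmax"
    elif m.neg != 0:
        objective = "negative sampling"
    else:
        objective = "other"
    return architecture, objective


def expected_ngram_shape(m):
    """One row per word plus one per bucket; one column per dimension."""
    return (m.nwords + m.bucket, m.dim)


# Values chosen to match the example at the top; bucket is inferred.
cfg = Config(model=2, loss=2, neg=5, nwords=291, bucket=100, dim=5)
print(describe(cfg))
print(expected_ngram_shape(cfg))
```

Comparing `expected_ngram_shape(model)` with `model.vectors_ngrams.shape` on a real loaded model is a cheap sanity check that the file was read as intended.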

__getitem__(): Return self[key].

bucket: Alias for field number 0

count(): Return number of occurrences of value.

dim: Alias for field number 1

epoch: Alias for field number 2

hidden_output: Alias for field number 3

index(): Return first index of value. Raises ValueError if the value is not present.

loss: Alias for field number 4

lr_update_rate: Alias for field number 5

maxn: Alias for field number 6

min_count: Alias for field number 7

minn: Alias for field number 8

model: Alias for field number 9

neg: Alias for field number 10

ntokens: Alias for field number 11

nwords: Alias for field number 12

raw_vocab: Alias for field number 13

t: Alias for field number 14

vectors_ngrams: Alias for field number 15

vocab_size: Alias for field number 16

word_ngrams: Alias for field number 17

ws: Alias for field number 18
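
Because Model is a namedtuple, each field can be read either by name or by the field number listed above. A minimal sketch, using a stand-in tuple built from the field names in the class signature; `ModelStandIn` and its placeholder values are assumptions for illustration, and a real instance comes from load():

```python
from collections import namedtuple

# Stand-in built from the field names in the class signature.
FIELDS = (
    "bucket dim epoch hidden_output loss lr_update_rate maxn min_count minn "
    "model neg ntokens nwords raw_vocab t vectors_ngrams vocab_size "
    "word_ngrams ws"
)
ModelStandIn = namedtuple("ModelStandIn", FIELDS)

# Placeholder values: each field simply holds its own position.
m = ModelStandIn(*range(len(ModelStandIn._fields)))

# Access by name and by position return the same value.
same = m.dim == m[1] and m.ws == m[18]

# _asdict() gives a name -> value mapping, handy for logging the settings.
settings = m._asdict()
print(same, list(settings)[:3])
```

Access by name is usually the safer choice, since the field numbers follow the alphabetical order of the names and would shift if a field were added.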

gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)

Load a model from a binary stream.

Parameters

fin (file) – The readable binary stream.
encoding (str, optional) – The encoding to use for decoding text.
full_model (bool, optional) – If False, skips loading the hidden output matrix. This saves CPU time and RAM, but the loaded model cannot be used to continue training.

Returns

The loaded model.

Return type

Model
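
When only the vectors are needed, the hidden output matrix can be skipped. The sketch below wraps that call; `load_vectors_only` and `can_continue_training` are hypothetical helper names, and the check assumes that load() leaves hidden_output as None when full_model=False, as the field description suggests.

```python
def load_vectors_only(path, encoding="utf-8"):
    """Load a Facebook .bin file without the hidden output matrix."""
    # Imported inside the function so the helpers can be defined
    # even where gensim is not installed.
    from gensim.models._fasttext_bin import load

    with open(path, "rb") as fin:
        return load(fin, encoding=encoding, full_model=False)


def can_continue_training(model):
    """Training continuation needs the hidden output matrix."""
    return model.hidden_output is not None
```

With the test file used in the example at the top, `can_continue_training(load_vectors_only(datapath('crime-and-punishment.bin')))` would be expected to return False under that assumption.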

gensim.models._fasttext_bin.save(model, fout, fb_fasttext_parameters, encoding)

Save word embeddings to Facebook’s native fastText .bin format.

Parameters

model (gensim.models.fasttext.FastText) – The model to save.
fout (file name or writeable binary stream) – The destination to which the model is saved.
fb_fasttext_parameters (dict) – Parameters that the gensim implementation does not use and that must therefore be provided externally: lr_update_rate and word_ngrams.
encoding (str) – The encoding used in the output file.

Notes

There is no official documentation of Facebook’s native fastText .bin format.

This is a reimplementation of [FastText::saveModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc).

It is based on v0.9.1, more precisely commit da2745fcccb848c7a225a7d558218ee4c64d5333.

The code follows the naming of the original C++ code.
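
A call to save() needs the two externally supplied parameters packed into a dictionary. The following is a sketch under stated assumptions: `make_fb_parameters` and `save_to_bytes` are hypothetical helpers, the dictionary keys are taken from the parameter description above, and the defaults of 100 and 1 are assumed to mirror the fastText command-line defaults rather than anything gensim prescribes.

```python
import io


def make_fb_parameters(lr_update_rate=100, word_ngrams=1):
    """Build the dict of parameters that gensim itself does not track."""
    # Default values are an assumption; adjust them to match the
    # settings the model was originally trained with.
    return {"lr_update_rate": lr_update_rate, "word_ngrams": word_ngrams}


def save_to_bytes(model, encoding="utf-8", **overrides):
    """Serialize a gensim FastText model into an in-memory .bin payload."""
    # Imported lazily so this sketch can be defined without gensim.
    from gensim.models._fasttext_bin import save

    buffer = io.BytesIO()  # any writeable binary stream is accepted
    save(model, buffer, make_fb_parameters(**overrides), encoding)
    return buffer.getvalue()
```

Passing a file name instead of the `BytesIO` buffer writes the same payload to disk.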
