models._fasttext_bin – Facebook’s fastText I/O
Load models from the native binary format released by Facebook.
The main entry point is the load() function.
It returns a Model namedtuple containing everything loaded from the binary.
Examples
Load a model from a binary file:
>>> from gensim.test.utils import datapath
>>> from gensim.models._fasttext_bin import load
>>> with open(datapath('crime-and-punishment.bin'), 'rb') as fin:
...     model = load(fin)
>>> model.nwords
291
>>> model.vectors_ngrams.shape
(391, 5)
>>> sorted(model.raw_vocab, key=lambda w: len(w), reverse=True)[:5]
['останавливаться', 'изворачиваться,', 'раздражительном', 'exceptionally', 'проскользнуть']
-
class gensim.models._fasttext_bin.Model(bucket, dim, epoch, hidden_output, loss, lr_update_rate, maxn, min_count, minn, model, neg, ntokens, nwords, raw_vocab, t, vectors_ngrams, vocab_size, word_ngrams, ws)
Bases: tuple
Holds data loaded from the Facebook binary.
- Parameters
dim (int) – The dimensionality of the vectors.
ws (int) – The window size.
epoch (int) – The number of training epochs.
neg (int) – If non-zero, indicates that the model uses negative sampling.
loss (int) – If equal to 1, indicates that the model uses hierarchical softmax.
model (int) – If equal to 2, indicates that the model uses skip-grams.
bucket (int) – The number of buckets.
min_count (int) – The threshold below which the model ignores terms.
t (float) – The sample threshold.
minn (int) – The minimum ngram length.
maxn (int) – The maximum ngram length.
raw_vocab (collections.OrderedDict) – A map from words (str) to their frequency (int). The order in the dict corresponds to the order of the words in the Facebook binary.
nwords (int) – The number of words.
vocab_size (int) – The size of the vocabulary.
vectors_ngrams (numpy.array) – This is a matrix that contains vectors learned by the model. Each row corresponds to a vector. The number of vectors is equal to the number of words plus the number of buckets. The number of columns is equal to the vector dimensionality.
hidden_output (numpy.array) – This is a matrix that contains the shallow neural network output. This array has the same dimensions as vectors_ngrams. May be None, in which case it is impossible to continue training the model.
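The integer codes in the parameter list can be turned into readable flags. The following is a minimal sketch that relies only on the three rules stated above (loss equal to 1, model equal to 2, neg non-zero) and on the row count described for vectors_ngrams. ModelStub and describe_training are hypothetical names used for illustration and are not part of gensim; the bucket value of 100 is an assumption chosen to match the (391, 5) shape in the example.

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the fields used below; a Model
# returned by load() exposes attributes with the same names.
ModelStub = namedtuple('ModelStub', 'loss model neg nwords bucket')


def describe_training(m):
    """Summarize the integer codes using only the rules documented above."""
    return {
        'hierarchical_softmax': m.loss == 1,
        'skip_gram': m.model == 2,
        'negative_sampling': m.neg != 0,
        # vectors_ngrams has one row per word plus one row per bucket.
        'expected_vector_rows': m.nwords + m.bucket,
    }


# Illustrative values: nwords comes from the example, bucket is assumed.
stub = ModelStub(loss=1, model=2, neg=0, nwords=291, bucket=100)
summary = describe_training(stub)
print(summary['expected_vector_rows'])  # 391
```

Codes other than the ones listed are deliberately left uninterpreted, since this page does not document them.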
-
__getitem__
() Return self[key].
-
bucket
Alias for field number 0
-
count
() Return number of occurrences of value.
-
dim
Alias for field number 1
-
epoch
Alias for field number 2
-
hidden_output
Alias for field number 3
-
index
() Return first index of value.
Raises ValueError if the value is not present.
-
loss
Alias for field number 4
-
lr_update_rate
Alias for field number 5
-
maxn
Alias for field number 6
-
min_count
Alias for field number 7
-
minn
Alias for field number 8
-
model
Alias for field number 9
-
neg
Alias for field number 10
-
ntokens
Alias for field number 11
-
nwords
Alias for field number 12
-
raw_vocab
Alias for field number 13
-
t
Alias for field number 14
-
vectors_ngrams
Alias for field number 15
-
vocab_size
Alias for field number 16
-
word_ngrams
Alias for field number 17
-
ws
Alias for field number 18
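Because Model is a tuple subclass, each "field number" is simply the position of that field in the tuple. The sketch below illustrates this with a hypothetical stand-in, ModelStub, built from the field names in the class signature above; it is not the gensim class itself, and each field is filled with its own index purely for demonstration.

```python
from collections import namedtuple

# Field names copied from the class signature above, in the same order.
FIELDS = (
    'bucket dim epoch hidden_output loss lr_update_rate maxn min_count minn '
    'model neg ntokens nwords raw_vocab t vectors_ngrams vocab_size '
    'word_ngrams ws'
)

# Hypothetical stand-in with the same field layout as Model.
ModelStub = namedtuple('ModelStub', FIELDS)

# Fill every field with its own position so names and indices can be compared.
stub = ModelStub(*range(len(ModelStub._fields)))

# Access by name and access by position return the same value.
same_value = stub.dim == stub[1]

# _fields gives the position of any field by name.
position = ModelStub._fields.index('hidden_output')

# _asdict() is convenient for logging or inspecting the loaded parameters.
as_dict = stub._asdict()
print(position, as_dict['ws'])  # 3 18
```

Access by name is usually preferable, since it does not depend on the field order.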
-
gensim.models._fasttext_bin.load(fin, encoding='utf-8', full_model=True)
Load a model from a binary stream.
- Parameters
fin (file) – The readable binary stream.
encoding (str, optional) – The encoding to use for decoding text.
full_model (boolean, optional) – If False, skips loading the hidden output matrix. This saves a fair bit of CPU time and RAM, but prevents training continuation.
- Returns
The loaded model.
- Return type
Model
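When the model is needed only for lookups, the hidden output matrix can be skipped. The following is a minimal sketch of a wrapper that opens a file in binary mode and calls load() with full_model=False. The name load_for_inference and its loader argument are hypothetical and exist only so the helper can be exercised without gensim installed; by default it imports load() from this module.

```python
def load_for_inference(path, loader=None):
    """Load a Facebook .bin file without the hidden output matrix.

    `loader` is a hypothetical injection point for illustration; when it is
    omitted, gensim's own load() is imported and used.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without gensim.
        from gensim.models._fasttext_bin import load as loader
    # load() expects a readable binary stream, so open with 'rb'.
    with open(path, 'rb') as fin:
        return loader(fin, full_model=False)


# Usage, assuming gensim and its test data are available:
#   from gensim.test.utils import datapath
#   model = load_for_inference(datapath('crime-and-punishment.bin'))
#   model.hidden_output is then expected to be None
```

As noted in the parameter description, a model loaded this way cannot be used to continue training.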
-
gensim.models._fasttext_bin.save(model, fout, fb_fasttext_parameters, encoding)
Save word embeddings in Facebook’s native fastText .bin format.
- Parameters
fout (file name or writeable binary stream) – The stream to which the model is saved.
model (gensim.models.fasttext.FastText) – The model to save.
fb_fasttext_parameters (dictionary) – A dictionary containing the parameters lr_update_rate and word_ngrams. The gensim implementation does not use them, so they have to be provided externally.
encoding (str) – The encoding used in the output file.
Notes
Unfortunately, Facebook’s native fastText .bin format is not documented.
This is a reimplementation of [FastText::saveModel](https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc),
based on v0.9.1, more precisely commit da2745fcccb848c7a225a7d558218ee4c64d5333.
The code follows the naming of the original C++ code.
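A call to save() needs the two externally supplied parameters packed into a dictionary. The sketch below shows one way to do that. The helper names make_fb_parameters and export_model, the saver argument, and the dictionary keys (taken from the parameter names in the description above) are assumptions for illustration. The values 100 and 1 are placeholders; use whatever settings the original model was trained with.

```python
def make_fb_parameters(lr_update_rate=100, word_ngrams=1):
    """Build the dictionary of parameters that gensim does not track.

    The default values are placeholders, not values read from any model.
    """
    return {'lr_update_rate': lr_update_rate, 'word_ngrams': word_ngrams}


def export_model(model, path, fb_parameters=None, encoding='utf-8', saver=None):
    """Write `model` to `path` in the Facebook .bin format.

    `saver` is a hypothetical injection point for illustration; when it is
    omitted, gensim's own save() is imported and used.
    """
    if saver is None:
        # Imported lazily so the helper can be defined without gensim.
        from gensim.models._fasttext_bin import save as saver
    if fb_parameters is None:
        fb_parameters = make_fb_parameters()
    # save() accepts a writeable binary stream, so open with 'wb'.
    with open(path, 'wb') as fout:
        saver(model, fout, fb_parameters, encoding)


# Usage, assuming `ft` is a trained gensim.models.fasttext.FastText model:
#   export_model(ft, 'model.bin')
```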