models.fasttext_inner – Cython routines for training FastText models

Optimized Cython functions for training a FastText model.

The main entry point is train_batch_any(), which may be called directly from Python code.

Notes

The implementations of the functions below depend heavily on the FastTextConfig struct defined in gensim/models/fasttext_inner.pxd.

The gensim.models.word2vec.FAST_VERSION value reports what flavor of BLAS we’re currently using:

  • 0: double

  • 1: float

  • 2: no BLAS, use Cython loops instead
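For example, you can inspect the value at runtime (a minimal sketch; the import path is the one named above):

    from gensim.models.word2vec import FAST_VERSION

    # 0, 1 or 2, per the list above; tells you which BLAS flavor (if any)
    # the compiled training routines are using.
    print(FAST_VERSION)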

gensim.models.fasttext_inner.compute_ngrams(word, unsigned int min_n, unsigned int max_n)

Get the list of all possible ngrams for a given word.

Parameters
  • word (str) – The word whose ngrams need to be computed.

  • min_n (unsigned int) – Minimum character length of the ngrams.

  • max_n (unsigned int) – Maximum character length of the ngrams.

Returns

Sequence of character ngrams.

Return type

list of str
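A minimal usage sketch (the exact ngrams returned may vary by gensim version, e.g. whether the '<'/'>' word-boundary markers are added for you):

    from gensim.models.fasttext_inner import compute_ngrams

    # All character ngrams of length 3..5 for the word "river". fastText-style
    # extraction works on the word wrapped in '<' and '>' boundary markers,
    # so typical results look like '<ri', 'riv', ..., 'er>'.
    ngrams = compute_ngrams("river", 3, 5)
    print(len(ngrams), ngrams[:4])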

gensim.models.fasttext_inner.compute_ngrams_bytes(word, unsigned int min_n, unsigned int max_n)

Compute ngrams for a word, returning them as bytes.

Ported from the original Facebook fastText implementation.

Parameters
  • word (str) – A unicode string.

  • min_n (unsigned int) – The minimum ngram length.

  • max_n (unsigned int) – The maximum ngram length.

Returns

A list of ngrams, where each ngram is a sequence of bytes.

Return type

list of bytes
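A minimal sketch, analogous to compute_ngrams() but producing bytes:

    from gensim.models.fasttext_inner import compute_ngrams_bytes

    # Same extraction as compute_ngrams(), but on the UTF-8 encoding of the
    # word, so each returned ngram is a bytes object ready for ft_hash_bytes().
    byte_ngrams = compute_ngrams_bytes("river", 3, 5)
    print(byte_ngrams[:4])  # e.g. [b'<ri', b'riv', ...], depending on version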

gensim.models.fasttext_inner.ft_hash_bytes(bytes bytez)

Calculate a hash based on bytez, reproducing the hash method from Facebook's fastText implementation.

Parameters

bytez (bytes) – The string whose hash needs to be calculated, encoded as UTF-8.

Returns

The hash of the string.

Return type

unsigned int
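A minimal sketch of how the hash is typically combined with a bucket count to pick a row in the ngram matrix; the bucket size of 2,000,000 is fastText's default and is used here purely for illustration:

    from gensim.models.fasttext_inner import compute_ngrams_bytes, ft_hash_bytes

    num_buckets = 2_000_000  # fastText's default bucket count; illustrative

    for ngram in compute_ngrams_bytes("river", 3, 5):
        h = ft_hash_bytes(ngram)   # unsigned 32-bit hash of the UTF-8 bytes
        row = h % num_buckets      # index into the ngram (bucket) embedding matrix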

gensim.models.fasttext_inner.init()

Precompute function sigmoid(x) = 1 / (1 + exp(-x)), for x values discretized into table EXP_TABLE. Also calculate log(sigmoid(x)) into LOG_TABLE.

We recalculate, rather than re-use the table from word2vec_inner, because Facebook's fastText code uses a 512-slot table rather than the 1000-slot table of word2vec.c.
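For intuition, here is a pure-Python sketch of that precomputation. The 512-slot size comes from the note above; MAX_EXP is illustrative, and the actual constants live in fasttext_inner.pxd:

    import numpy as np

    EXP_TABLE_SIZE = 512   # fastText-style table size (see note above)
    MAX_EXP = 6            # illustrative clipping range; real constant is in fasttext_inner.pxd

    # Discretize x uniformly over (-MAX_EXP, MAX_EXP), then tabulate
    # sigmoid(x) and log(sigmoid(x)).
    x = (np.arange(EXP_TABLE_SIZE) / EXP_TABLE_SIZE * 2.0 - 1.0) * MAX_EXP
    EXP_TABLE = 1.0 / (1.0 + np.exp(-x))   # sigmoid(x)
    LOG_TABLE = np.log(EXP_TABLE)          # log(sigmoid(x))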

gensim.models.fasttext_inner.train_batch_any(model, sentences, alpha, _work, _neu1)

Update the model by training on a sequence of sentences.

Each sentence is a list of string tokens, which are looked up in the model’s vocab dictionary. Called internally from train().

Parameters
  • model (FastText) – Model to be trained.

  • sentences (iterable of list of str) – A single batch: part of the corpus streamed directly from disk/network.

  • alpha (float) – Learning rate.

  • _work (np.ndarray) – Private working memory for each worker.

  • _neu1 (np.ndarray) – Private working memory for each worker.

Returns

Effective number of words trained.

Return type

int
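In normal use this function is invoked for you by FastText.train(). A direct call would look roughly like the sketch below; sizing the scratch buffers to the model's vector_size is an assumption made here for illustration, since train() allocates these per worker thread itself:

    import numpy as np
    from gensim.models import FastText
    from gensim.models.fasttext_inner import train_batch_any

    sentences = [["the", "quick", "brown", "fox"],
                 ["jumps", "over", "the", "lazy", "dog"]]

    model = FastText(vector_size=32, min_count=1)
    model.build_vocab(sentences)

    # Per-worker scratch buffers; float32 matches the model's internal REAL type.
    work = np.zeros(model.vector_size, dtype=np.float32)
    neu1 = np.zeros(model.vector_size, dtype=np.float32)

    effective_words = train_batch_any(model, sentences, 0.025, work, neu1)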