models.nmf – Non-Negative Matrix Factorization

Online Non-Negative Matrix Factorization. Implementation of the efficient incremental algorithm of Renbo Zhao and Vincent Y. F. Tan. [PDF].

This NMF implementation updates in a streaming fashion and works best with sparse corpora.

  • W is a word-topic matrix

  • h is a topic-document matrix

  • v is an input corpus batch, word-document matrix

  • A, B - matrices that accumulate information from every consecutive chunk. A = h.dot(ht), B = v.dot(ht).

The idea of the algorithm is as follows:

Initialize W, A and B matrices

Input the corpus
Split the corpus into batches

for v in batches:
    infer h:
        do a coordinate gradient descent step to find h that minimizes the l2 norm of (v - Wh)

        bound h so that it is non-negative

    update A and B:
        A = h.dot(ht)
        B = v.dot(ht)

    update W:
        do a gradient descent step to find W that minimizes 0.5*trace(WtWA) - trace(WtB)
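
To make these update rules concrete, below is a minimal NumPy sketch of a single batch update. It is illustrative only: the helper name process_batch, the random initialization of h, the step sizes and the fixed iteration counts are assumptions, and the actual implementation in this module adds stopping conditions, residual handling and sparse-matrix optimizations.

import numpy as np

def process_batch(W, A, B, v, kappa=1.0, h_iters=50, w_iters=200):
    """Simplified online NMF step on one batch v of shape (n_tokens, n_docs)."""
    n_topics = W.shape[1]
    n_docs = v.shape[1]

    # Infer h: projected gradient descent on the l2 norm of (v - Wh).
    h = np.random.rand(n_topics, n_docs)
    WtW = W.T.dot(W)
    Wtv = W.T.dot(v)
    eta_h = kappa / np.linalg.norm(WtW)
    for _ in range(h_iters):
        h -= eta_h * (WtW.dot(h) - Wtv)  # gradient of 0.5*||v - Wh||^2 w.r.t. h
        np.maximum(h, 0.0, out=h)        # bound h so that it is non-negative

    # Accumulate the sufficient statistics of this chunk.
    A += h.dot(h.T)
    B += v.dot(h.T)

    # Update W: projected gradient descent on 0.5*trace(WtWA) - trace(WtB).
    eta_w = kappa / np.linalg.norm(A)
    for _ in range(w_iters):
        W -= eta_w * (W.dot(A) - B)      # gradient w.r.t. W
        np.maximum(W, 0.0, out=W)        # keep W non-negative

    return W, A, B, h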

Examples

Train an NMF model using a Gensim corpus

>>> from gensim.models.nmf import Nmf
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora.dictionary import Dictionary
>>>
>>> # Create a corpus from a list of texts
>>> common_dictionary = Dictionary(common_texts)
>>> common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
>>>
>>> # Train the model on the corpus.
>>> nmf = Nmf(common_corpus, num_topics=10)

Save a model to disk, or reload a pre-trained model

>>> from gensim.test.utils import datapath
>>>
>>> # Save model to disk.
>>> temp_file = datapath("model")
>>> nmf.save(temp_file)
>>>
>>> # Load a potentially pretrained model from disk.
>>> nmf = Nmf.load(temp_file)

Infer vectors for new documents

>>> # Create a new corpus, made of previously unseen documents.
>>> other_texts = [
...     ['computer', 'time', 'graph'],
...     ['survey', 'response', 'eps'],
...     ['human', 'system', 'computer']
... ]
>>> other_corpus = [common_dictionary.doc2bow(text) for text in other_texts]
>>>
>>> unseen_doc = other_corpus[0]
>>> vector = nmf[unseen_doc]  # get topic probability distribution for a document

Update the model by incrementally training on the new corpus

>>> nmf.update(other_corpus)
>>> vector = nmf[unseen_doc]

A lot of parameters can be tuned to optimize training for your specific case

>>> nmf = Nmf(common_corpus, num_topics=50, kappa=0.1, eval_every=5)  # decrease training step size

NMF should be used whenever one needs an extremely fast and memory-optimized topic model.

class gensim.models.nmf.Nmf(corpus=None, num_topics=100, id2word=None, chunksize=2000, passes=1, kappa=1.0, minimum_probability=0.01, w_max_iter=200, w_stop_condition=0.0001, h_max_iter=50, h_stop_condition=0.001, eval_every=10, normalize=True, random_state=None)

Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel

Online Non-Negative Matrix Factorization.

Renbo Zhao, Vincent Y. F. Tan: “Online Nonnegative Matrix Factorization with Outliers”

Parameters
  • corpus (iterable of list of (int, float) or csc_matrix with the shape (n_tokens, n_documents), optional) – Training corpus. Can be either iterable of documents, which are lists of (word_id, word_count), or a sparse csc matrix of BOWs for each document. If not specified, the model is left uninitialized (presumably, to be trained later with self.update()).

  • num_topics (int, optional) – Number of topics to extract.

  • id2word ({dict of (int, str), gensim.corpora.dictionary.Dictionary}) – Mapping from word IDs to words. It is used to determine the vocabulary size, as well as for debugging and topic printing.

  • chunksize (int, optional) – Number of documents to be used in each training chunk.

  • passes (int, optional) – Number of full passes over the training corpus. Leave at default passes=1 if your input is an iterator.

  • kappa (float, optional) – Gradient descent step size. Larger value makes the model train faster, but could lead to non-convergence if set too large.

  • minimum_probability (float, optional) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.

  • w_max_iter (int, optional) – Maximum number of iterations to train W per each batch.

  • w_stop_condition (float, optional) – If error difference gets less than that, training of W stops for the current batch.

  • h_max_iter (int, optional) – Maximum number of iterations to train h per each batch.

  • h_stop_condition (float, optional) – If error difference gets less than that, training of h stops for the current batch.

  • eval_every (int, optional) – Number of batches after which l2 norm of (v - Wh) is computed. Decreases performance if set too low.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

  • random_state ({np.random.RandomState, int}, optional) – Seed for random generator. Needed for reproducibility.
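
A usage sketch showing several of these parameters together, reusing common_corpus and common_dictionary from the Examples section above (the specific values are arbitrary):

>>> nmf = Nmf(
...     common_corpus,
...     num_topics=10,
...     id2word=common_dictionary,
...     chunksize=100,
...     passes=5,
...     kappa=1.0,
...     random_state=42,
... )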

get_document_topics(bow, minimum_probability=None, normalize=None)

Get the topic distribution for the given document.

Parameters
  • bow (list of (int, float)) – The document in BOW format.

  • minimum_probability (float) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

Topic distribution for the whole document. Each element in the list is a pair of a topic’s id, and the probability that was assigned to it.

Return type

list of (int, float)
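
A usage sketch, reusing the nmf model and common_corpus from the Examples section above:

>>> bow = common_corpus[0]
>>> doc_topics = nmf.get_document_topics(bow, minimum_probability=0.05)
>>> # doc_topics is a list of (topic_id, probability) pairs for this document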

get_term_topics(word_id, minimum_probability=None, normalize=None)

Get the most relevant topics to the given word.

Parameters
  • word_id (int) – The word for which the topic distribution will be computed.

  • minimum_probability (float, optional) – If normalize is True, topics with smaller probabilities are filtered out. If normalize is False, topics with smaller factors are filtered out. If set to None, a value of 1e-8 is used to prevent 0s.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

Return type

list of (int, float)
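
A usage sketch, reusing the nmf model and common_dictionary from the Examples section above:

>>> word_id = common_dictionary.token2id['graph']
>>> term_topics = nmf.get_term_topics(word_id, minimum_probability=0.01)
>>> # term_topics is a list of (topic_id, probability) pairs, sorted by relevance to 'graph'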

get_topic_terms(topicid, topn=10, normalize=None)

Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic() which represents words by the actual strings.

Parameters
  • topicid (int) – The ID of the topic to be returned.

  • topn (int, optional) – Number of the most significant words that are associated with the topic.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

Word ID - probability pairs for the most relevant words generated by the topic.

Return type

list of (int, float)
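
A usage sketch, reusing the nmf model and common_dictionary from the Examples section above:

>>> id_probs = nmf.get_topic_terms(topicid=0, topn=5)
>>> # Map the word IDs back to strings via the dictionary
>>> word_probs = [(common_dictionary[word_id], prob) for word_id, prob in id_probs]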

get_topics(normalize=None)

Get the term-topic matrix learned during inference.

Parameters

normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

The probability for each word in each topic, shape (num_topics, vocabulary_size).

Return type

numpy.ndarray
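
A usage sketch, reusing the nmf model from the Examples section above:

>>> topic_term_matrix = nmf.get_topics()
>>> num_topics, vocab_size = topic_term_matrix.shape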

l2_norm(v)
classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).
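
A usage sketch, reusing temp_file from the Examples section above; mmap=’r’ only has an effect if the large arrays were stored in separate files when saving:

>>> nmf = Nmf.load(temp_file, mmap='r')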

print_topic(topicno, topn=10)

Get a single topic as a formatted string.

Parameters
  • topicno (int) – Topic id.

  • topn (int) – Number of words from topic that will be used.

Returns

String representation of topic, like ‘-0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘.

Return type

str
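
A usage sketch, reusing the nmf model from the Examples section above:

>>> topic_str = nmf.print_topic(topicno=0, topn=5)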

print_topics(num_topics=20, num_words=10)

Get the most significant topics (alias for show_topics() method).

Parameters
  • num_topics (int, optional) – The number of topics to be selected; if -1, all topics will be in the result (ordered by significance).

  • num_words (int, optional) – The number of words to be included per topic (ordered by significance).

Returns

Sequence with (topic_id, [(word, value), … ]).

Return type

list of (int, list of (str, float))
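
A usage sketch, reusing the nmf model from the Examples section above:

>>> all_topics = nmf.print_topics(num_topics=-1, num_words=5)  # all topics, 5 words each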

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing of the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.
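
A usage sketch, reusing the nmf model and temp_file from the Examples section above (the sep_limit value is arbitrary):

>>> nmf.save(temp_file, sep_limit=1024 * 1024)  # store arrays larger than 1 MiB in separate files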

show_topic(topicid, topn=10, normalize=None)

Get the representation for a single topic. Words here are the actual strings, in contrast to get_topic_terms() that represents words by their vocabulary ID.

Parameters
  • topicid (int) – The ID of the topic to be returned.

  • topn (int, optional) – Number of the most significant words that are associated with the topic.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

Word - probability pairs for the most relevant words generated by the topic.

Return type

list of (str, float)
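
A usage sketch, reusing the nmf model from the Examples section above:

>>> word_probs = nmf.show_topic(topicid=0, topn=5)
>>> # word_probs is a list of (word, probability) pairs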

show_topics(num_topics=10, num_words=10, log=False, formatted=True, normalize=None)

Get the topics sorted by sparsity.

Parameters
  • num_topics (int, optional) – Number of topics to be returned. Unlike LSA, there is no natural ordering between the topics in NMF. The returned subset of all topics is therefore arbitrary and may change between two NMF training runs.

  • num_words (int, optional) – Number of words to be presented for each topic. These will be the most relevant words (assigned the highest probability for each topic).

  • log (bool, optional) – Whether the result is also logged, besides being returned.

  • formatted (bool, optional) – Whether the topic representations should be formatted as strings. If False, they are returned as lists of (word, probability) 2-tuples.

  • normalize (bool or None, optional) – Whether to normalize the result. Allows for estimation of perplexity, coherence, etc.

Returns

a list of topics, each represented either as a string (when formatted == True) or word-probability pairs.

Return type

list of {str, tuple of (str, float)}
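
A usage sketch, reusing the nmf model from the Examples section above:

>>> topics = nmf.show_topics(num_topics=5, num_words=5, formatted=False)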

top_topics(corpus, texts=None, dictionary=None, window_size=None, coherence='u_mass', topn=20, processes=-1)

Get the topics sorted by coherence.

Parameters
  • corpus (iterable of list of (int, float) or csc_matrix with the shape (n_tokens, n_documents)) – Training corpus. Can be either iterable of documents, which are lists of (word_id, word_count), or a sparse csc matrix of BOWs for each document.

  • texts (list of list of str, optional) – Tokenized texts, needed for coherence models that use a sliding window based probability estimator (i.e. coherence=`c_something`).

  • dictionary ({dict of (int, str), gensim.corpora.dictionary.Dictionary}, optional) – Dictionary mapping of word IDs to words, used to create the corpus. If model.id2word is present, this is not needed. If both are provided, the passed dictionary will be used.

  • window_size (int, optional) – The size of the window to be used for coherence measures using a boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None, the default window sizes are used: ‘c_v’ - 110, ‘c_uci’ - 10, ‘c_npmi’ - 10.

  • coherence ({'u_mass', 'c_v', 'c_uci', 'c_npmi'}, optional) – Coherence measure to be used. The fastest method is ‘u_mass’; ‘c_uci’ is also known as c_pmi. For ‘u_mass’, corpus should be provided; if texts is provided, it will be converted to a corpus using the dictionary. For ‘c_v’, ‘c_uci’ and ‘c_npmi’, texts should be provided (corpus isn’t needed).

  • topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.

  • processes (int, optional) – Number of processes to use for probability estimation phase, any value less than 1 will be interpreted as num_cpus - 1.

Returns

Each element in the list is a pair of a topic representation and its coherence score. Topic representations are distributions of words, represented as a list of pairs of word IDs and their probabilities.

Return type

list of (list of (int, str), float)
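
A usage sketch, reusing the nmf model and common_corpus from the Examples section above:

>>> scored_topics = nmf.top_topics(common_corpus, coherence='u_mass', topn=5)
>>> best_topic, best_coherence = scored_topics[0]  # topics are sorted by coherence, best first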

update(corpus, chunksize=None, passes=None, eval_every=None)

Train the model with new documents.

Parameters
  • corpus (iterable of list of (int, float) or csc_matrix with the shape (n_tokens, n_documents)) – Training corpus. Can be either iterable of documents, which are lists of (word_id, word_count), or a sparse csc matrix of BOWs for each document.

  • chunksize (int, optional) – Number of documents to be used in each training chunk.

  • passes (int, optional) – Number of full passes over the training corpus. Leave at default passes=1 if your input is an iterator.

  • eval_every (int, optional) – Number of batches after which l2 norm of (v - Wh) is computed. Decreases performance if set too low.
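
A usage sketch, reusing the nmf model and other_corpus from the Examples section above (the parameter values are arbitrary):

>>> nmf.update(other_corpus, chunksize=100, passes=2, eval_every=1)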