`models.atmodel` – Author-topic models

Author-topic model.

This module trains the author-topic model on documents and corresponding author-document dictionaries. The training is online and is constant in memory w.r.t. the number of documents. The model is not constant in memory w.r.t. the number of authors.

The model can be updated with additional documents after training has been completed. It is also possible to continue training on the existing data.

The model is closely related to LdaModel. The AuthorTopicModel class inherits from LdaModel, and its usage is thus similar.

The model was introduced by Rosen-Zvi and co-authors: “The Author-Topic Model for Authors and Documents”. The model correlates the authorship information with the topics to give better insight into the subject knowledge of an author.

Example

>>> from gensim.models import AuthorTopicModel
>>> from gensim.corpora import mmcorpus
>>> from gensim.test.utils import common_dictionary, datapath, temporary_file

>>> author2doc = {
...     'john': [0, 1, 2, 3, 4, 5, 6],
...     'jane': [2, 3, 4, 5, 6, 7, 8],
...     'jack': [0, 2, 4, 6, 8]
... }
>>>
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
>>>
>>> with temporary_file("serialized") as s_path:
...     model = AuthorTopicModel(
...         corpus, author2doc=author2doc, id2word=common_dictionary, num_topics=4,
...         serialized=True, serialization_path=s_path
...     )
...
...     model.update(corpus, author2doc)  # update the author-topic model with additional documents
>>>
>>> # construct vectors for authors
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]

class gensim.models.atmodel.AuthorTopicModel(corpus=None, num_topics=100, id2word=None, author2doc=None, doc2author=None, chunksize=2000, passes=1, iterations=50, decay=0.5, offset=1.0, alpha='symmetric', eta='symmetric', update_every=1, eval_every=10, gamma_threshold=0.001, serialized=False, serialization_path=None, minimum_probability=0.01, random_state=None)

Bases: gensim.models.ldamodel.LdaModel

The constructor estimates the author-topic model parameters based on a training corpus.

Parameters

corpus (iterable of list of (int, float), optional) – Corpus in BoW format
num_topics (int, optional) – Number of topics to be extracted from the training corpus.
id2word (Dictionary, optional) – A mapping from word ids (integers) to words (strings).
author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
chunksize (int, optional) – Controls the size of the mini-batches.
passes (int, optional) – Number of times the model makes a pass over the entire training data.
iterations (int, optional) – Maximum number of times the model loops over each document.
decay (float, optional) – Controls how old documents are forgotten.
offset (float, optional) – Controls down-weighting of iterations.
alpha (float, optional) – Hyperparameters for author-topic model. Supports special values of ‘asymmetric’ and ‘auto’: the former uses a fixed normalized asymmetric 1.0/topicno prior; the latter learns an asymmetric prior directly from your data.
eta (float, optional) – Hyperparameters for author-topic model.
update_every (int, optional) – Make updates in topic probability for latest mini-batch.
eval_every (int, optional) – Calculate and estimate log perplexity for latest mini-batch.
gamma_threshold (float, optional) – Threshold value of gamma (topic difference between two consecutive topics) until which the iterations continue.
serialized (bool, optional) – Indicates whether the input corpora to the model are simple lists or saved to the hard-drive.
serialization_path (str, optional) – Must be set to a file path if serialized=True is used.
minimum_probability (float, optional) – Controls filtering the topics returned for a document (bow).
random_state ({int, numpy.random.RandomState}, optional) – Set the state of the random number generator inside the author-topic model.
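
The author2doc and doc2author mappings described above carry the same information in opposite directions. As a minimal sketch in plain Python, independent of gensim, one mapping can be derived from the other. The helper name invert_author2doc is hypothetical and is not part of the library.

```python
def invert_author2doc(author2doc):
    """Build a doc2author mapping from an author2doc mapping (illustrative helper)."""
    doc2author = {}
    for author, doc_ids in author2doc.items():
        for doc_id in doc_ids:
            # Each document collects the names of all authors who list it.
            doc2author.setdefault(doc_id, []).append(author)
    return doc2author


author2doc = {
    'john': [0, 1, 2],
    'jane': [2, 3],
}
doc2author = invert_author2doc(author2doc)
print(doc2author)
```

Document 2 ends up with both authors, which is the case that distinguishes the author-topic setting from a one-author-per-document corpus.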

bound(chunk, chunk_doc_idx=None, subsample_ratio=1.0, author2doc=None, doc2author=None)

Estimate the variational bound of documents from corpus.

$\mathbb{E_{q}}[\log p(corpus)] - \mathbb{E_{q}}[\log q(corpus)]$

Notes

There are basically two use cases of this method:

chunk is a subset of the training corpus, and chunk_doc_idx is provided, indicating the indexes of the documents in the training corpus.
chunk is a test set (held-out data), and author2doc and doc2author corresponding to this test set are provided. No new authors may be passed to this method. chunk_doc_idx is not needed in this case.

Parameters

chunk (iterable of list of (int, float)) – Corpus in BoW format.
chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.
subsample_ratio (float, optional) – Used for calculation of word score for estimation of variational bound.
author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.

Returns

Value of variational bound score.

Return type

float

clear(): Clear the model’s state to free some memory. Used in the distributed implementation.

compute_phinorm(expElogthetad, expElogbetad)

Efficiently computes the normalizing factor in phi.

Parameters

expElogthetad (numpy.ndarray) – Value of variational distribution $q(\theta|\gamma)$ .
expElogbetad (numpy.ndarray) – Value of variational distribution $q(\beta|\lambda)$ .

Returns

Value of normalizing factor.

Return type

float
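
A rough sketch of how such a normalizing factor could be computed, in plain Python rather than NumPy. It assumes that expElogthetad has one row per author of the document and one column per topic, that expElogbetad has one row per topic and one column per word in the document, and that one normalizer is produced per word. The function name and the small eps constant are illustrative assumptions, not the library's code.

```python
def compute_phinorm_sketch(exp_elog_theta_d, exp_elog_beta_d, eps=1e-100):
    """Return one normalizer per word (illustrative, list-based)."""
    num_topics = len(exp_elog_beta_d)
    num_words = len(exp_elog_beta_d[0])
    # Collapse the author dimension once by summing the author rows per topic.
    theta_sum = [sum(row[k] for row in exp_elog_theta_d) for k in range(num_topics)]
    # Dot product of the summed topic weights with each word column.
    return [
        sum(theta_sum[k] * exp_elog_beta_d[k][w] for k in range(num_topics)) + eps
        for w in range(num_words)
    ]


theta = [[0.5, 0.5], [0.25, 0.75]]          # 2 authors x 2 topics
beta = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # 2 topics x 3 words
phinorm = compute_phinorm_sketch(theta, beta)
```

Summing over authors before multiplying avoids building the full author-by-topic-by-word array, which is where the efficiency would come from under these assumptions.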

diff(other, distance='kullback_leibler', num_words=100, n_ann_terms=10, diagonal=False, annotation=True, normed=True)

Calculate the difference in topic distributions between two models: self and other.

Parameters

other (LdaModel) – The model which will be compared against the current object.
distance ({'kullback_leibler', 'hellinger', 'jaccard', 'jensen_shannon'}) – The distance metric to calculate the difference with.
num_words (int, optional) – The number of most relevant words used if distance == ‘jaccard’. Also used for annotating topics.
n_ann_terms (int, optional) – Max number of words in intersection/symmetric difference between topics. Used for annotation.
diagonal (bool, optional) – Whether we need the difference between identical topics (the diagonal of the difference matrix).
annotation (bool, optional) – Whether the intersection or difference of words between two topics should be returned.
normed (bool, optional) – Whether the matrix should be normalized or not.

Returns

numpy.ndarray – A difference matrix. Each element corresponds to the difference between the two topics, shape (self.num_topics, other.num_topics)
numpy.ndarray, optional – Annotation matrix where for each pair we include the word from the intersection of the two topics, and the word from the symmetric difference of the two topics. Only included if annotation == True. Shape (self.num_topics, other.num_topics, 2).

Examples

Get the differences between each pair of topics inferred by two models

>>> from gensim.models.ldamulticore import LdaMulticore
>>> from gensim.test.utils import datapath
>>>
>>> m1 = LdaMulticore.load(datapath("lda_3_0_1_model"))
>>> m2 = LdaMulticore.load(datapath("ldamodel_python_3_5"))
>>> mdiff, annotation = m1.diff(m2)
>>> topic_diff = mdiff  # get matrix with difference for each topic pair from `m1` and `m2`
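
To make the 'jaccard' distance and the annotation entries more concrete, here is a small plain-Python sketch for a single pair of topics, each represented by its top words. The helper names jaccard_distance and annotate are hypothetical, and the word lists are made up for illustration.

```python
def jaccard_distance(words_a, words_b):
    """1 - |A & B| / |A | B| for two collections of top words."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def annotate(words_a, words_b):
    """Return (shared words, words in exactly one topic), both sorted."""
    set_a, set_b = set(words_a), set(words_b)
    return sorted(set_a & set_b), sorted(set_a ^ set_b)


topic_a = ['graph', 'trees', 'minors', 'system']
topic_b = ['graph', 'user', 'system', 'time']
distance = jaccard_distance(topic_a, topic_b)
shared, differing = annotate(topic_a, topic_b)
```

Repeating this for every pair of topics from the two models would fill a matrix of the shape described under Returns.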

do_estep(chunk, author2doc, doc2author, rhot, state=None, chunk_doc_idx=None)

Perform inference (E-step) on a chunk of documents, and accumulate the collected sufficient statistics.

Parameters

chunk (iterable of list of (int, float)) – Corpus in BoW format.
author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
rhot (float) – Value of rho for conducting inference on documents.
state (int, optional) – Initializes the state for a new E iteration.
chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.

Returns

Value of gamma for training of model.

Return type

float

do_mstep(rho, other, extra_pass=False)

Maximization step: use linear interpolation between the existing topics and collected sufficient statistics in other to update the topics.

Parameters

rho (float) – Learning rate.
other (LdaModel) – The model whose sufficient statistics will be used to update the topics.
extra_pass (bool, optional) – Whether this step required an additional pass over the corpus.
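
The linear interpolation mentioned above can be sketched in plain Python. This is a simplified illustration on flat lists, not the library's implementation. The learning-rate schedule (offset + step) ** -decay is an assumption borrowed from the usual online variational Bayes formulation, and both function names are hypothetical.

```python
def learning_rate(step, offset=1.0, decay=0.5):
    """Assumed schedule: rho shrinks as more updates are seen."""
    return (offset + step) ** -decay


def blend(current, update, rho):
    """Linear interpolation: keep (1 - rho) of the old value, take rho of the new."""
    return [(1.0 - rho) * c + rho * u for c, u in zip(current, update)]


rho = learning_rate(step=3)              # (1.0 + 3) ** -0.5
topics = blend([1.0, 3.0], [3.0, 1.0], rho)
```

With rho close to 1 the new statistics dominate, and with rho close to 0 the existing topics are barely changed, which is how decay and offset influence how quickly old documents are forgotten.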

extend_corpus(corpus)

Add new documents from corpus to self.corpus.

If serialization is used, then the entire corpus (self.corpus) is re-serialized and the new documents are added in the process. If serialization is not used, the corpus, as a list of documents, is simply extended.

Parameters: corpus (iterable of list of (int, float)) – Corpus in BoW format
Raises: AssertionError – If serialized == False and corpus isn’t list.
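
For the non-serialized case, the behavior described above amounts to a checked list extension. A minimal sketch, assuming the corpus is held as an in-memory list; the function name extend_in_memory is hypothetical.

```python
def extend_in_memory(existing_corpus, new_corpus):
    """Append new BoW documents to an in-memory corpus (illustrative)."""
    # Mirrors the documented requirement: without serialization, input must be a list.
    assert isinstance(new_corpus, list), "new_corpus must be a list when serialization is off"
    existing_corpus.extend(new_corpus)
    return existing_corpus


corpus = [[(0, 1.0), (1, 2.0)]]
extend_in_memory(corpus, [[(2, 1.0)], [(0, 3.0)]])
```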

get_author_topics(author_name, minimum_probability=None)

Get the topic distribution of the given author.

Parameters

author_name (str) – Name of the author for which the topic distribution needs to be estimated.
minimum_probability (float, optional) – Sets the minimum probability value for showing the topics of a given author, topics with probability < minimum_probability will be ignored.

Returns

Topic distribution of an author.

Return type

list of (int, float)

Example

>>> from gensim.models import AuthorTopicModel
>>> from gensim.corpora import mmcorpus
>>> from gensim.test.utils import common_dictionary, datapath, temporary_file

>>> author2doc = {
...     'john': [0, 1, 2, 3, 4, 5, 6],
...     'jane': [2, 3, 4, 5, 6, 7, 8],
...     'jack': [0, 2, 4, 6, 8]
... }
>>>
>>> corpus = mmcorpus.MmCorpus(datapath('testcorpus.mm'))
>>>
>>> with temporary_file("serialized") as s_path:
...     model = AuthorTopicModel(
...         corpus, author2doc=author2doc, id2word=common_dictionary, num_topics=4,
...         serialized=True, serialization_path=s_path
...     )
...
...     model.update(corpus, author2doc)  # update the author-topic model with additional documents
>>>
>>> # construct vectors for authors
>>> author_vecs = [model.get_author_topics(author) for author in model.id2author.values()]
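
The effect of minimum_probability can be illustrated without a trained model. The sketch below assumes that an author's unnormalized topic weights are first normalized to sum to one and then filtered. The function name filter_topics and the input values are made up for illustration.

```python
def filter_topics(topic_weights, minimum_probability=0.01):
    """Normalize weights and keep (topic_id, probability) pairs above the threshold."""
    total = sum(topic_weights)
    distribution = [value / total for value in topic_weights]
    return [
        (topic_id, prob)
        for topic_id, prob in enumerate(distribution)
        if prob >= minimum_probability
    ]


# Four topics; the second one has a very small share and is dropped.
author_topics = filter_topics([8.0, 0.0625, 4.0, 3.9375], minimum_probability=0.01)
```

Because low-probability topics are dropped, the returned list can be shorter than num_topics and its probabilities need not sum to one.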

get_document_topics(word_id, minimum_probability=None)

Overrides get_document_topics() and simply raises an exception.

Warning

This method is invalid for this model; use get_author_topics() or get_new_author_topics() instead.

Raises: NotImplementedError – Always.

get_new_author_topics(corpus, minimum_probability=None)

Infers topics for new author.

Infers a topic distribution for a new author over the passed corpus of docs, assuming that all documents are from this single new author.

Parameters

corpus (iterable of list of (int, float)) – Corpus in BoW format.
minimum_probability (float, optional) – Ignore topics with probability below this value; if None, 1e-8 is used.

Returns

Topic distribution for the given corpus.

Return type

list of (int, float)

get_term_topics(word_id, minimum_probability=None)

Get the most relevant topics to the given word.

Parameters

word_id (int) – The word for which the topic distribution will be computed.
minimum_probability (float, optional) – Topics with an assigned probability below this threshold will be discarded.

Returns

The relevant topics represented as pairs of their ID and their assigned probability, sorted by relevance to the given word.

Return type

list of (int, float)

get_topic_terms(topicid, topn=10)

Get the representation for a single topic. Words are represented by their integer IDs, in contrast to show_topic(), which represents words by the actual strings.

Parameters

topicid (int) – The ID of the topic to be returned
topn (int, optional) – Number of the most significant words that are associated with the topic.

Returns

Word ID - probability pairs for the most relevant words generated by the topic.

Return type

list of (int, float)
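
Selecting the topn most probable word IDs from one topic row can be sketched as follows. This is an illustration in plain Python with a made-up probability row, and the helper name top_terms is hypothetical.

```python
def top_terms(topic_row, topn=10):
    """Return the topn (word_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(topic_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]


topic_row = [0.05, 0.40, 0.10, 0.30, 0.15]   # probabilities for word IDs 0..4
best = top_terms(topic_row, topn=3)
```

Mapping each word ID through id2word would then give the string form that show_topic() reports.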

get_topics()

Get the term-topic matrix learned during inference.

Returns: The probability for each word in each topic, shape (num_topics, vocabulary_size).
Return type: numpy.ndarray

inference(chunk, author2doc, doc2author, rhot, collect_sstats=False, chunk_doc_idx=None)

Given a chunk of sparse document vectors, update gamma for each author corresponding to the chunk.

Warning

The whole input chunk of documents is assumed to fit in RAM; chunking of a large corpus must be done earlier in the pipeline.

Avoids computing the phi variational parameter directly using the optimization presented in Lee, Seung: “Algorithms for non-negative matrix factorization”, NIPS 2001.

Parameters

chunk (iterable of list of (int, float)) – Corpus in BoW format.
author2doc (dict of (str, list of int), optional) – A dictionary where keys are the names of authors and values are lists of document IDs that the author contributes to.
doc2author (dict of (int, list of str), optional) – A dictionary where the keys are document IDs and the values are lists of author names.
rhot (float) – Value of rho for conducting inference on documents.
collect_sstats (boolean, optional) – If True, collect sufficient statistics needed to update the model’s topic-word distributions, and return (gamma_chunk, sstats). Otherwise, return (gamma_chunk, None). gamma_chunk is of shape len(chunk_authors) x self.num_topics, where chunk_authors is the number of authors in the documents in the current chunk.
chunk_doc_idx (numpy.ndarray, optional) – Assigns the value for document index.

Returns

gamma_chunk and sstats (if collect_sstats == True, otherwise - None)

Return type

(numpy.ndarray, numpy.ndarray)

init_dir_prior(prior, name)

Initialize priors for the Dirichlet distribution.

Parameters

prior ({str, list of float, numpy.ndarray of float, float}) –
A-priori belief on word probability. If name == ‘eta’ then the prior can be:
- scalar for a symmetric prior over topic/word probability,
- vector of length num_words to denote an asymmetric user defined probability for each word,
- matrix of shape (num_topics, num_words) to assign a probability for each word-topic combination,
- the string ‘auto’ to learn the asymmetric prior from the data.
If name == ‘alpha’, then the prior can be:
- a 1D array of length equal to the number of expected topics,
- ’symmetric’: Uses a fixed symmetric prior per topic,
- ’asymmetric’: Uses a fixed normalized asymmetric prior of 1.0 / (topic_index + sqrt(num_topics)),
- ’auto’: Learns an asymmetric prior from the corpus.
name ({'alpha', 'eta'}) – Whether the prior is parameterized by the alpha vector (1 parameter per topic) or by the eta (1 parameter per unique term in the vocabulary).
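
The ‘asymmetric’ alpha option listed above can be worked through by hand. The sketch below computes 1.0 / (topic_index + sqrt(num_topics)) for each topic and normalizes the result so it sums to one. The function name asymmetric_alpha is hypothetical.

```python
import math


def asymmetric_alpha(num_topics):
    """Normalized asymmetric prior: earlier topics receive more weight."""
    raw = [1.0 / (topic_index + math.sqrt(num_topics)) for topic_index in range(num_topics)]
    total = sum(raw)
    return [value / total for value in raw]


# For 4 topics the raw values are 1/2, 1/3, 1/4 and 1/5 before normalization.
alpha = asymmetric_alpha(4)
```

The weights decrease with the topic index, so lower-numbered topics are given a larger prior mass.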

init_empty_corpus(): Initialize an empty corpus. If the corpora are to be treated as lists, simply initialize an empty list. If serialization is used, initialize an empty corpus using MmCorpus.

classmethod load(fname, *args, **kwargs)

Load a previously saved gensim.models.ldamodel.LdaModel from file.
