models.coherencemodel – Topic coherence pipeline
Calculate topic coherence for topic models. This is the implementation of the four-stage topic coherence pipeline from the paper by Michael Roeder, Andreas Both and Alexander Hinneburg: “Exploring the Space of Topic Coherence Measures”. Typically, CoherenceModel is used for evaluation of topic models.
The four stages of the pipeline are:
Segmentation
Probability Estimation
Confirmation Measure
Aggregation
Implementation of this pipeline allows the user to, in essence, assemble a coherence measure of their choice by selecting a method for each stage of the pipeline.
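To make the stages concrete, here is an illustrative pure-Python sketch of a ‘u_mass’-style measure on toy data. It is not gensim’s implementation, just the four stages in miniature:

```python
from math import log

# Toy corpus: each document as a set of tokens (boolean occurrence).
docs = [
    {'human', 'computer', 'interface'},
    {'computer', 'system', 'interface'},
    {'graph', 'trees', 'minors'},
    {'graph', 'minors', 'system'},
]
topic = ['computer', 'interface', 'system']  # top words of one topic

# Stage 1 -- Segmentation: pair each word with every word ranked before it.
segments = [(topic[i], topic[j]) for i in range(1, len(topic)) for j in range(i)]

# Stage 2 -- Probability estimation: boolean document frequencies.
def doc_count(*words):
    return sum(1 for d in docs if all(w in d for w in words))

# Stage 3 -- Confirmation: smoothed log conditional probability (u_mass style).
scores = [log((doc_count(w_i, w_j) + 1) / doc_count(w_j)) for w_i, w_j in segments]

# Stage 4 -- Aggregation: arithmetic mean of the confirmation scores.
coherence = sum(scores) / len(scores)
print(round(coherence, 3))  # 0.135
```

Swapping in a different segmentation, estimator, confirmation measure or aggregation in this sketch is exactly the kind of “mix and match” the pipeline enables.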
See also
gensim.topic_coherence
Internal functions for pipelines.
-
class gensim.models.coherencemodel.CoherenceModel(model=None, topics=None, texts=None, corpus=None, dictionary=None, window_size=None, keyed_vectors=None, coherence='c_v', topn=20, processes=-1)
Bases: gensim.interfaces.TransformationABC
Objects of this class allow for building and maintaining a model for topic coherence.
Examples
One way of using this feature is by providing a trained topic model. A dictionary has to be provided explicitly if the model does not already contain one:
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.models.coherencemodel import CoherenceModel
>>>
>>> model = LdaModel(common_corpus, 5, common_dictionary)
>>>
>>> cm = CoherenceModel(model=model, corpus=common_corpus, coherence='u_mass')
>>> coherence = cm.get_coherence()  # get coherence value
Another way of using this feature is through providing tokenized topics such as:
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.coherencemodel import CoherenceModel
>>>
>>> topics = [
...     ['human', 'computer', 'system', 'interface'],
...     ['graph', 'minors', 'trees', 'eps'],
... ]
>>>
>>> cm = CoherenceModel(topics=topics, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')
>>> coherence = cm.get_coherence()  # get coherence value
- Parameters
model (BaseTopicModel, optional) – Pre-trained topic model; should be provided if topics is not provided. Currently supports LdaModel, LdaMulticore, LdaMallet and LdaVowpalWabbit. Use the topics parameter to plug in an as yet unsupported model.
topics (list of list of str, optional) – List of tokenized topics; if this is preferred over model, dictionary should be provided.
texts (list of list of str, optional) – Tokenized texts, needed for coherence models that use a sliding window based probability estimator (i.e. coherence=`c_something`).
corpus (iterable of list of (int, number), optional) – Corpus in BoW format.
dictionary (Dictionary, optional) – Gensim dictionary mapping of id to word, used to create the corpus. If model.id2word is present, this is not needed. If both are provided, the passed dictionary will be used.
window_size (int, optional) – Size of the window to be used for coherence measures using a boolean sliding window as their probability estimator. For ‘u_mass’ this doesn’t matter. If None, the default window sizes are used: ‘c_v’ – 110, ‘c_uci’ – 10, ‘c_npmi’ – 10.
coherence ({'u_mass', 'c_v', 'c_uci', 'c_npmi'}, optional) – Coherence measure to be used. Fastest method: ‘u_mass’; ‘c_uci’ is also known as c_pmi. For ‘u_mass’, corpus should be provided; if texts is provided, it will be converted to corpus using the dictionary. For ‘c_v’, ‘c_uci’ and ‘c_npmi’, texts should be provided (corpus isn’t needed).
topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
processes (int, optional) – Number of processes to use for probability estimation phase, any value less than 1 will be interpreted as num_cpus - 1.
-
aggregate_measures(topic_coherences)
Aggregate the individual topic coherence measures using the pipeline’s aggregation function. Uses self.measure.aggr(topic_coherences).
- Parameters
topic_coherences (list of float) – List of confirmation measure values calculated for each set in the segmented topics.
- Returns
Arithmetic mean of all the values contained in confirmation measures.
- Return type
float
-
compare_model_topics(model_topics)
Perform the coherence evaluation for each of the models.
- Parameters
model_topics (list of list of str) – Lists of top words, one list per model to evaluate.
- Returns
Sequence of pairs of average topic coherence and average coherence.
- Return type
list of (float, float)
Notes
This first precomputes the probabilities once, then evaluates coherence for each model. Since the probabilities are already precomputed, each evaluation simply uses the accumulated stats in the CoherenceModel, which should be quite fast.
-
compare_models(models)
Compare topic models by coherence value.
- Parameters
models (list of BaseTopicModel) – Sequence of topic models.
- Returns
Sequence of pairs of average topic coherence and average coherence.
- Return type
list of (float, float)
-
estimate_probabilities(segmented_topics=None)
Accumulate word occurrences and co-occurrences from texts or corpus using the optimal method for the chosen coherence metric.
Notes
This operation may take quite some time for the sliding window based coherence methods.
- Parameters
segmented_topics (list of list of pair, optional) – Segmented topics, typically produced by segment_topics().
- Returns
Corpus accumulator.
- Return type
CorpusAccumulator
-
classmethod for_models(models, dictionary, topn=20, **kwargs)
Initialize a CoherenceModel with estimated probabilities for all of the given models. Uses the for_topics() method.
- Parameters
models (list of BaseTopicModel) – List of models to evaluate coherence of; each of them should implement the get_topics() method.
dictionary (Dictionary) – Gensim dictionary mapping of id to word.
topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
kwargs (object) – Sequence of arguments, see for_topics().
- Returns
CoherenceModel with estimated probabilities for all of the given models.
- Return type
CoherenceModel
Example
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.ldamodel import LdaModel
>>> from gensim.models.coherencemodel import CoherenceModel
>>>
>>> m1 = LdaModel(common_corpus, 3, common_dictionary)
>>> m2 = LdaModel(common_corpus, 5, common_dictionary)
>>>
>>> cm = CoherenceModel.for_models([m1, m2], common_dictionary, corpus=common_corpus, coherence='u_mass')
-
classmethod for_topics(topics_as_topn_terms, **kwargs)
Initialize a CoherenceModel with estimated probabilities for all of the given topics.
- Parameters
topics_as_topn_terms (list of list of str) – Each element in the top-level list should be the list of topics for a model. The topics for the model should be a list of top-N words, one per topic.
- Returns
CoherenceModel with estimated probabilities for all of the given models.
- Return type
CoherenceModel
-
get_coherence()
Get coherence value based on pipeline parameters.
- Returns
Value of coherence.
- Return type
float
-
get_coherence_per_topic(segmented_topics=None, with_std=False, with_support=False)
Get a list of coherence values for each topic based on pipeline parameters.
- Parameters
segmented_topics (list of list of (int, number)) – Topics.
with_std (bool, optional) – True to also include standard deviation across topic segment sets in addition to the mean coherence for each topic.
with_support (bool, optional) – True to also include support across topic segments. The support is defined as the number of pairwise similarity comparisons that were used to compute the overall topic coherence.
- Returns
Sequence of similarity measures, one for each topic.
- Return type
list of float
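As a hypothetical illustration (the numbers below are made up, not gensim output) of what with_std and with_support add on top of the per-topic mean:

```python
from statistics import mean, pstdev

# Hypothetical confirmation scores for the segment pairs of a single topic.
segment_scores = [0.41, 0.0, 0.0]

topic_coherence = mean(segment_scores)   # returned by default
topic_std = pstdev(segment_scores)       # also returned when with_std=True
topic_support = len(segment_scores)      # also returned when with_support=True
```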
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
property measure
Build the pipeline according to the coherence parameter value.
- Returns
Pipeline that contains the functions/methods needed to calculate coherence.
- Return type
namedtuple
-
property model
Get the self._model field.
- Returns
Used model.
- Return type
BaseTopicModel
-
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately (in bytes).
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
-
segment_topics()
Segment topics, an alias for self.measure.seg(self.topics).
- Returns
Segmented topics.
- Return type
list of list of pair
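As an illustrative sketch (not gensim’s internal code), an s_one_pre-style segmentation, as used by ‘u_mass’, yields the list-of-list-of-pair shape: for each topic, each word is paired with every word ranked before it.

```python
def segment_one_pre(topics):
    """s_one_pre-style segmentation sketch: for each topic, pair every word
    with each word that precedes it in the ranked top-word list."""
    return [
        [(t[i], t[j]) for i in range(1, len(t)) for j in range(i)]
        for t in topics
    ]

pairs = segment_one_pre([['human', 'computer', 'system']])
# -> [[('computer', 'human'), ('system', 'human'), ('system', 'computer')]]
```

Other measures use different segmenters (e.g. one-against-all sets for ‘c_v’), but all produce a per-topic list of segments in the same spirit.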
-
static top_topics_as_word_lists(model, dictionary, topn=20)
Get the topn topics as lists of words.
- Parameters
model (BaseTopicModel) – Pre-trained topic model.
dictionary (Dictionary) – Gensim dictionary mapping of id to word.
topn (int, optional) – Integer corresponding to the number of top words to be extracted from each topic.
- Returns
Top topics in list-of-list-of-words format.
- Return type
list of list of str
-
property topics
Get the topics (self._topics).
- Returns
Topics as list of tokens.
- Return type
list of list of str
-
property topn
Get the number of top words (self._topn).
- Returns
Integer corresponding to the number of top words.
- Return type
int