models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Document Influence Model (DIM)

Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM).

Installation

There are two ways to obtain the DTM binaries:

  1. Use precompiled binaries for your OS version from /magsilva/dtm/

  2. Compile the binaries manually from /blei-lab/dtm (the original instructions are available at https://github.com/blei-lab/dtm/blob/master/README.md), or run the following commands:

    git clone https://github.com/blei-lab/dtm.git
    sudo apt-get install libgsl0-dev
    cd dtm/dtm
    make
    

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import DtmModel
>>>
>>> path_to_dtm_binary = "/path/to/dtm/binary"
>>> model = DtmModel(
...     path_to_dtm_binary, corpus=common_corpus, id2word=common_dictionary,
...     time_slices=[1] * len(common_corpus)
... )
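
The example above puts every document in its own time slice. For a dated corpus, one way to build time_slices is to sort the documents chronologically and count how many fall into each period. The sketch below assumes that each entry of time_slices is the number of documents in one slice, which is the layout the DTM input format uses. The helper name build_time_slices and the sample years are hypothetical.

```python
from collections import Counter


def build_time_slices(doc_periods):
    """Count documents per period, in chronological order.

    `doc_periods` holds one sortable period label (e.g. a year) per document.
    The documents themselves must be sorted by the same labels before they are
    passed as `corpus`, so that the slices line up with the document order.
    """
    counts = Counter(doc_periods)
    return [counts[period] for period in sorted(counts)]


# Hypothetical publication years for six documents, already sorted.
years = [2010, 2010, 2011, 2012, 2012, 2012]
time_slices = build_time_slices(years)
print(time_slices)  # [2, 1, 3]

# Sanity check: the counts should add up to the number of documents.
assert sum(time_slices) == len(years)
```
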

class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)

Bases: gensim.utils.SaveLoad

Python wrapper around the DTM implementation.

Communication between DTM and Python takes place by passing around data files on disk and executing the DTM binary as a subprocess.
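
The file-and-subprocess pattern described above can be sketched with the standard library alone. This is not the wrapper's code: the Python interpreter stands in for the DTM binary, and the file names and the run_external helper are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_external(numbers):
    """Write input to disk, run a child process on it, read its output file."""
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input.dat")
        out_path = os.path.join(workdir, "output.dat")

        # 1. Serialize the input where the external program can read it.
        with open(in_path, "w") as fin:
            fin.write("\n".join(str(n) for n in numbers))

        # 2. Run the external program. A small Python script that doubles
        #    every number stands in for the real binary here.
        script = (
            "import sys\n"
            "src, dst = sys.argv[1:3]\n"
            "rows = [int(line) * 2 for line in open(src)]\n"
            "open(dst, 'w').write('\\n'.join(map(str, rows)))\n"
        )
        subprocess.check_call([sys.executable, "-c", script, in_path, out_path])

        # 3. Parse the result file back into Python objects.
        with open(out_path) as fout:
            return [int(line) for line in fout]


print(run_external([1, 2, 3]))  # [2, 4, 6]
```

Because all data goes through files, the prefix argument of the wrapper decides where those files are written.
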

Warning

This is only a Python wrapper for the DTM implementation. You need to install the original implementation first and pass the path to its binary as dtm_path.

Parameters
  • dtm_path (str) – Path to the dtm binary, e.g. /home/username/dtm/dtm/main.

  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.

  • time_slices (list of int) – Number of documents in each time slice, in chronological order.

  • mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM, basically a held-out set.

  • model ({'fixed', 'dtm'}, optional) – Controls which model will be run: ‘fixed’ is for DIM and ‘dtm’ for DTM.

  • num_topics (int, optional) – Number of topics.

  • id2word (Dictionary, optional) – Mapping between token ids and words from the corpus. If not specified, it will be inferred from the corpus.

  • prefix (str, optional) – Prefix for produced temporary files.

  • lda_sequence_min_iter (int, optional) – Minimum number of LDA iterations.

  • lda_sequence_max_iter (int, optional) – Maximum number of LDA iterations.

  • lda_max_em_iter (int, optional) – Maximum number of EM optimization iterations in LDA.

  • alpha (float, optional) – Hyperparameter that affects the sparsity of the document-topic distributions for the LDA models in each time slice.

  • top_chain_var (float, optional) – Controls how quickly topics evolve over time. A smaller top_chain_var leads to more similar word distributions across time slices.

  • rng_seed (int, optional) – Random seed.

  • initialize_lda (bool, optional) – If True, initialize DTM with LDA.
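
To build intuition for top_chain_var: in the DTM formulation, each topic's natural parameters follow a Gaussian random walk from one time slice to the next, and the word distribution at each slice is the softmax of those parameters. The simulation below is an illustrative sketch of that idea, under the assumption that top_chain_var plays the role of the random-walk variance. It does not call the DTM binary, and all function names are hypothetical.

```python
import math
import random


def softmax(values):
    """Map real-valued parameters to a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_topic_chain(variance, num_slices=10, vocab_size=5, seed=0):
    """Random-walk one topic's parameters across time slices."""
    rng = random.Random(seed)
    beta = [0.0] * vocab_size  # start from a uniform word distribution
    chain = [list(beta)]
    for _ in range(num_slices - 1):
        # Each step adds Gaussian noise with the given variance.
        beta = [b + rng.gauss(0.0, math.sqrt(variance)) for b in beta]
        chain.append(list(beta))
    return chain


def total_drift(chain):
    """Largest change of any parameter between the first and last slice."""
    return max(abs(last - first) for first, last in zip(chain[0], chain[-1]))


slow = simulate_topic_chain(variance=0.005)
fast = simulate_topic_chain(variance=0.5)
print(total_drift(slow) < total_drift(fast))  # True
print(softmax(slow[-1]))  # word distribution at the last slice
```

With the same seed, the two runs differ only in the scale of the noise, so the larger variance moves the topic further from its starting point.
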

convert_input(corpus, time_slices)

Convert the corpus into LDA-C format using BleiCorpus and save it to a temporary file. The time slices are written to the temporary file whose path is returned by ftimeslices().

Parameters
  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • time_slices (list of int) – Number of documents in each time slice, in chronological order.
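
For orientation, the LDA-C text format stores one document per line as the number of distinct terms followed by term_id:count pairs, and the time-slice file read by the DTM binary lists the number of slices followed by one document count per line. The helpers below are a minimal sketch of those two layouts, not the code path used by BleiCorpus; their names are hypothetical.

```python
def to_ldac_lines(corpus):
    """Render BoW documents as LDA-C lines: '<n_terms> id:count id:count ...'."""
    lines = []
    for doc in corpus:
        pairs = " ".join("%d:%g" % (term_id, count) for term_id, count in doc)
        lines.append(("%d %s" % (len(doc), pairs)).strip())
    return lines


def to_time_slice_lines(time_slices):
    """Render time slices: slice count first, then documents per slice."""
    return [str(len(time_slices))] + [str(n) for n in time_slices]


corpus = [[(0, 1), (2, 3)], [(1, 2)]]
print(to_ldac_lines(corpus))        # ['2 0:1 2:3', '1 1:2']
print(to_time_slice_lines([1, 1]))  # ['2', '1', '1']
```
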

dtm_coherence(time, num_words=20)

Get all topics of a particular time slice without probability values, so that they can be used for either “u_mass” or “c_v” coherence.

Parameters
  • num_words (int) – Number of words.

  • time (int) – Timestamp.

Returns

coherence_topics – All topics of a particular time slice, without probability values.

Return type

list of list of str

Warning

TODO: because of the print format, this can currently return topics only for the first time slice. Should we fix the coherence printing, or change the print statements to mirror DTM Python?
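
The shape of the return value can be illustrated without a trained model: starting from topics given as (probability, word) pairs, dropping the probabilities leaves the list of list of str that coherence measures consume. The sample topics and the strip_probabilities helper below are made up for illustration.

```python
def strip_probabilities(topics, num_words=20):
    """Keep only the top `num_words` words of each topic, without weights."""
    return [[word for _, word in topic[:num_words]] for topic in topics]


# Hypothetical topics for one time slice, sorted by decreasing probability.
topics = [
    [(0.40, "graph"), (0.35, "trees"), (0.25, "minors")],
    [(0.50, "system"), (0.30, "user"), (0.20, "interface")],
]
print(strip_probabilities(topics, num_words=2))
# [['graph', 'trees'], ['system', 'user']]
```
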

dtm_vis(corpus, time)

Get the data in the format required by pyLDAvis.

Parameters
  • corpus (iterable of iterable of (int, float)) – Collection of texts in BoW format.

  • time (int) – Timestamp of the time slice to visualise.

Notes

All of the returned values are needed to visualise the topics of a DTM for a particular time slice via pyLDAvis.

Returns

  • doc_topic (numpy.ndarray) – Document-topic proportions.

  • topic_term (numpy.ndarray) – Topic-term values in the format expected by pyLDAvis.

  • doc_lengths (list of int) – Length of each document in the corpus.

  • term_frequency (numpy.ndarray) – Frequency of each word from the vocabulary.

  • vocab (list of str) – List of words from the corpus.
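
Two of the return values, term_frequency and vocab, depend only on the corpus and the dictionary, so their meaning can be shown in plain Python. This is a sketch of what the values represent, assuming a dict-like id2word mapping with consecutive ids starting at 0. It is not the wrapper's implementation, and corpus_statistics is a hypothetical name.

```python
def corpus_statistics(corpus, id2word):
    """Return (term_frequency, vocab) for a BoW corpus.

    term_frequency[i] is the total count of term i over all documents;
    vocab[i] is the word for term i.
    """
    term_frequency = [0.0] * len(id2word)
    for doc in corpus:
        for term_id, count in doc:
            term_frequency[term_id] += count
    vocab = [id2word[i] for i in range(len(id2word))]
    return term_frequency, vocab


id2word = {0: "human", 1: "computer", 2: "system"}  # made-up dictionary
corpus = [[(0, 1), (2, 2)], [(1, 1), (2, 1)]]
term_frequency, vocab = corpus_statistics(corpus, id2word)
print(term_frequency)  # [1.0, 1.0, 3.0]
print(vocab)           # ['human', 'computer', 'system']
```
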

fcorpus()

Get path to corpus file.

Returns

Path to corpus file.

Return type

str

fcorpustxt()

Get path to temporary file.

Returns

Path to multiple train binary file.

Return type

str

fem_steps()

Get path to temporary em_step data file.

Returns

Path to em_step data file.

Return type

str

finit_alpha()

Get path to initially trained lda alpha file.

Returns

Path to initially trained lda alpha file.

Return type

str

finit_beta()

Get path to initially trained lda beta file.

Returns

Path to initially trained lda beta file.

Return type

str

flda_ss()

Get path to initial lda binary file.

Returns

Path to initial lda binary file.

Return type

str

fout_gamma()

Get path to temporary gamma data file.

Returns

Path to gamma data file.

Return type

str

fout_influence()

Get template of path to temporary file.

Returns

Path to file.

Return type

str

fout_liklihoods()

Get path to temporary lhood data file.

Returns

Path to lhood data file.

Return type

str

fout_observations()

Get template of path to temporary file.

Returns

Path to file.

Return type

str

fout_prob()

Get template of path to temporary file.

Returns

Path to file.

Return type

str

foutname()

Get path to temporary file.

Returns

Path to file.

Return type

str

ftimeslices()

Get path to time slices binary file.

Returns

Path to time slices binary file.

Return type

str

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

print_topic(topicid, time, topn=10, num_words=None)

Get the given topic, formatted as a string.

Parameters
  • topicid (int) – Id of topic.

  • time (int) – Timestamp.

  • topn (int, optional) – Number of top words to return for the topic.

  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.

Returns

The given topic in string format, like ‘0.132*someword + 0.412*otherword + …’.

Return type

str
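
The string layout shown above can be reproduced from (probability, word) pairs with a one-line join. The sketch below assumes three decimal places, matching the example string; format_topic and the sample values are hypothetical.

```python
def format_topic(pairs):
    """Join (probability, word) pairs into a 'p*word + p*word' string."""
    return " + ".join("%.3f*%s" % (prob, word) for prob, word in pairs)


pairs = [(0.132, "someword"), (0.412, "otherword")]  # made-up values
print(format_topic(pairs))  # 0.132*someword + 0.412*otherword
```
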

print_topics(num_topics=10, times=5, num_words=10)

Alias for show_topics().

Parameters
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.

  • times (int, optional) – Number of time slices.

  • num_words (int, optional) – Number of words.

Returns

Topics as a list of strings

Return type

list of str

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

show_topic(topicid, time, topn=50, num_words=None)

Get the topn most probable words for the given topicid at the given time.

Parameters
  • topicid (int) – Id of topic.

  • time (int) – Timestamp.

  • topn (int, optional) – Number of top words to return for the topic.

  • num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.

Returns

Sequence of probable words, as a list of (word_probability, word).

Return type

list of (float, str)
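
Selecting the most probable words amounts to sorting a topic's word probabilities and keeping the first topn entries. The sketch below shows this on a made-up distribution; top_words, id2word, and word_probs are hypothetical stand-ins and do not read any state from a trained model.

```python
def top_words(word_probs, id2word, topn=50):
    """Return the `topn` most probable words as (probability, word) pairs."""
    ranked = sorted(range(len(word_probs)), key=lambda i: word_probs[i], reverse=True)
    return [(word_probs[i], id2word[i]) for i in ranked[:topn]]


id2word = {0: "human", 1: "computer", 2: "system"}  # made-up dictionary
word_probs = [0.2, 0.5, 0.3]  # made-up distribution of one topic at one time
print(top_words(word_probs, id2word, topn=2))
# [(0.5, 'computer'), (0.3, 'system')]
```
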

show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)

Get the num_words most probable words for num_topics topics over times time slices.

Parameters
  • num_topics (int, optional) – Number of topics to return, set -1 to get all topics.

  • times (int, optional) – Number of time slices.

  • num_words (int, optional) – Number of words.

  • log (bool, optional) – THIS PARAMETER WILL BE IGNORED.

  • formatted (bool, optional) – If True - return the topics as a list of strings, otherwise as lists of (weight, word) pairs.

Returns

  • list of str – Topics as a list of strings (if formatted=True) OR

  • list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)

train(corpus, time_slices, mode, model)

Train DTM model.

Parameters
  • corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.

  • time_slices (list of int) – Number of documents in each time slice, in chronological order.

  • mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM, basically a held-out set.

  • model ({'fixed', 'dtm'}, optional) – Controls which model will be run: ‘fixed’ is for DIM and ‘dtm’ for DTM.