models.wrappers.dtmmodel – Dynamic Topic Models (DTM) and Document Influence Model (DIM)
Python wrapper for Dynamic Topic Models (DTM) and the Document Influence Model (DIM).
Installation
There are two ways to obtain the DTM binaries:
Use precompiled binaries for your OS version from /magsilva/dtm/
Compile the binaries manually from /blei-lab/dtm (the original instructions are available at https://github.com/blei-lab/dtm/blob/master/README.md), or run the following commands:
git clone https://github.com/blei-lab/dtm.git
sudo apt-get install libgsl0-dev
cd dtm/dtm
make
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import DtmModel
>>>
>>> path_to_dtm_binary = "/path/to/dtm/binary"
>>> model = DtmModel(
... path_to_dtm_binary, corpus=common_corpus, id2word=common_dictionary,
... time_slices=[1] * len(common_corpus)
... )
-
class gensim.models.wrappers.dtmmodel.DtmModel(dtm_path, corpus=None, time_slices=None, mode='fit', model='dtm', num_topics=100, id2word=None, prefix=None, lda_sequence_min_iter=6, lda_sequence_max_iter=20, lda_max_em_iter=10, alpha=0.01, top_chain_var=0.005, rng_seed=0, initialize_lda=True)
Bases: gensim.utils.SaveLoad
Python wrapper using DTM implementation.
Communication between DTM and Python takes place by passing around data files on disk and executing the DTM binary as a subprocess.
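The file-and-subprocess pattern described above can be sketched roughly as follows. This is not the wrapper's actual code: `run_external_tool` is a hypothetical helper, the file names are made up, and the Python interpreter stands in for the DTM binary so that the sketch runs without DTM installed.

```python
import os
import subprocess
import sys
import tempfile


def run_external_tool(binary, input_text):
    """Write input to disk, run `binary` on it, and read the result file back.

    Hypothetical helper that illustrates the pattern only; the real wrapper
    passes DTM-specific files and command-line flags.
    """
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        raise ValueError("binary not found or not executable: %s" % binary)

    with tempfile.TemporaryDirectory() as tmp_dir:
        in_path = os.path.join(tmp_dir, "input.dat")
        out_path = os.path.join(tmp_dir, "output.dat")
        with open(in_path, "w") as fin:
            fin.write(input_text)

        # Stand-in "tool": a tiny script that upper-cases the input file.
        script = (
            "import sys\n"
            "data = open(sys.argv[1]).read()\n"
            "open(sys.argv[2], 'w').write(data.upper())\n"
        )
        subprocess.check_call([binary, "-c", script, in_path, out_path])

        with open(out_path) as fout:
            return fout.read()


result = run_external_tool(sys.executable, "topics over time")
```

Checking that the binary exists and is executable before launching it gives a clearer error than a failed subprocess call would.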
Warning
This is only a Python wrapper for the DTM implementation; you need to install the original implementation first and pass the path to its binary as dtm_path.
- Parameters
dtm_path (str) – Path to the dtm binary, e.g. /home/username/dtm/dtm/main.
corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
time_slices (list of int) – Number of documents in each time slice, in chronological order. The values must sum to the number of documents in corpus.
mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM (essentially a held-out set).
model ({'fixed', 'dtm'}, optional) – Selects the model to run: ‘fixed’ is for DIM and ‘dtm’ for DTM.
num_topics (int, optional) – Number of topics.
id2word (Dictionary, optional) – Mapping between token ids and words from the corpus; if not specified, it will be inferred from corpus.
prefix (str, optional) – Prefix for produced temporary files.
lda_sequence_min_iter (int, optional) – Minimum number of LDA iterations.
lda_sequence_max_iter (int, optional) – Maximum number of LDA iterations.
lda_max_em_iter (int, optional) – Maximum number of EM optimization iterations in LDA.
alpha (float, optional) – Hyperparameter that affects the sparsity of the document-topic distributions for the LDA models in each time slice.
top_chain_var (float, optional) – Hyperparameter that controls one of the key aspects of topic evolution: the speed at which topics evolve. A smaller top_chain_var leads to more similar word distributions across time slices.
rng_seed (int, optional) – Random seed.
initialize_lda (bool, optional) – If True - initialize DTM with LDA.
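In the example above, `time_slices=[1] * len(common_corpus)` puts each document in its own slice, so the entries act as per-slice document counts that add up to the corpus size. A minimal sketch of deriving such a list from per-document time labels, assuming the documents are already in chronological order; `build_time_slices` and `doc_years` are hypothetical names, not part of gensim:

```python
from itertools import groupby


def build_time_slices(doc_labels):
    """Count how many consecutive documents share each time label.

    Assumes `doc_labels` holds one label per document, in chronological order.
    """
    if list(doc_labels) != sorted(doc_labels):
        raise ValueError("documents must be sorted by time label")
    return [sum(1 for _ in group) for _, group in groupby(doc_labels)]


doc_years = [2019, 2019, 2019, 2020, 2021, 2021]  # hypothetical labels
time_slices = build_time_slices(doc_years)

# The slice sizes must add up to the number of documents.
assert sum(time_slices) == len(doc_years)
print(time_slices)  # [3, 1, 2]
```

The resulting list can then be passed as `time_slices` alongside a corpus whose documents follow the same order.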
-
convert_input(corpus, time_slices)
Convert corpus into LDA-C format using BleiCorpus and save it to a temporary file. The path to the temporary file is produced by ftimeslices().
- Parameters
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
time_slices (list of int) – Number of documents in each time slice, in chronological order.
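To make the on-disk inputs more concrete, the sketch below renders a toy corpus and its time slices as text. It assumes the layouts described in the DTM README (one document per line as `N id:count ...`, and a sequence file that lists the number of slices followed by one document count per line). The helper names are hypothetical and this is not the code gensim runs.

```python
def to_ldac_line(bow_doc):
    """Render one BoW document as an LDA-C line: 'N id:count id:count ...'.

    N is the number of distinct terms in the document.
    """
    pairs = ["%d:%d" % (term_id, count) for term_id, count in bow_doc]
    return " ".join([str(len(pairs))] + pairs)


def to_seq_text(time_slices):
    """Render time slices as: number of slices, then one document count per line."""
    lines = [str(len(time_slices))] + [str(n) for n in time_slices]
    return "\n".join(lines) + "\n"


corpus = [[(0, 1), (3, 2)], [(1, 1)], [(0, 2), (1, 1), (2, 1)]]  # toy BoW data
time_slices = [2, 1]  # two documents in the first slice, one in the second

ldac_text = "\n".join(to_ldac_line(doc) for doc in corpus) + "\n"
seq_text = to_seq_text(time_slices)
```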
-
dtm_coherence(time, num_words=20)
Get all topics of a particular time slice without probability values, for use with either “u_mass” or “c_v” coherence.
- Parameters
num_words (int) – Number of words.
time (int) – Timestamp
- Returns
coherence_topics – All topics of a particular time slice, without probability values.
- Return type
list of list of str
Warning
TODO: because of the print format, this can currently only return topics for the 1st time slice. Should we fix the coherence printing, or change the print statements to mirror DTM python?
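The return shape (list of list of str) is simply a topic listing with the probabilities dropped. A minimal sketch of that transformation on hand-written data; `strip_probabilities` is a hypothetical helper and the topics are invented for illustration:

```python
def strip_probabilities(topics):
    """Keep only the words from topics given as lists of (probability, word)."""
    return [[word for _, word in topic] for topic in topics]


# Hypothetical topics in the (probability, word) form returned by show_topic().
topics_at_time = [
    [(0.41, "bank"), (0.13, "loan"), (0.05, "rate")],
    [(0.30, "river"), (0.22, "water"), (0.09, "bank")],
]
coherence_topics = strip_probabilities(topics_at_time)
```

Word-only lists of this shape are what topic-coherence measures typically take as their topics input.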
-
dtm_vis(corpus, time)
Get the data in the format expected by pyLDAvis.
- Parameters
corpus (iterable of iterable of (int, float)) – Collection of texts in BoW format.
time (int) – Time slice to visualise.
Notes
All of these are needed to visualise topics for DTM for a particular time-slice via pyLDAvis.
- Returns
doc_topic (numpy.ndarray) – Document-topic proportions.
topic_term (numpy.ndarray) – Topic-term distributions in the form pyLDAvis expects.
doc_lengths (list of int) – Length of each document in the corpus.
term_frequency (numpy.ndarray) – Frequency of each word from vocab.
vocab (list of str) – List of words from the corpus.
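Two of the returned values, doc_lengths and term_frequency, are plain corpus statistics. The sketch below shows one way to compute them from a BoW corpus, assuming that document length means the total token count and term frequency means the corpus-wide count per term id. It uses plain lists rather than numpy arrays and is not gensim's implementation; the helper name is hypothetical.

```python
def doc_lengths_and_term_frequency(corpus, vocab_size):
    """Compute per-document token counts and per-term totals from a BoW corpus."""
    doc_lengths = [sum(count for _, count in doc) for doc in corpus]
    term_frequency = [0] * vocab_size
    for doc in corpus:
        for term_id, count in doc:
            term_frequency[term_id] += count
    return doc_lengths, term_frequency


corpus = [[(0, 1), (2, 2)], [(1, 1)], [(0, 2), (1, 1), (2, 1)]]  # toy BoW data
doc_lengths, term_frequency = doc_lengths_and_term_frequency(corpus, vocab_size=3)

# Both views count the same tokens, so their totals agree.
assert sum(doc_lengths) == sum(term_frequency)
```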
-
fcorpus()
Get path to corpus file.
- Returns
Path to corpus file.
- Return type
str
-
fcorpustxt()
Get path to temporary file.
- Returns
Path to multiple train binary file.
- Return type
str
-
fem_steps()
Get path to temporary em_step data file.
- Returns
Path to em_step data file.
- Return type
str
-
finit_alpha()
Get path to initially trained lda alpha file.
- Returns
Path to initially trained lda alpha file.
- Return type
str
-
finit_beta()
Get path to initially trained lda beta file.
- Returns
Path to initially trained lda beta file.
- Return type
str
-
flda_ss()
Get path to initial lda binary file.
- Returns
Path to initial lda binary file.
- Return type
str
-
fout_gamma()
Get path to temporary gamma data file.
- Returns
Path to gamma data file.
- Return type
str
-
fout_influence()
Get template of path to temporary file.
- Returns
Path to file.
- Return type
str
-
fout_liklihoods()
Get path to temporary lhood data file.
- Returns
Path to lhood data file.
- Return type
str
-
fout_observations()
Get template of path to temporary file.
- Returns
Path to file.
- Return type
str
-
fout_prob()
Get template of path to temporary file.
- Returns
Path to file.
- Return type
str
-
foutname()
Get path to temporary file.
- Returns
Path to file.
- Return type
str
-
ftimeslices()
Get path to time slices binary file.
- Returns
Path to time slices binary file.
- Return type
str
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
print_topic(topicid, time, topn=10, num_words=None)
Get the given topic, formatted as a string.
- Parameters
topicid (int) – Id of topic.
time (int) – Timestamp.
topn (int, optional) – Number of top words to return for the topic.
num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
- Returns
The given topic in string format, like ‘0.132*someword + 0.412*otherword + …’.
- Return type
str
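The string form shown above can be reproduced from (probability, word) pairs with a short join. This is a sketch only: `format_topic` is a hypothetical helper, and the three-decimal precision is an assumption based on the example string rather than a documented guarantee.

```python
def format_topic(weighted_words):
    """Join (probability, word) pairs into a 'p*word + p*word' string."""
    return " + ".join("%.3f*%s" % (prob, word) for prob, word in weighted_words)


# Hypothetical (probability, word) pairs, in the form show_topic() returns.
topic = [(0.132, "someword"), (0.412, "otherword")]
formatted = format_topic(topic)
print(formatted)  # 0.132*someword + 0.412*otherword
```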
-
print_topics(num_topics=10, times=5, num_words=10)
Alias for show_topics().
- Parameters
num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
times (int, optional) – Number of time slices.
num_words (int, optional) – Number of words.
- Returns
Topics as a list of strings
- Return type
list of str
-
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
-
show_topic(topicid, time, topn=50, num_words=None)
Get the topn most probable words for the given topicid.
- Parameters
topicid (int) – Id of topic.
time (int) – Timestamp.
topn (int, optional) – Number of top words to return for the topic.
num_words (int, optional) – DEPRECATED PARAMETER, use topn instead.
- Returns
Sequence of probable words, as a list of (word_probability, word).
- Return type
list of (float, str)
-
show_topics(num_topics=10, times=5, num_words=10, log=False, formatted=True)
Get the num_words most probable words for num_topics topics at ‘times’ time slices.
- Parameters
num_topics (int, optional) – Number of topics to return, set -1 to get all topics.
times (int, optional) – Number of time slices.
num_words (int, optional) – Number of words.
log (bool, optional) – THIS PARAMETER WILL BE IGNORED.
formatted (bool, optional) – If True - return the topics as a list of strings, otherwise as lists of (weight, word) pairs.
- Returns
list of str – Topics as a list of strings (if formatted=True) OR
list of (float, str) – Topics as list of (weight, word) pairs (if formatted=False)
-
train(corpus, time_slices, mode, model)
Train DTM model.
- Parameters
corpus (iterable of iterable of (int, int)) – Collection of texts in BoW format.
time_slices (list of int) – Number of documents in each time slice, in chronological order.
mode ({'fit', 'time'}, optional) – Controls the mode of the model: ‘fit’ is for training, ‘time’ is for analyzing documents through time according to a DTM (essentially a held-out set).
model ({'fixed', 'dtm'}, optional) – Selects the model to run: ‘fixed’ is for DIM and ‘dtm’ for DTM.