models.wrappers.ldavowpalwabbit – Latent Dirichlet Allocation via Vowpal Wabbit
Python wrapper for Vowpal Wabbit’s Latent Dirichlet Allocation. This uses Matt Hoffman’s online algorithm, i.e. the same algorithm that Gensim’s LdaModel is based on.
Installation
Use the official guide, or the steps below:
git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make
make test
sudo make install
Warning
Currently working and tested with Vowpal Wabbit versions 7.10 to 8.1.1. Vowpal Wabbit’s API isn’t currently stable, so this wrapper may or may not work with older/newer versions. The aim is to ensure it always works with the latest release of Vowpal Wabbit.
Examples
Train model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.models.wrappers import LdaVowpalWabbit
>>>
>>> path_to_vw_binary = "/path/to/vw/binary"
>>> model = LdaVowpalWabbit(path_to_vw_binary, corpus=common_corpus, num_topics=20, id2word=common_dictionary)
Update existing model
>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> model.update(another_corpus)
Get topic probability distributions for a document
>>> document_bow = [(1, 1)]
>>> print(model[document_bow])
Print topics
>>> print(model.print_topics())
Save/load the trained model
>>> from gensim.test.utils import get_tmpfile
>>>
>>> temp_path = get_tmpfile("vw_lda.model")
>>> model.save(temp_path)
>>>
>>> loaded_lda = LdaVowpalWabbit.load(temp_path)
Calculate log-perplexity on a given corpus
>>> another_corpus = [[(1, 1), (2, 1)], [(3, 5)]]
>>> print(model.log_perplexity(another_corpus))
Vowpal Wabbit works on files, so this wrapper maintains a temporary directory while it’s around, reading/writing there as necessary.
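For example, to keep those intermediate files around for inspection, construction can disable cleanup via the cleanup_files parameter documented below (a sketch reusing the training example above; the binary path is a placeholder):

>>> debug_model = LdaVowpalWabbit(path_to_vw_binary, corpus=common_corpus, num_topics=20,
...                               id2word=common_dictionary, cleanup_files=False, tmp_prefix='vw_debug')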
class gensim.models.wrappers.ldavowpalwabbit.LdaVowpalWabbit(vw_path, corpus=None, num_topics=100, id2word=None, chunksize=256, passes=1, alpha=0.1, eta=0.1, decay=0.5, offset=1, gamma_threshold=0.001, random_seed=None, cleanup_files=True, tmp_prefix='tmp')
Bases: gensim.utils.SaveLoad
Python wrapper using Vowpal Wabbit’s online LDA.
Communication between Vowpal Wabbit and Python takes place by passing around data files on disk and calling the ‘vw’ binary with the subprocess module.
Warning
This is only a Python wrapper for Vowpal Wabbit’s online LDA; you need to install the original implementation first and pass the path to its binary via vw_path.

Parameters
vw_path (str) – Path to Vowpal Wabbit’s binary.
corpus (iterable of list of (int, int), optional) – Collection of texts in BoW format. If given, training starts immediately; otherwise, call train() or update() manually.
num_topics (int, optional) – Number of requested latent topics to be extracted from the training corpus. Corresponds to VW’s --lda <num_topics> argument.
id2word (Dictionary, optional) – Mapping from word ids (integers) to words (strings).
chunksize (int, optional) – Number of documents examined in each batch. Corresponds to VW’s --minibatch <batch_size> argument.
passes (int, optional) – Number of passes over the dataset. Corresponds to VW’s --passes <passes> argument.
alpha (float, optional) – Affects the sparsity of per-document topic weights. Applied symmetrically; set it higher when documents are thought to look more similar. Corresponds to VW’s --lda_alpha <alpha> argument.
eta (float, optional) – Affects the sparsity of topic distributions. Applied symmetrically; set it higher when topics are thought to look more similar. Corresponds to VW’s --lda_rho <rho> argument.
decay (float, optional) – Learning-rate decay; affects how quickly learnt values are forgotten. Should be set between 0.5 and 1.0 to guarantee convergence. Corresponds to VW’s --power_t <tau> argument.
offset (int, optional) – Learning offset; set it higher to slow down learning in the early iterations of the algorithm. Corresponds to VW’s --initial_t <tau> argument.
gamma_threshold (float, optional) – Controls when the learning loop is broken out of; higher values result in earlier loop completion. Corresponds to VW’s --epsilon <eps> argument.
random_seed (int, optional) – Random seed used when learning. Corresponds to VW’s --random_seed <seed> argument.
cleanup_files (bool, optional) – Whether to delete the temporary directory and files used by this wrapper. Setting it to False can be useful for debugging, or for re-using the Vowpal Wabbit files elsewhere.
tmp_prefix (str, optional) – Prefix for the temporary working directory name.
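As a hedged sketch, here is how several of these parameters map onto the VW flags named above (the binary path is a placeholder and the values are illustrative only):

>>> model = LdaVowpalWabbit(
...     "/path/to/vw/binary",       # vw_path: the 'vw' binary called via subprocess
...     corpus=common_corpus,       # training starts immediately when a corpus is given
...     num_topics=20,              # --lda 20
...     id2word=common_dictionary,
...     chunksize=256,              # --minibatch 256
...     passes=2,                   # --passes 2
...     alpha=0.1,                  # --lda_alpha 0.1
...     eta=0.1,                    # --lda_rho 0.1
...     random_seed=42,             # --random_seed 42
... )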
get_topics()
Get topics X words matrix.

Returns
num_topics x vocabulary_size array of floats which represents the learned term-topic matrix.
Return type
numpy.ndarray
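A quick shape sanity-check on the returned matrix (a sketch; assumes the model trained in the examples above):

>>> topic_word = model.get_topics()
>>> print(topic_word.shape)  # (num_topics, vocabulary_size), e.g. (20, 12) for the model above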
classmethod load(fname, *args, **kwargs)
Load model from fname.

Parameters
fname (str) – Path to file with LdaVowpalWabbit.
log_perplexity(chunk)
Get per-word lower bound on log perplexity.

Parameters
chunk (iterable of list of (int, int)) – Collection of texts in BoW format.
Returns
bound – Per-word lower bound on log perplexity.
Return type
float
print_topic(topicid, topn=10)
Get text representation of topic.

Parameters
topicid (int) – Id of topic.
topn (int, optional) – Number of top words in the topic.
Returns
Topic topicid in text representation.
Return type
str
print_topics(num_topics=10, num_words=10)
Alias for show_topics().

Parameters
num_topics (int, optional) – Number of topics to return; set -1 to get all topics.
num_words (int, optional) – Number of words per topic.
Returns
Topics as a list of strings.
Return type
list of str
save(fname, *args, **kwargs)
Save model to file.

Parameters
fname (str) – Path to output file.
show_topic(topicid, topn=10)
Get the topn most probable words for the given topicid.

Parameters
topicid (int) – Id of topic.
topn (int, optional) – Number of most probable words to return.
Returns
Sequence of probable words for topic topicid, as a list of (word, word_probability).
Return type
list of (str, float)
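A sketch of inspecting a single topic’s word distribution (assumes the trained model above):

>>> for word, prob in model.show_topic(topicid=0, topn=5):
...     print(word, prob)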
show_topics(num_topics=10, num_words=10, log=False, formatted=True)
Get the num_words most probable words for num_topics number of topics.

Parameters
num_topics (int, optional) – Number of topics to return; set -1 to get all topics.
num_words (int, optional) – Number of words per topic.
log (bool, optional) – If True, also write the topics to the logger.
formatted (bool, optional) – If True, return the topics as a list of strings; otherwise, as lists of (weight, word) pairs.
Returns
list of str – Topics as a list of strings (if formatted=True), OR
list of (float, str) – Topics as lists of (weight, word) pairs (if formatted=False).
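For the unformatted variant, each topic comes back as pairs rather than a single string (a sketch):

>>> for topic in model.show_topics(num_topics=2, num_words=3, formatted=False):
...     print(topic)  # one list of pairs per topic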
train(corpus)
Clear any existing model state, and train on the given corpus.

Parameters
corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
update(corpus)
Update the existing model with corpus.

Parameters
corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
gensim.models.wrappers.ldavowpalwabbit.corpus_to_vw(corpus)
Convert corpus to Vowpal Wabbit format.

Parameters
corpus (iterable of list of (int, int)) – Collection of texts in BoW format.

Notes
Vowpal Wabbit format, one document per line:
| 4:7 14:1 22:8 6:3
| 14:22 22:4 0:1 1:3
| 7:2 8:2
Yields
str – Corpus in Vowpal Wabbit format, line by line.
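A sketch of streaming a BoW corpus through this converter:

>>> from gensim.test.utils import common_corpus
>>> from gensim.models.wrappers.ldavowpalwabbit import corpus_to_vw
>>>
>>> for line in corpus_to_vw(common_corpus):
...     print(line)  # e.g. "| 0:1 1:1 2:1" for the first document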
gensim.models.wrappers.ldavowpalwabbit.vwmodel2ldamodel(vw_model, iterations=50)
Convert LdaVowpalWabbit to LdaModel.
This works by simply copying the trained model weights (alpha, beta…) from a trained vw_model into the gensim model.

Parameters
vw_model (LdaVowpalWabbit) – Trained Vowpal Wabbit model.
iterations (int) – Number of iterations to be used for inference of the new LdaModel.
Returns
Gensim native LDA.
Return type
LdaModel
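A sketch of converting the trained wrapper into a native model and querying it through the same BoW interface:

>>> from gensim.models.wrappers.ldavowpalwabbit import vwmodel2ldamodel
>>>
>>> native_lda = vwmodel2ldamodel(model, iterations=50)
>>> print(native_lda[document_bow])  # behaves like any native LdaModel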
gensim.models.wrappers.ldavowpalwabbit.
write_corpus_as_vw
(corpus, filename)¶ Covert corpus to Vowpal Wabbit format and save it to filename.
- Parameters
corpus (iterable of list of (int, int)) – Collection of texts in BoW format.
filename (str) – Path to output file.
- Returns
Number of lines in filename.
- Return type
int
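A sketch that writes the test corpus to a temporary file and checks the line count (one line per document):

>>> from gensim.test.utils import common_corpus, get_tmpfile
>>> from gensim.models.wrappers.ldavowpalwabbit import write_corpus_as_vw
>>>
>>> vw_file = get_tmpfile("corpus.vw")
>>> num_lines = write_corpus_as_vw(common_corpus, vw_file)
>>> print(num_lines)  # e.g. 9 for common_corpus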