models.lsimodel – Latent Semantic Indexing
Module for Latent Semantic Analysis (aka Latent Semantic Indexing).
Implements fast truncated SVD (Singular Value Decomposition). The decomposition can be updated with new observations at any time, for online, incremental, memory-efficient training.
This module actually contains several algorithms for decomposition of large corpora, a combination of which effectively and transparently allows building LSI models for:
corpora much larger than RAM: only constant memory is needed, independent of the corpus size
corpora that are streamed: documents are only accessed sequentially, no random access
corpora that cannot be even temporarily stored: each document can only be seen once and must be processed immediately (one-pass algorithm)
distributed computing for very large corpora, making use of a cluster of machines
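The streaming requirement above means a corpus only has to support sequential iteration. A minimal sketch of such a corpus follows, assuming a hypothetical StreamedCorpus class and token2id mapping (neither is part of gensim); it yields one bag-of-words document per input line and never holds the whole input in memory:

```python
import io
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus that yields one BoW document per input line."""

    def __init__(self, open_source, token2id):
        # open_source is a callable returning a fresh file-like object, so the
        # corpus can be iterated more than once (needed by multi-pass algorithms).
        self.open_source = open_source
        self.token2id = token2id

    def __iter__(self):
        with self.open_source() as lines:
            for line in lines:
                counts = Counter(
                    self.token2id[tok] for tok in line.split() if tok in self.token2id
                )
                # BoW format: sorted list of (term_id, weight) pairs.
                yield sorted((term_id, float(n)) for term_id, n in counts.items())


token2id = {"human": 0, "computer": 1, "interface": 2}
text = "human computer interface\ncomputer computer unknown\n"
corpus = StreamedCorpus(lambda: io.StringIO(text), token2id)
docs = list(corpus)
```

In practice open_source would open a file on disk; an in-memory buffer is used here only to keep the sketch self-contained.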
Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:
algorithm | serial | distributed |
---|---|---|
one-pass merge algorithm | 5h14m | 1h41m |
multi-pass stochastic algo (with 2 power iterations) | 5h39m | N/A [1] |
serial = Core 2 Duo MacBook Pro 2.53GHz, 4GB RAM, libVec
distributed = cluster of four logical nodes on three physical machines, each with dual core Xeon 2.0GHz, 4GB RAM, ATLAS
Examples
>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus, id2word=common_dictionary)
>>> vectorized_corpus = model[common_corpus]  # vectorize input corpus in BoW format
[1] The stochastic algo could be distributed too, but most of the time is already spent reading/decompressing the input from disk in its 4 passes. The extra network traffic due to data distribution across cluster nodes would likely make it slower.
class gensim.models.lsimodel.LsiModel(corpus=None, num_topics=200, id2word=None, chunksize=20000, decay=1.0, distributed=False, onepass=True, power_iters=2, extra_samples=100, dtype=<class 'numpy.float64'>)
Bases: gensim.interfaces.TransformationABC, gensim.models.basemodel.BaseTopicModel
Model for Latent Semantic Indexing.
The decomposition algorithm is described in “Fast and Faster: A Comparison of Two Streamed Matrix Decomposition Algorithms”.
Notes
gensim.models.lsimodel.LsiModel.projection.u - left singular vectors,
gensim.models.lsimodel.LsiModel.projection.s - singular values,
model[training_corpus] - right singular vectors (can be reconstructed if needed).
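The reconstruction of the right singular vectors can be illustrated with plain NumPy. This is only a sketch: it assumes that model[training_corpus] corresponds to the unscaled projection U^T x of each document, and it uses a small dense matrix and numpy.linalg.svd as stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))          # term-document matrix: 6 terms, 4 documents

u, s, vt = np.linalg.svd(x, full_matrices=False)

# Stand-in for model[training_corpus]: each document projected onto the topics.
projected = u.T @ x                      # shape (num_topics, num_documents)

# Undo the scaling by the singular values to recover the right singular vectors.
v = projected.T / s                      # shape (num_documents, num_topics)
```

Because x = U S V^T, the projection U^T x equals S V^T, so dividing each topic coordinate by its singular value yields V.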
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary, get_tmpfile
>>> from gensim.models import LsiModel
>>>
>>> model = LsiModel(common_corpus[:3], id2word=common_dictionary)  # train model
>>> vector = model[common_corpus[4]]  # apply model to BoW document
>>> model.add_documents(common_corpus[4:])  # update model with new documents
>>> tmp_fname = get_tmpfile("lsi.model")
>>> model.save(tmp_fname)  # save model
>>> loaded_model = LsiModel.load(tmp_fname)  # load model
Construct an LsiModel object.
Either corpus or id2word must be supplied in order to train the model.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}, optional) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
num_topics (int, optional) – Number of requested factors (latent dimensions).
id2word (dict of {int: str}, optional) – ID to word mapping.
chunksize (int, optional) – Number of documents to be used in each training chunk.
decay (float, optional) – Weight of existing observations relative to new ones.
distributed (bool, optional) – If True - distributed mode (parallel execution on several machines) will be used.
onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but increases training time.
extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
dtype (type, optional) – Enforces a type for elements of the decomposed matrix.
add_documents(corpus, chunksize=None, decay=None)
Update model with new corpus.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse.csc}) – Stream of document vectors or sparse matrix of shape (num_terms, num_documents).
chunksize (int, optional) – Number of documents to be used in each training chunk; self.chunksize is used if not specified.
decay (float, optional) – Weight of existing observations relative to new ones; self.decay is used if not specified.
Notes
Training proceeds in chunks of chunksize documents at a time. The size of chunksize is a tradeoff between increased speed (bigger chunksize) vs. lower memory footprint (smaller chunksize). If the distributed mode is on, each chunk is sent to a different worker/computer.
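Chunked processing of a one-shot stream can be sketched with the standard library alone. The iter_chunks helper below is hypothetical (it is not a gensim function); it shows how at most chunksize documents are held in memory at a time, which is the memory side of the tradeoff described above.

```python
from itertools import islice


def iter_chunks(corpus, chunksize):
    """Yield lists of at most `chunksize` documents from any iterable corpus."""
    it = iter(corpus)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


stream = ([(0, 1.0)] for _ in range(5))  # five tiny BoW documents, seen only once
chunk_sizes = [len(chunk) for chunk in iter_chunks(stream, chunksize=2)]
```

The last chunk may be shorter than chunksize when the corpus size is not a multiple of it.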
get_topics()
Get the topic vectors.
Notes
The number of topics can actually be smaller than self.num_topics, if there were not enough factors in the matrix (real rank of input matrix smaller than self.num_topics).
- Returns
The term topic matrix with shape (num_topics, vocabulary_size)
- Return type
np.ndarray
classmethod load(fname, *args, **kwargs)
Load a previously saved object using save() from file.
Notes
Large arrays can be memmap'ed back as read-only (shared memory) by setting the mmap='r' parameter.
- Parameters
fname (str) – Path to file that contains LsiModel.
*args – Variable length argument list, see gensim.utils.SaveLoad.load().
**kwargs – Arbitrary keyword arguments, see gensim.utils.SaveLoad.load().
- Returns
Loaded instance.
- Return type
LsiModel
- Raises
IOError – When the method is called on an instance (it should be called on the class).
print_debug(num_topics=5, num_words=10)
Print (to log) the most salient words of the first num_topics topics.
Unlike print_topics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
Alias for gensim.models.lsimodel.print_debug().
- Parameters
num_topics (int, optional) – The number of topics to be selected (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
print_topic(topicno, topn=10)
Get a single topic as a formatted string.
- Parameters
topicno (int) – Topic id.
topn (int) – Number of words from topic that will be used.
- Returns
String representation of topic, like ‘-0.340 * “category” + 0.298 * “$M$” + 0.183 * “algebra” + … ‘.
- Return type
str
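The string layout described above can be sketched as follows. The format_topic helper is hypothetical and follows the spacing shown in this description; the exact spacing produced by the library may differ.

```python
def format_topic(word_weights):
    """Join (word, weight) pairs into a single human-readable string."""
    return " + ".join(f'{weight:.3f} * "{word}"' for word, weight in word_weights)


topic = [("category", -0.340), ("$M$", 0.298), ("algebra", 0.183)]
formatted = format_topic(topic)
```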
print_topics(num_topics=20, num_words=10)
Get the most significant topics (alias for show_topics() method).
- Parameters
num_topics (int, optional) – The number of topics to be selected; if -1, all topics are returned (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
- Returns
Sequence with (topic_id, [(word, value), … ]).
- Return type
list of (int, list of (str, float))
save(fname, *args, **kwargs)
Save the model to a file.
Notes
Large internal arrays may be stored into separate files, with fname as prefix.
Warning
Do not save as a compressed file if you intend to load the file back with mmap.
- Parameters
fname (str) – Path to output file.
*args – Variable length argument list, see gensim.utils.SaveLoad.save().
**kwargs – Arbitrary keyword arguments, see gensim.utils.SaveLoad.save().
show_topic(topicno, topn=10)
Get the words that define a topic along with their contribution.
This is actually the left singular vector of the specified topic.
The most important words in defining the topic (greatest absolute value) are included in the output, along with their contribution to the topic.
- Parameters
topicno (int) – The topic's ID number.
topn (int) – Number of words to be included in the result.
- Returns
Topic representation in BoW format.
- Return type
list of (str, float)
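Selecting the words with the greatest absolute weight can be sketched in a few lines. The top_words helper, the vector values, and the id2word mapping below are made up for illustration and are not taken from the library.

```python
def top_words(singular_vector, id2word, topn=10):
    """Return the topn (word, weight) pairs with the largest absolute weight."""
    ranked = sorted(enumerate(singular_vector), key=lambda pair: abs(pair[1]), reverse=True)
    return [(id2word[term_id], weight) for term_id, weight in ranked[:topn]]


id2word = {0: "graph", 1: "trees", 2: "system", 3: "user"}
vector = [0.10, -0.62, 0.45, -0.05]  # one made-up left singular vector
result = top_words(vector, id2word, topn=2)
```

Note that the sign is kept in the output: a large negative weight is as important for the topic as a large positive one.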
show_topics(num_topics=-1, num_words=10, log=False, formatted=True)
Get the most significant topics.
- Parameters
num_topics (int, optional) – The number of topics to be selected; if -1, all topics are returned (ordered by significance).
num_words (int, optional) – The number of words to be included per topic (ordered by significance).
log (bool, optional) – If True, also log the topics with the logger.
formatted (bool, optional) – If True, each topic is represented as a string; otherwise, in BoW format.
- Returns
list of (int, str) – If formatted=True, return sequence with (topic_id, string representation of topics) OR
list of (int, list of (str, float)) – Otherwise, return sequence with (topic_id, [(word, value), … ]).
class gensim.models.lsimodel.Projection(m, k, docs=None, use_svdlibc=False, power_iters=2, extra_dims=100, dtype=<class 'numpy.float64'>)
Bases: gensim.utils.SaveLoad
Low dimensional projection of a term-document matrix.
This class takes care of the 'core math'. Interfacing with corpora, splitting large corpora into chunks, merging them, etc. is done through the higher-level LsiModel class.
Notes
The projection can later be updated by merging it with another Projection via merge(). This is how incremental training actually happens.
Construct the (U, S) projection from a corpus.
- Parameters
m (int) – Number of features (terms) in the corpus.
k (int) – Desired rank of the decomposed matrix.
docs ({iterable of list of (int, float), scipy.sparse.csc}) – Corpus in BoW format or as sparse matrix.
use_svdlibc (bool, optional) – If True, the sparsesvd library is used; otherwise, gensim's own implementation is used.
power_iters (int, optional) – Number of power iteration steps to be used. Tune to improve accuracy.
extra_dims (int, optional) – Extra samples to be used besides the rank k. Tune to improve accuracy.
dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
empty_like()
Get an empty Projection with the same parameters as the current object.
- Returns
An empty copy (without corpus) of the current projection.
- Return type
Projection
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
merge(other, decay=1.0)
Merge current Projection instance with another.
Warning
The content of other is destroyed in the process, so pass this function a copy of other if you need it further. The other Projection is expected to contain the same number of features.
- Parameters
other (Projection) – The Projection object to be merged into the current one. It will be destroyed after merging.
decay (float, optional) – Weight of existing observations relative to new ones. Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows LSA to gradually "forget" old observations (documents) and give more preference to new ones.
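The idea behind merging two (U, S) factorizations, and the effect of decay, can be sketched in NumPy. This is an illustrative reconstruction under stated assumptions, not the library's code: merge_sketch is a hypothetical function, it works on small dense arrays, it does not truncate the result to a target rank, and it leaves its inputs intact.

```python
import numpy as np


def merge_sketch(u1, s1, u2, s2, decay=1.0):
    """Merge two (U, S) factorizations that share the same number of rows (terms)."""
    c = u1.T @ u2                      # part of u2 already explained by u1
    q, r = np.linalg.qr(u2 - u1 @ c)   # orthonormal basis for the remainder
    k1, k2 = len(s1), len(s2)
    # Small (k1 + k2) x (k1 + k2) matrix that carries all the singular values.
    small = np.block([
        [np.diag(decay * s1), c * s2],
        [np.zeros((k2, k1)), r * s2],
    ])
    u_small, s, _ = np.linalg.svd(small, full_matrices=False)
    return np.hstack([u1, q]) @ u_small, s


rng = np.random.default_rng(0)
a = rng.standard_normal((8, 3))        # old observations: 8 terms, 3 documents
b = rng.standard_normal((8, 4))        # new observations: 8 terms, 4 documents
u1, s1, _ = np.linalg.svd(a, full_matrices=False)
u2, s2, _ = np.linalg.svd(b, full_matrices=False)

_, s_merged = merge_sketch(u1, s1, u2, s2)
_, s_decayed = merge_sketch(u1, s1, u2, s2, decay=0.5)
```

With decay=1.0 the merged singular values match those of the two document sets placed side by side; with decay=0.5 they match the case where the old documents are down-weighted by one half before the new ones are appended.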
save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
gensim.models.lsimodel.ascarray(a, name='')
Return a contiguous array in memory (C order).
- Parameters
a (numpy.ndarray) – Input array.
name (str, optional) – Array name, used for logging purposes.
- Returns
Contiguous array (row-major order) of same shape and content as a.
- Return type
np.ndarray
gensim.models.lsimodel.asfarray(a, name='')
Get an array laid out in Fortran order in memory.
- Parameters
a (numpy.ndarray) – Input array.
name (str, optional) – Array name, used only for logging purposes.
- Returns
The input a in Fortran, or column-major order.
- Return type
np.ndarray
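The difference between C (row-major) and Fortran (column-major) layout can be observed with NumPy directly. The sketch below uses numpy.ascontiguousarray and numpy.asfortranarray rather than the two helpers documented here; whether the helpers delegate to these functions is an assumption.

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)   # C order by default
f = np.asfortranarray(a)                            # same values, column-major layout
c = np.ascontiguousarray(f)                         # back to row-major layout
```

Only the memory layout changes; shape and contents stay the same, which is why the conversion is invisible to indexing but matters to BLAS/LAPACK routines.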
gensim.models.lsimodel.clip_spectrum(s, k, discard=0.001)
Find how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.
- Parameters
s (list of float) – Eigenvalues of the original matrix.
k (int) – Maximum desired rank (number of factors).
discard (float) – Fraction of the spectrum's energy to be discarded.
- Returns
Rank (number of factors) of the reduced matrix.
- Return type
int
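One plausible reading of this rule is sketched below in pure Python. The clip_spectrum_sketch function is hypothetical and is not the library's implementation; in particular, the cap of the threshold at 1/k is an assumption of this sketch. It keeps factors until the energy that remains uncovered drops to the discard threshold, and never keeps more than k.

```python
from itertools import accumulate


def clip_spectrum_sketch(s, k, discard=0.001):
    """Return how many leading factors to keep, at most k."""
    total = sum(s)
    # remaining[i] is the share of energy not covered by factors 0..i.
    remaining = [abs(1.0 - covered / total) for covered in accumulate(s)]
    threshold = min(discard, 1.0 / k)
    keep = 1 + sum(1 for share in remaining if share > threshold)
    return min(k, keep)


rank_a = clip_spectrum_sketch([100.0, 10.0, 1.0, 0.001], k=4)
rank_b = clip_spectrum_sketch([5.0, 4.0, 3.0], k=2)
```

In the first call the last value carries less than 0.1% of the total energy, so it is dropped; in the second call every value is significant and the result is limited by k.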
gensim.models.lsimodel.print_debug(id2token, u, s, topics, num_words=10, num_neg=None)
Log the most salient words per topic.
- Parameters
id2token (Dictionary) – Mapping from ID to word in the Dictionary.
u (np.ndarray) – The 2D U decomposition matrix.
s (np.ndarray) – The 1D reduced array of eigenvalues used for decomposition.
topics (list of int) – Sequence of topic IDs to be printed.
num_words (int, optional) – Number of words to be included for each topic.
num_neg (int, optional) – Number of words with a negative contribution to a topic that should be included.
gensim.models.lsimodel.stochastic_svd(corpus, rank, num_terms, chunksize=20000, extra_dims=None, power_iters=0, dtype=<class 'numpy.float64'>, eps=1e-06)
Run truncated Singular Value Decomposition (SVD) on a sparse input.
- Parameters
corpus ({iterable of list of (int, float), scipy.sparse}) – Input corpus as a stream (does not have to fit in RAM) or a sparse matrix of shape (num_terms, num_documents).
rank (int) – Desired number of factors to be retained after decomposition.
num_terms (int) – The number of features (terms) in corpus.
chunksize (int, optional) – Number of documents to be used in each training chunk.
extra_dims (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy, but lowers performance.
dtype (numpy.dtype, optional) – Enforces a type for elements of the decomposed matrix.
eps (float, optional) – Fraction of the spectrum's energy to be discarded.
Notes
The corpus may be larger than RAM (an iterator of vectors). If corpus is a scipy.sparse.csc matrix instead, it is assumed that the whole corpus fits into core memory and a different (more efficient) code path is chosen. This may return fewer than the requested number of top rank factors if the input itself is of lower rank. The extra_dims (oversampling) and especially power_iters (power iterations) parameters affect the accuracy of the decomposition.
This algorithm uses 2 + power_iters passes over the input data. If you can only afford a single pass, set onepass=True in LsiModel and avoid using this function directly.
The decomposition algorithm is based on “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions”.
- Returns
The left singular vectors and the singular values of the corpus.
- Return type
(np.ndarray 2D, np.ndarray 1D)
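The roles of oversampling and power iterations can be illustrated with a dense NumPy sketch of a randomized SVD. This is a simplified illustration under stated assumptions, not the function documented here: randomized_svd_sketch is a hypothetical name, the input is a small in-memory dense matrix rather than a streamed sparse corpus, and the default of 10 extra dimensions is arbitrary.

```python
import numpy as np


def randomized_svd_sketch(a, rank, extra_dims=10, power_iters=2, seed=0):
    """Approximate the top `rank` left singular vectors and singular values of `a`."""
    rng = np.random.default_rng(seed)
    m, n = a.shape
    samples = min(rank + extra_dims, m, n)       # oversample beyond the target rank

    # Stage 1: find an orthonormal basis q that captures most of the range of a.
    q, _ = np.linalg.qr(a @ rng.standard_normal((n, samples)))
    for _ in range(power_iters):
        # Power iterations sharpen the basis when singular values decay slowly.
        q, _ = np.linalg.qr(a.T @ q)
        q, _ = np.linalg.qr(a @ q)

    # Stage 2: decompose the small projected matrix and map back to term space.
    u_small, s, _ = np.linalg.svd(q.T @ a, full_matrices=False)
    return (q @ u_small)[:, :rank], s[:rank]


rng = np.random.default_rng(1)
a = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))  # exactly rank 5
u, s = randomized_svd_sketch(a, rank=5)
```

For an input of exactly rank 5 the sketch recovers the singular values to numerical precision; on real corpora with slowly decaying spectra the result is approximate, and more power iterations or extra dimensions trade time for accuracy.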