sklearn_api.ldaseqmodel – Scikit learn wrapper for LdaSeq model
Scikit-learn interface for LdaSeqModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api.ldaseqmodel import LdaSeqTransformer
>>>
>>> # Create a sequential LDA transformer to extract 2 topics from the common corpus.
>>> # Divide the work into 3 unequal time slices.
>>> model = LdaSeqTransformer(id2word=common_dictionary, num_topics=2, time_slice=[3, 4, 2], initialize='gensim')
>>>
>>> # Each document almost entirely belongs to one of the two topics.
>>> transformed_corpus = model.fit_transform(common_corpus)
class gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer(time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Sequential LDA module, wraps the LdaSeqModel model.
For more information take a look at David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
- Parameters
time_slice (list of int, optional) – Number of documents in each time-slice.
id2word (Dictionary, optional) – Mapping from an ID to the word it represents in the vocabulary.
alphas (float, optional) – The prior probability of each topic.
num_topics (int, optional) – Number of latent topics to be discovered in the corpus.
initialize ({'gensim', 'own', 'ldamodel'}, optional) – Controls the initialization of the DTM model. Supports three different modes:
‘gensim’: Uses gensim’s own LDA initialization.
‘own’: Uses your own initialization matrix of an LDA model that has been previously trained.
‘ldamodel’: Uses a previously trained LDA model, passed through the lda_model argument.
sstats (np.ndarray of shape [vocab_len, num_topics], optional) – If initialize is set to ‘own’ this will be used to initialize the DTM model.
lda_model (LdaModel, optional) – If initialize is set to ‘ldamodel’, this object will be used to create the sstats initialization matrix.
obs_variance (float, optional) – Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.
chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.
passes (int, optional) – Number of passes over the corpus for the initial LdaModel.
random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.
lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.
em_min_iter (int, optional) – Minimum number of iterations until convergence of the Expectation-Maximization algorithm.
em_max_iter (int, optional) – Maximum number of iterations until convergence of the Expectation-Maximization algorithm.
chunksize (int, optional) – Number of documents in the corpus to be processed in a chunk.
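Since time_slice gives the number of documents in each slice, its entries presumably have to add up to the number of documents passed to fit (the [3, 4, 2] in the example above describes 9 documents in total). The following is a minimal plain-Python sketch of that bookkeeping. `slice_bounds` is a hypothetical helper written for this illustration, not part of gensim, and it assumes the documents are ordered by time.

```python
from itertools import accumulate


def slice_bounds(time_slice, num_docs):
    """Return (start, stop) document index ranges for each time slice.

    Hypothetical helper for illustration only; it is not part of gensim.
    Assumes documents are ordered by time and that the slice sizes must
    add up to the corpus length.
    """
    if any(n <= 0 for n in time_slice):
        raise ValueError("every time slice must contain at least one document")
    if sum(time_slice) != num_docs:
        raise ValueError(
            "time_slice sums to %d but the corpus has %d documents"
            % (sum(time_slice), num_docs)
        )
    # Running totals give the exclusive end index of each slice.
    stops = list(accumulate(time_slice))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


bounds = slice_bounds([3, 4, 2], num_docs=9)
print(bounds)  # [(0, 3), (3, 7), (7, 9)]
```

Running a check like this before fitting can surface a mismatched time_slice early, before any training time is spent.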
fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
- Returns
The trained model.
- Return type
LdaSeqTransformer
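The BOW format expected for X represents each document as a list of (token id, count) pairs. The sketch below builds such a structure with the standard library only, to show its shape. `to_bow`, the toy vocabulary, and the toy documents are all invented for this illustration; in practice gensim's Dictionary.doc2bow is the usual way to produce these pairs.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a tokenized document into a sorted list of (token_id, count) pairs.

    Hypothetical stand-in for illustration; tokens missing from token2id are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy vocabulary and documents, invented for this example.
token2id = {"graph": 0, "trees": 1, "minors": 2}
docs = [["graph", "trees", "graph"], ["minors", "survey", "trees"]]

X = [to_bow(doc, token2id) for doc in docs]
print(X)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```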
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
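To make the <component>__<parameter> naming concrete, here is a simplified plain-Python sketch of how such keys can be split into a component name and a parameter name. It only illustrates the naming convention and is not scikit-learn's implementation. `group_nested_params` and the step name "ldaseq" are hypothetical.

```python
def group_nested_params(params):
    """Split {'component__parameter': value} keys by their component prefix.

    Simplified illustration of the naming convention only; this is not
    scikit-learn's implementation.
    """
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        name, delim, sub_key = key.partition("__")
        if delim:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[name] = value
    return own, nested


# "ldaseq" is a hypothetical step name in a pipeline.
own, nested = group_nested_params(
    {"ldaseq__num_topics": 5, "ldaseq__passes": 20, "verbose": True}
)
print(own)     # {'verbose': True}
print(nested)  # {'ldaseq': {'num_topics': 5, 'passes': 20}}
```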
transform(docs)
Infer the topic distribution for docs.
- Parameters
docs ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format to be transformed.
- Returns
The topic representation of each document.
- Return type
numpy.ndarray of shape [len(docs), num_topics]
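Each row of the returned array holds one document's weights over the num_topics topics, so a common follow-up step is to pick the highest-weight topic per document. The sketch below does this on nested lists so that it runs without numpy. The values are placeholders made up for illustration, not real model output, and `dominant_topics` is a hypothetical helper.

```python
def dominant_topics(topic_matrix):
    """Return the index of the highest-weight topic for each document row."""
    return [max(range(len(row)), key=row.__getitem__) for row in topic_matrix]


# Placeholder values for illustration, not real model output.
transformed = [
    [0.99, 0.01],
    [0.02, 0.98],
    [0.97, 0.03],
]
print(dominant_topics(transformed))  # [0, 1, 0]
```

With an actual numpy.ndarray, `transformed.argmax(axis=1)` gives the same indices in one call.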