sklearn_api.ldaseqmodel – Scikit learn wrapper for LdaSeq model

Scikit learn interface for LdaSeqModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api.ldaseqmodel import LdaSeqTransformer
>>>
>>> # Create a sequential LDA transformer to extract 2 topics from the common corpus.
>>> # Divide the work into 3 unequal time slices.
>>> model = LdaSeqTransformer(id2word=common_dictionary, num_topics=2, time_slice=[3, 4, 2], initialize='gensim')
>>>
>>> # Each document almost entirely belongs to one of the two topics.
>>> transformed_corpus = model.fit_transform(common_corpus)
class gensim.sklearn_api.ldaseqmodel.LdaSeqTransformer(time_slice=None, id2word=None, alphas=0.01, num_topics=10, initialize='gensim', sstats=None, lda_model=None, obs_variance=0.5, chain_variance=0.005, passes=10, random_state=None, lda_inference_max_iter=25, em_min_iter=6, em_max_iter=20, chunksize=100)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Sequential LDA module, wraps LdaSeqModel model.

For more information take a look at David M. Blei, John D. Lafferty: “Dynamic Topic Models”.

Parameters
  • time_slice (list of int, optional) – Number of documents in each time-slice.

  • id2word (Dictionary, optional) – Mapping from an ID to the word it represents in the vocabulary.

  • alphas (float, optional) – The prior probability of each topic.

  • num_topics (int, optional) – Number of latent topics to be discovered in the corpus.

  • initialize ({'gensim', 'own', 'ldamodel'}, optional) –

    Controls the initialization of the DTM model. Supports three different modes:
    • ’gensim’: Uses gensim’s own LDA initialization.

    • ’own’: Uses your own initialization matrix of an LDA model that has been previously trained.

    • ’ldamodel’: Uses a previously trained LDA model, passed through the lda_model argument.

  • sstats (np.ndarray of shape [vocab_len, num_topics], optional) – If initialize is set to ‘own’ this will be used to initialize the DTM model.

  • lda_model (LdaModel, optional) – If initialize is set to ‘ldamodel’ this object will be used to create the sstats initialization matrix.

  • obs_variance (float, optional) –

    Observed variance used to approximate the true and forward variance as shown in David M. Blei, John D. Lafferty: “Dynamic Topic Models”.

  • chain_variance (float, optional) – Gaussian parameter defined in the beta distribution to dictate how the beta values evolve.

  • passes (int, optional) – Number of passes over the corpus for the initial LdaModel.

  • random_state ({numpy.random.RandomState, int}, optional) – Can be a np.random.RandomState object, or the seed to generate one. Used for reproducibility of results.

  • lda_inference_max_iter (int, optional) – Maximum number of iterations in the inference step of the LDA training.

  • em_min_iter (int, optional) – Minimum number of iterations until convergence of the Expectation-Maximization algorithm.

  • em_max_iter (int, optional) – Maximum number of iterations until convergence of the Expectation-Maximization algorithm.

  • chunksize (int, optional) – Number of documents in the corpus to be processed in a chunk.
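
The relationship between time_slice and the corpus can be sketched in plain Python. This is only an illustration: it assumes the documents are already ordered chronologically and that the slice sizes sum to the number of documents. The helper split_by_time_slice is hypothetical and not part of gensim.

```python
from itertools import islice


def split_by_time_slice(corpus, time_slice):
    """Group chronologically ordered documents into consecutive time slices.

    Hypothetical helper, not part of gensim: it only illustrates how the
    counts in ``time_slice`` map onto an ordered corpus.
    """
    corpus = list(corpus)
    if sum(time_slice) != len(corpus):
        raise ValueError(
            "time_slice sums to %d but the corpus has %d documents"
            % (sum(time_slice), len(corpus))
        )
    documents = iter(corpus)
    # Take the next `size` documents for each slice, in order.
    return [list(islice(documents, size)) for size in time_slice]


# Nine toy documents in BOW format, split 3 / 4 / 2 like the example at the top.
toy_corpus = [[(doc_id, 1)] for doc_id in range(9)]
slices = split_by_time_slice(toy_corpus, [3, 4, 2])
slice_sizes = [len(docs) for docs in slices]
```

Here the first three documents form the first time slice, the next four the second, and the last two the third.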

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.

Returns

The trained model.

Return type

LdaSeqTransformer
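
The BOW format expected by fit represents each document as a list of (token_id, count) pairs. A minimal sketch of that representation follows; the helper to_bow and the toy vocabulary are made up for illustration. In practice gensim's Dictionary.doc2bow produces this representation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Convert a tokenized document into a sorted list of (token_id, count) pairs.

    Hypothetical helper for illustration; tokens missing from ``token2id``
    are skipped.
    """
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Toy vocabulary and document, invented for this example.
token2id = {"human": 0, "interface": 1, "computer": 2}
doc = ["human", "computer", "interface", "human", "unknown"]
bow = to_bow(doc, token2id)
```

A corpus passed as X is then an iterable of such lists, one per document.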

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.

  • y (ndarray of shape (n_samples,), default=None) – Target values.

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object
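
The <component>__<parameter> naming convention can be illustrated without scikit-learn. The sketch below only shows how such names split into a component and a parameter; it is not scikit-learn's implementation, and the step name 'ldaseq' is a hypothetical pipeline step assumed to hold an LdaSeqTransformer.

```python
def group_nested_params(params):
    """Split parameter names on the first '__' into own and per-component parameters.

    Illustrative only; this is not scikit-learn's implementation.
    """
    own, nested = {}, {}
    for key, value in params.items():
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            # 'ldaseq__passes' -> component 'ldaseq', parameter 'passes'
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


# 'ldaseq' is a hypothetical step name; 'verbose' belongs to the outer object.
own, nested = group_nested_params(
    {"verbose": True, "ldaseq__num_topics": 5, "ldaseq__passes": 20}
)
```

With a real pipeline, the equivalent call would be along the lines of pipeline.set_params(ldaseq__num_topics=5), given a step registered under that name.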

transform(docs)

Infer the topic distribution for docs.

Parameters

docs ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format to be transformed.

Returns

The topic representation of each document.

Return type

numpy.ndarray of shape [len(docs), num_topics]
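
Since the result holds one row per document and one column per topic, a common follow-up step is to pick the highest-weighted topic for each document. The sketch below uses made-up values in place of real transform output; dominant_topics is a hypothetical helper that works on nested lists as well as on the rows of a 2-D array.

```python
def dominant_topics(doc_topics):
    """Return the index of the highest-weighted topic for each document row."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topics]


# Made-up values standing in for the output of transform(); not real model output.
doc_topics = [
    [0.97, 0.03],
    [0.05, 0.95],
    [0.60, 0.40],
]
labels = dominant_topics(doc_topics)
```

Each entry of labels is the column index of the topic with the largest weight for the corresponding document.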