sklearn_api.lsimodel – Scikit learn wrapper for Latent Semantic Indexing
Scikit-learn interface for gensim.models.lsimodel.LsiModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
Integrate with sklearn Pipelines:
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn import linear_model
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import LsiTransformer
>>>
>>> # Create stages for our pipeline (including gensim and sklearn models alike).
>>> model = LsiTransformer(num_topics=15, id2word=common_dictionary)
>>> clf = linear_model.LogisticRegression(penalty='l2', C=0.1)
>>> pipe = Pipeline([('features', model,), ('classifier', clf)])
>>>
>>> # Create some random binary labels for our documents.
>>> labels = np.random.choice([0, 1], len(common_corpus))
>>>
>>> # How well does our pipeline perform on the training set?
>>> score = pipe.fit(common_corpus, labels).score(common_corpus, labels)
class gensim.sklearn_api.lsimodel.LsiTransformer(num_topics=200, id2word=None, chunksize=20000, decay=1.0, onepass=True, power_iters=2, extra_samples=100)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base LSI module, wraps LsiModel. For more information, see Latent semantic analysis.
- Parameters
num_topics (int, optional) – Number of requested factors (latent dimensions).
id2word (Dictionary, optional) – ID to word mapping.
chunksize (int, optional) – Number of documents to be used in each training chunk.
decay (float, optional) – Weight of existing observations relative to new ones.
onepass (bool, optional) – Whether the one-pass algorithm should be used for training. Pass False to force a multi-pass stochastic algorithm.
power_iters (int, optional) – Number of power iteration steps to be used. Increasing the number of power iterations improves accuracy but slows down training.
extra_samples (int, optional) – Extra samples to be used besides the rank k. Can improve accuracy.
fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training.
- Returns
The trained model.
- Return type
LsiTransformer
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
partial_fit(X)
Train model over a potentially incomplete set of documents.
This method can be used in two ways:
- On an unfitted model, in which case the model is initialized and trained on X.
- On an already fitted model, in which case the model is further trained on X.
- Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – Stream of document vectors or sparse matrix of shape: [num_documents, num_terms].
- Returns
The trained model.
- Return type
LsiTransformer
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
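To make the <component>__<parameter> naming concrete, here is a small sketch with a scikit-learn Pipeline. It uses sklearn.decomposition.TruncatedSVD as a stand-in for LsiTransformer, an assumption made only so that the snippet depends on scikit-learn alone. The step names mirror the pipeline in the Examples section.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stand-in stages: TruncatedSVD replaces LsiTransformer for illustration only.
pipe = Pipeline([
    ("features", TruncatedSVD(n_components=2)),
    ("classifier", LogisticRegression(C=0.1)),
])

# Nested parameters are addressed as <component>__<parameter>.
pipe.set_params(features__n_components=5, classifier__C=1.0)

# get_params(deep=True) reports the nested values under the same names.
params = pipe.get_params(deep=True)
n_components = params["features__n_components"]
c_value = params["classifier__C"]
```

With the gensim wrapper in the features step, the equivalent call would presumably be pipe.set_params(features__num_topics=5), since num_topics is one of its constructor parameters.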
transform(docs)
Computes the latent factors for docs.
- Parameters
docs ({iterable of list of (int, number), list of (int, number), scipy.sparse matrix}) – Document or collection of documents in BOW format to be transformed.
- Returns
Topic distribution matrix.
- Return type
numpy.ndarray of shape [len(docs), num_topics]