sklearn_api.text2bow – Scikit learn wrapper word<->id mapping
Scikit learn interface for Dictionary.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.sklearn_api import Text2BowTransformer
>>>
>>> # Get a corpus as an iterable of unicode strings.
>>> texts = [u'compiler system computer', u'loading computer system']
>>>
>>> # Create a transformer.
>>> model = Text2BowTransformer()
>>>
>>> # Use sklearn-style `fit_transform` to get the BOW representation of each document.
>>> model.fit_transform(texts)
[[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
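Each inner list in the output above is one document, and each pair is (token id, count). The following is a minimal pure-Python sketch of that word<->id mapping, not the gensim implementation. It assumes whitespace tokenization and assigns ids to unseen tokens in sorted order per document, a choice made here only so that the sketch reproduces the ids shown above.

```python
from collections import Counter

texts = ['compiler system computer', 'loading computer system']

# Build the word -> id mapping; unseen tokens get the next free id.
token2id = {}
for text in texts:
    for token in sorted(set(text.split())):
        token2id.setdefault(token, len(token2id))

# Encode every document as a sorted list of (token id, count) pairs.
bows = [
    sorted((token2id[t], n) for t, n in Counter(text.split()).items())
    for text in texts
]

# Invert the mapping (id -> word) to read a BOW vector back as words.
id2token = {i: t for t, i in token2id.items()}
decoded = [[(id2token[i], n) for i, n in bow] for bow in bows]

print(bows)     # [[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
print(decoded)  # [[('compiler', 1), ('computer', 1), ('system', 1)], [('computer', 1), ('system', 1), ('loading', 1)]]
```

Word order within a document is lost in this representation; only the per-token counts survive.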
- class gensim.sklearn_api.text2bow.Text2BowTransformer(prune_at=2000000, tokenizer=<function tokenize>)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Text2Bow module, wraps Dictionary. For more information, see the Bag-of-words model.
- Parameters
prune_at (int, optional) – Maximum number of unique words. The Dictionary keeps no more than prune_at words.
tokenizer (callable (str -> list of str), optional) – A callable that splits a document into a list of terms. Defaults to gensim.utils.tokenize().
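As an illustration of the tokenizer contract (str -> list of str), here is a small sketch of a custom callable. The name simple_tokenizer and its lowercase, letters-only behavior are hypothetical choices for this example, not part of gensim.

```python
import re


def simple_tokenizer(text):
    """Hypothetical tokenizer: lowercase the text and return runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


tokens = simple_tokenizer("Loading the Computer-System, twice!")
print(tokens)  # ['loading', 'the', 'computer', 'system', 'twice']

# Assuming gensim.sklearn_api is importable, the callable would be passed as:
# model = Text2BowTransformer(tokenizer=simple_tokenizer)
```

Any callable with the same input and output types should be usable in its place.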
- fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X (iterable of str) – A collection of documents used for training the model.
- Returns
The trained model.
- Return type
Text2BowTransformer
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
- partial_fit(X)
Train model over a potentially incomplete set of documents.
This method can be used in two ways:
- On an unfitted model, in which case the dictionary is initialized and trained on X.
- On an already fitted model, in which case the dictionary is expanded by X.
- Parameters
X (iterable of str) – A collection of documents used to train the model.
- Returns
The trained model.
- Return type
Text2BowTransformer
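The two ways of calling partial_fit can be sketched with a toy stand-in class. ToyText2Bow is hypothetical and is not the gensim implementation; it assumes whitespace tokenization, no pruning, and that tokens missing from the dictionary are skipped at transform time.

```python
from collections import Counter


class ToyText2Bow:
    """Toy stand-in for the transformer (hypothetical, for illustration only)."""

    def __init__(self, tokenizer=str.split):
        self.tokenizer = tokenizer
        self.token2id = None  # None marks an unfitted model

    def partial_fit(self, X):
        if self.token2id is None:
            self.token2id = {}  # unfitted model: start a fresh dictionary
        for doc in X:
            for token in sorted(set(self.tokenizer(doc))):
                # fitted model: known tokens keep their ids, new ones are appended
                self.token2id.setdefault(token, len(self.token2id))
        return self

    def transform(self, docs):
        if isinstance(docs, str):
            docs = [docs]  # treat a single document as a one-item collection
        result = []
        for doc in docs:
            counts = Counter(t for t in self.tokenizer(doc) if t in self.token2id)
            result.append(sorted((self.token2id[t], n) for t, n in counts.items()))
        return result


model = ToyText2Bow().partial_fit(['compiler system computer'])
before = dict(model.token2id)

model.partial_fit(['loading computer system'])
after = dict(model.token2id)

print(before)  # {'compiler': 0, 'computer': 1, 'system': 2}
print(after)   # {'compiler': 0, 'computer': 1, 'system': 2, 'loading': 3}
print(model.transform('loading loading system'))  # [[(2, 1), (3, 2)]]
```

In this sketch, ids assigned by the first call stay stable after the second call, so BOW vectors produced earlier remain valid.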
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
- transform(docs)
Get the BOW format for the docs.
- Parameters
docs ({iterable of str, str}) – A collection of documents to be transformed.
- Returns
The BOW representation of each document.
- Return type
iterable of list of (int, int) 2-tuples