sklearn_api.text2bow – Scikit-learn wrapper for word<->id mapping

Scikit-learn interface for Dictionary.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.sklearn_api import Text2BowTransformer
>>>
>>> # Get a corpus as an iterable of unicode strings.
>>> texts = [u'compiler system computer', u'loading computer system']
>>>
>>> # Create a transformer.
>>> model = Text2BowTransformer()
>>>
>>> # Use sklearn-style `fit_transform` to get the BOW representation of each document.
>>> model.fit_transform(texts)
[[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]

class gensim.sklearn_api.text2bow.Text2BowTransformer(prune_at=2000000, tokenizer=<function tokenize>)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Text2Bow module, wraps Dictionary.

For more information, see the Bag-of-words model.
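As a rough illustration of the bag-of-words mapping this class produces, the sketch below builds a word<->id mapping and the per-document (id, count) pairs in plain Python. It is not gensim's implementation: `to_bow` and `token2id` are names chosen for this example, and the id assignment order (new tokens of each document registered in sorted order) is an assumption made so that the result has the same shape as the example output above.

```python
from collections import Counter


def to_bow(tokens, token2id, allow_update=True):
    """Convert a list of tokens to sorted (token_id, count) pairs."""
    counts = Counter(tokens)
    if allow_update:
        # Register unseen tokens, giving each the next free integer id.
        for token in sorted(t for t in counts if t not in token2id):
            token2id[token] = len(token2id)
    # Tokens that are missing from the mapping are skipped.
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


token2id = {}
texts = ["compiler system computer", "loading computer system"]
bows = [to_bow(text.split(), token2id) for text in texts]
print(bows)  # [[(0, 1), (1, 1), (2, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Each inner list describes one document; each pair is a token id followed by the number of times that token occurs in the document.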

Parameters

prune_at (int, optional) – Maximum number of unique words. The Dictionary keeps no more than prune_at words.
tokenizer (callable (str -> list of str), optional) – A callable that splits a document into a list of terms; defaults to gensim.utils.tokenize().
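Any callable that maps a string to a list of strings should be usable as `tokenizer`. A minimal sketch, assuming lowercase, whitespace-delimited tokens are acceptable for your data; `whitespace_tokenizer` is a hypothetical helper written for this example, not part of gensim:

```python
def whitespace_tokenizer(document):
    """Split a document into lowercase tokens on runs of whitespace."""
    return document.lower().split()


tokens = whitespace_tokenizer("Loading  Computer system")
print(tokens)  # ['loading', 'computer', 'system']

# The callable would then be passed at construction time, e.g.
# Text2BowTransformer(tokenizer=whitespace_tokenizer)
```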

fit(X, y=None)

Fit the model according to the given training data.

Parameters: X (iterable of str) – A collection of documents used for training the model.
Returns: The trained model.
Return type: Text2BowTransformer

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters

X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples. For this transformer, an iterable of str, as for fit().
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new); for this transformer, the BOW representation returned by transform(), as in the example above

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

partial_fit(X)

Train model over a potentially incomplete set of documents.

This method can be used in two ways:

On an unfitted model, in which case the dictionary is initialized and trained on X.
On an already fitted model, in which case the dictionary is expanded with the documents in X.

Parameters: X (iterable of str) – A collection of documents used to train the model.
Returns: The trained model.
Return type: Text2BowTransformer
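The two cases described for partial_fit can be sketched with a toy stand-in. `TinyBowModel` is a hypothetical class written for this illustration; it only mimics the initialize-then-expand behaviour and does not reproduce gensim's Dictionary (no pruning, whitespace tokenization only).

```python
class TinyBowModel:
    """Toy stand-in that mimics the two partial_fit cases."""

    def __init__(self):
        self.token2id = None  # None means "not fitted yet"

    def partial_fit(self, documents):
        if self.token2id is None:
            # Case 1: unfitted model, so start from an empty dictionary.
            self.token2id = {}
        # Case 2 (and the rest of case 1): add tokens not seen before.
        for document in documents:
            for token in sorted(set(document.split())):
                self.token2id.setdefault(token, len(self.token2id))
        return self  # returning self allows call chaining


model = TinyBowModel()
model.partial_fit(["compiler system computer"])
size_after_first_batch = len(model.token2id)
model.partial_fit(["loading computer system"])
size_after_second_batch = len(model.token2id)
print(size_after_first_batch, size_after_second_batch)  # 3 4
```

The second call keeps the ids assigned by the first one and only appends an id for the new token.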

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: object
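The `<component>__<parameter>` naming can be illustrated without scikit-learn. The sketch below only shows how such keys separate into the estimator's own parameters and per-component parameters; `split_nested_params` and the step name `bow` are hypothetical, and this is not scikit-learn's code.

```python
def split_nested_params(params):
    """Group `<component>__<parameter>` keys by component name."""
    own, nested = {}, {}
    for key, value in params.items():
        # Split on the first double underscore only.
        component, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params({"prune_at": 1000, "bow__prune_at": 500})
print(own)     # {'prune_at': 1000}
print(nested)  # {'bow': {'prune_at': 500}}
```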

transform(docs)

Get the BOW format for the docs.

Parameters: docs ({iterable of str, str}) – A collection of documents, or a single document, to be transformed.
Returns: The BOW representation of each document.
Return type: iterable of list of (int, int) 2-tuples
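Because `docs` may be a single string or an iterable of strings, a transform step has to normalize its input first. A minimal sketch of that idea, assuming whitespace tokenization and assuming that tokens absent from the fitted mapping are ignored; `toy_transform` is a hypothetical function, not the library's method:

```python
def toy_transform(docs, token2id):
    """Return one BOW list per document, using a fixed token->id mapping."""
    if isinstance(docs, str):
        docs = [docs]  # a single string is treated as one document
    result = []
    for doc in docs:
        counts = {}
        for token in doc.split():
            if token in token2id:  # assumption: unseen tokens are ignored
                token_id = token2id[token]
                counts[token_id] = counts.get(token_id, 0) + 1
        result.append(sorted(counts.items()))
    return result


token2id = {"compiler": 0, "computer": 1, "system": 2}
single = toy_transform("computer system system", token2id)
batch = toy_transform(["compiler", "unknown computer"], token2id)
print(single)  # [[(1, 1), (2, 2)]]
print(batch)   # [[(0, 1)], [(1, 1)]]
```

In both cases the result contains one list per document, so a single string yields a one-element result.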
