sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection

Scikit learn interface for gensim.models.phrases.Phrases.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>>
>>> # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
...     ['I', 'love', 'computer', 'science'],
...     ['computer', 'science', 'is', 'my', 'passion'],
...     ['I', 'studied', 'computer', 'science']
... ]
>>>
>>> # Use sklearn fit_transform to see the transformation.
>>> # Since computer and science were seen together 3+ times they are considered a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Phrases module, wraps Phrases.

For more information, please see Mikolov, et al.: “Distributed Representations of Words and Phrases and their Compositionality” and Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.

Parameters
  • min_count (int, optional) – Terms with a count lower than this will be ignored.

  • threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.

  • max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.

  • delimiter (bytes, optional) – Character used to join collocation tokens; must be a byte string (e.g. b'_').

  • progress_per (int, optional) – Progress will be reported to the logger once every progress_per sentences processed during training.

  • scoring (str or function, optional) –

    Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set either with a string that refers to a built-in scoring function, or with a custom function that has the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:

    'default' uses the original scoring method from Mikolov, et al., implemented in original_scorer().

    'npmi' uses normalized pointwise mutual information, from Gerlof Bouma, implemented in npmi_scorer(). It is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.

    To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function, see original_scorer() for an example. The function must accept all of these parameters (even if it uses only some of them):

    • worda_count: number of occurrences in sentences of the first token in the phrase being scored

    • wordb_count: number of occurrences in sentences of the second token in the phrase being scored

    • bigram_count: number of occurrences in sentences of the phrase being scored

    • len_vocab: the number of unique tokens in sentences

    • min_count: the min_count setting of the Phrases class

    • corpus_word_count: the total number of (non-unique) tokens in sentences

    A scoring function that does not accept all of these parameters (even if some of them are unused) will raise a ValueError when the Phrases class is initialized. The scoring function must also be pickleable; see the sketch after this parameter list.

  • common_terms (set of str, optional) – Set of “stop words” that won't affect the frequency count of expressions containing them. Allows detection of expressions like “bank_of_america” or “eye_of_the_beholder”.
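
As an illustration, a custom scoring function might look like the following sketch. The name count_scorer and its scoring rule are invented for this example and are not part of gensim; note that all six parameters must appear in the signature even though only bigram_count is used here.

>>> def count_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
...     # Score a bigram purely by how often its two tokens occur together.
...     return float(bigram_count)
>>>
>>> # With this scorer, threshold is compared directly against the raw bigram count.
>>> m_custom = PhrasesTransformer(min_count=1, threshold=2, scoring=count_scorer)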

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

PhrasesTransformer
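
For example, reusing the texts list from the Examples section above:

>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.fit(texts)  # fit returns the fitted PhrasesTransformer itself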

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –

  • y (ndarray of shape (n_samples,), default=None) – Target values.

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)
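
The signature above is inherited from scikit-learn's TransformerMixin; in practice this transformer accepts an iterable of token lists and returns a list of token lists. A short sketch, continuing the Examples above:

>>> phrased = m.fit_transform(texts)
>>> # 'computer science' was seen 3+ times, so it is joined in every sentence.
>>> assert phrased[1] == ['computer_science', 'is', 'my', 'passion']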

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any
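
For example, inspecting the settings of the transformer created in the Examples above:

>>> params = m.get_params()
>>> assert params['min_count'] == 1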

partial_fit(X)

Train model over a potentially incomplete set of sentences.

This method can be used in two ways:
  1. On an unfitted model in which case the model is initialized and trained on X.

  2. On an already fitted model in which case the X sentences are added to the vocabulary.

Parameters

X (iterable of list of str) – Sequence of sentences to be used for training the model.

Returns

The trained model.

Return type

PhrasesTransformer
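
A sketch of incremental training on the same toy corpus (splitting it into two batches is purely for illustration):

>>> m2 = PhrasesTransformer(min_count=1, threshold=3)
>>> m2 = m2.partial_fit(texts[:2])  # initializes and trains the model
>>> m2 = m2.partial_fit(texts[2:])  # adds the remaining sentences to the vocabulary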

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object
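
For example (the values here are arbitrary; parameters are typically changed before fitting, e.g. during a grid search):

>>> m3 = PhrasesTransformer()
>>> m3 = m3.set_params(min_count=1, threshold=3.0)  # set_params returns the estimator itself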

transform(docs)

Transform the input documents into phrase tokens.

The tokens of each detected phrase will be joined by self.delimiter.

Parameters

docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.

Returns

Phrase representation for each of the input sentences.

Return type

iterable of list of str
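
A sketch applying the model fitted in the Examples above to unseen documents (the new sentences are invented for illustration; tokens never seen during training pass through unchanged):

>>> new_docs = [['computer', 'science', 'rocks'], ['I', 'love', 'programming']]
>>> assert m.transform(new_docs)[0] == ['computer_science', 'rocks']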