sklearn_api.phrases – Scikit learn wrapper for phrase (collocation) detection
Scikit learn interface for gensim.models.phrases.Phrases.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.sklearn_api.phrases import PhrasesTransformer
>>>
>>> # Create the model. Make sure no term is ignored and combinations seen 3+ times are captured.
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> texts = [
... ['I', 'love', 'computer', 'science'],
... ['computer', 'science', 'is', 'my', 'passion'],
... ['I', 'studied', 'computer', 'science']
... ]
>>>
>>> # Use sklearn fit_transform to see the transformation.
>>> # Since computer and science were seen together 3+ times they are considered a phrase.
>>> assert ['I', 'love', 'computer_science'] == m.fit_transform(texts)[0]
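Because the wrapper follows scikit-learn conventions, it can be chained with other estimators. A minimal sketch of a two-step pipeline, assuming gensim's W2VTransformer from the same sklearn_api package (the step names and hyperparameters here are illustrative):
>>> from sklearn.pipeline import Pipeline
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> # Detect phrases first, then train word2vec on the phrased sentences.
>>> pipe = Pipeline([
...     ('phrases', PhrasesTransformer(min_count=1, threshold=3)),
...     ('w2v', W2VTransformer(size=10, min_count=1, seed=1)),
... ])
>>> pipe = pipe.fit(texts)  # 'computer_science' enters the word2vec vocabulary as a single token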
class gensim.sklearn_api.phrases.PhrasesTransformer(min_count=5, threshold=10.0, max_vocab_size=40000000, delimiter=b'_', progress_per=10000, scoring='default', common_terms=frozenset({}))
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Phrases module, wraps gensim.models.phrases.Phrases.
For more information, see Mikolov, et al.: “Distributed Representations of Words and Phrases and their Compositionality” and Gerlof Bouma: “Normalized (Pointwise) Mutual Information in Collocation Extraction”.
- Parameters
min_count (int, optional) – Terms with a count lower than this will be ignored.
threshold (float, optional) – Only phrases scoring above this will be accepted, see scoring below.
max_vocab_size (int, optional) – Maximum size of the vocabulary. Used to control pruning of less common words, to keep memory under control. The default of 40M needs about 3.6GB of RAM.
delimiter (str, optional) – Character used to join collocation tokens, should be a byte string (e.g. b’_’).
progress_per (int, optional) – Training reports progress to the logger once every this many phrases have been learned.
scoring (str or function, optional) –
Specifies how potential phrases are scored for comparison to the threshold setting. scoring can be set with either a string that refers to a built-in scoring function, or with a function with the expected parameter names. Two built-in scoring functions are available by setting scoring to a string:
'default': original_scorer(), by Mikolov, et al.
'npmi': npmi_scorer(), normalized pointwise mutual information, by Bouma.
'npmi' is more robust when dealing with common words that form part of common bigrams, and ranges from -1 to 1, but is slower to calculate than the default.
To use a custom scoring function, create a function with the following parameters and set the scoring parameter to that function; see original_scorer() as an example, and the sketch after this parameter list. You must define all of these parameters in the function signature, even if only some of them are used:
worda_count: number of occurrences in sentences of the first token in the phrase being scored
wordb_count: number of occurrences in sentences of the second token in the phrase being scored
bigram_count: number of occurrences in sentences of the phrase being scored
len_vocab: the number of unique tokens in sentences
min_count: the min_count setting of the Phrases class
corpus_word_count: the total number of (non-unique) tokens in sentences
A scoring function missing any of these parameters (even if it never uses them) will raise a ValueError on initialization of the Phrases class. The scoring function must also be pickleable.
common_terms (set of str, optional) – Set of “stop words” that won’t affect the frequency count of expressions containing them. Allows detecting expressions like “bank_of_america” or “eye_of_the_beholder”.
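For illustration, a minimal sketch of a custom scoring function, reusing texts from the module-level example; the name ratio_scorer and its formula are made up for this sketch and are not part of gensim:
>>> # All six parameters must appear in the signature, even if unused,
>>> # and the function must be pickleable (e.g. defined at module level).
>>> def ratio_scorer(worda_count, wordb_count, bigram_count, len_vocab,
...                  min_count, corpus_word_count):
...     return bigram_count / float(worda_count * wordb_count)
>>>
>>> m = PhrasesTransformer(min_count=1, threshold=0.5, scoring=ratio_scorer)
>>> m = m.fit(texts)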
fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X (iterable of list of str) – Sequence of sentences to be used for training the model.
- Returns
The trained model.
- Return type
PhrasesTransformer
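A short sketch of the fit/transform split, reusing texts from the module-level example; the unseen sentence below is made up:
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.fit(texts)
>>> # Unseen sentences can now be transformed; unknown tokens pass through untouched.
>>> assert m.transform([['computer', 'science', 'everywhere']])[0] == ['computer_science', 'everywhere']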
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
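A quick sketch: constructor arguments round-trip through get_params(), which is what GridSearchCV and clone() rely on:
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> assert m.get_params()['threshold'] == 3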
partial_fit(X)
Train model over a potentially incomplete set of sentences.
- This method can be used in two ways:
On an unfitted model, in which case the model is initialized and trained on X.
On an already fitted model, in which case the sentences in X are added to the vocabulary.
- Parameters
X (iterable of list of str) – Sequence of sentences to be used for training the model.
- Returns
The trained model.
- Return type
PhrasesTransformer
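A sketch of incremental training with made-up batches; vocabulary counts accumulate across calls, so a phrase can cross the threshold only after later batches arrive:
>>> m = PhrasesTransformer(min_count=1, threshold=3)
>>> m = m.partial_fit([['computer', 'science', 'is', 'booming']])
>>> m = m.partial_fit([['I', 'enjoy', 'computer', 'science']])  # 'computer science' has now been seen twice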
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
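A brief sketch of reconfiguring the estimator; note that changed parameters only take effect on the wrapped gensim model at the next fit():
>>> m = PhrasesTransformer()
>>> m = m.set_params(min_count=1, threshold=3)
>>> assert m.get_params()['min_count'] == 1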
transform(docs)
Transform the input documents into phrase tokens.
Words in the sentence will be joined by self.delimiter.
- Parameters
docs ({iterable of list of str, list of str}) – Sequence of documents to be transformed.
- Returns
Phrase representation for each of the input sentences.
- Return type
iterable of list of str
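As a usage sketch, transform() accepts either several tokenized documents or a single one, continuing with the fitted model m from the module-level example:
>>> # Multiple documents: pass an iterable of token lists.
>>> m.transform([['computer', 'science', 'is', 'fun'], ['I', 'love', 'pizza']])
>>> # A single document (a plain list of str) is wrapped internally.
>>> assert m.transform(['computer', 'science', 'rules'])[0] == ['computer_science', 'rules']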