sklearn_api.d2vmodel – Scikit-learn wrapper for the paragraph2vec model

Scikit-learn interface for Doc2Vec.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.sklearn_api import D2VTransformer
>>>
>>> model = D2VTransformer(min_count=1, vector_size=5)
>>> docvecs = model.fit_transform(common_texts)  # represent `common_texts` as vectors
class gensim.sklearn_api.d2vmodel.D2VTransformer(dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, dv=None, dv_mapfile=None, comment=None, trim_rule=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, sorted_vocab=1, batch_words=10000)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base Doc2Vec module, wraps Doc2Vec.

This model is based on Quoc Le, Tomas Mikolov: “Distributed Representations of Sentences and Documents”.

Parameters
  • dm_mean (int {1,0}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm_concat=0.

  • dm (int {1,0}, optional) – Defines the training algorithm. If dm=1, distributed memory (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is used.

  • dbow_words (int {1,0}, optional) – If 1, trains word vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training. If 0, trains only doc-vectors (faster).

  • dm_concat (int {1,0}, optional) – If 1, use concatenation of context vectors rather than sum/average. Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.

  • dm_tag_count (int, optional) – Expected constant number of document tags per document, when using dm_concat mode.

  • dv (KeyedVectors) – A mapping from a string or int tag to its vector representation.

  • dv_mapfile (str, optional) – Path to a file containing the docvecs mapping. If dv is None, this file will be used to create it.

  • comment (str, optional) – A model descriptive comment, used for logging and debugging purposes.

  • trim_rule (function ((str, int, int) -> int), optional) – Vocabulary trimming rule that accepts (word, count, min_count). Specifies whether certain words should remain in the vocabulary (gensim.utils.RULE_KEEP), be trimmed away (gensim.utils.RULE_DISCARD), or handled using the default (gensim.utils.RULE_DEFAULT). If None, then gensim.utils.keep_vocab_item() will be used.

  • vector_size (int, optional) – Dimensionality of the feature vectors.

  • alpha (float, optional) – The initial learning rate.

  • window (int, optional) – The maximum distance between the current and predicted word within a sentence.

  • min_count (int, optional) – Ignores all words with total frequency lower than this.

  • max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

  • sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

  • seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.

  • workers (int, optional) – Number of worker threads used to train the model. More threads speed up training on multicore machines.

  • min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.

  • hs (int {1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.

  • negative (int, optional) – If > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.

  • cbow_mean (int, optional) – Same as dm_mean; unused.

  • hashfxn (function (object -> int), optional) – A hashing function. Used to create an initial random reproducible vector by hashing the random seed.

  • epochs (int, optional) – Number of epochs to iterate through the corpus.

  • sorted_vocab (bool, optional) – Whether the vocabulary should be sorted internally.

  • batch_words (int, optional) – Number of words to be handled by each job.
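
As an illustration of the trim_rule parameter described above, the following is a minimal sketch of a rule with the (word, count, min_count) signature. The function name keep_domain_terms and the PROTECTED set are hypothetical, and the fallback constant values are an assumption that is only used when gensim cannot be imported.

```python
try:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD, RULE_KEEP
except ImportError:
    # Assumed stand-in values, used only when gensim is unavailable.
    RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

# Hypothetical set of words that must survive vocabulary trimming.
PROTECTED = {"graph", "trees"}


def keep_domain_terms(word, count, min_count):
    """Keep protected words, drop one-character tokens, else use the default."""
    if word in PROTECTED:
        return RULE_KEEP
    if len(word) == 1:
        return RULE_DISCARD
    # Defer to the default min_count-based decision for everything else.
    return RULE_DEFAULT
```

A function like this would be passed as trim_rule=keep_domain_terms when constructing the transformer.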

fit(X, y=None)

Fit the model according to the given training data.

Parameters

X ({iterable of TaggedDocument, iterable of list of str}) – A collection of documents used for training the model, given either as tagged documents or as plain lists of tokens.

Returns

The trained model.

Return type

D2VTransformer
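
The two input forms accepted by fit can be related with a short sketch. It builds tagged documents from plain token lists, assuming each document's position is an acceptable tag. The helper tag_documents is hypothetical, and the namedtuple fallback is a stand-in (assumed to share the words and tags fields) that is only used when gensim cannot be imported.

```python
from collections import namedtuple

try:
    from gensim.models.doc2vec import TaggedDocument
except ImportError:
    # Stand-in with the same (words, tags) fields, used only without gensim.
    TaggedDocument = namedtuple("TaggedDocument", "words tags")


def tag_documents(token_lists):
    """Wrap each token list in a TaggedDocument tagged with its position."""
    return [TaggedDocument(words, [i]) for i, words in enumerate(token_lists)]


raw_docs = [["human", "interface", "computer"], ["graph", "trees"]]
tagged = tag_documents(raw_docs)
```

Either raw_docs or tagged matches one of the types listed for X.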

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.

  • y (ndarray of shape (n_samples,), default=None) – Target values.

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params – Parameter names mapped to their values.

Return type

mapping of string to any

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params (dict) – Estimator parameters.

Returns

self – Estimator instance.

Return type

object
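
The <component>__<parameter> naming convention can be illustrated without scikit-learn. The helper split_nested_params below is hypothetical, as are the component names d2v and clf; it is a toy sketch of how such names separate into a component and a parameter, not scikit-learn's implementation.

```python
def split_nested_params(params):
    """Group flat parameters by the component prefix before the first '__'."""
    own, nested = {}, {}
    for key, value in params.items():
        component, delim, sub_key = key.partition("__")
        if delim:
            # "d2v__epochs" targets parameter "epochs" of component "d2v".
            nested.setdefault(component, {})[sub_key] = value
        else:
            # No delimiter: the parameter belongs to the outer object itself.
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": True, "d2v__vector_size": 50, "d2v__epochs": 10, "clf__C": 0.5}
)
```

Here own holds the parameters of the outer object, while nested groups the remaining ones by the component they address.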

transform(docs)

Infer the vector representations for the input documents.

Parameters

docs ({iterable of list of str, list of str}) – Input document or sequence of documents.

Returns

The vector representation of the docs.

Return type

numpy.ndarray of shape [len(docs), vector_size]
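
Because transform accepts either a single document or a sequence of documents, calling code may want to normalize both cases before inspecting results. The helper as_document_list below is a hypothetical sketch of that normalization and is not claimed to be the wrapper's own code.

```python
def as_document_list(docs):
    """Return a list of token lists, wrapping a single document if needed."""
    docs = list(docs)
    if docs and isinstance(docs[0], str):
        # A flat list of strings is one document, so wrap it in a list.
        return [docs]
    return docs
```

With this normalization, the number of rows expected from transform equals len(as_document_list(docs)).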