sklearn_api.d2vmodel – Scikit-learn wrapper for paragraph2vec model
Scikit-learn interface for Doc2Vec.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_texts
>>> from gensim.sklearn_api import D2VTransformer
>>>
>>> model = D2VTransformer(min_count=1, vector_size=5)
>>> docvecs = model.fit_transform(common_texts) # represent `common_texts` as vectors
-
class gensim.sklearn_api.d2vmodel.D2VTransformer(dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, dv=None, dv_mapfile=None, comment=None, trim_rule=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, sorted_vocab=1, batch_words=10000)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Doc2Vec module, wraps Doc2Vec.
This model is based on Quoc Le, Tomas Mikolov: “Distributed Representations of Sentences and Documents”.
- Parameters
dm_mean (int {1,0}, optional) – If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm_concat=0.
dm (int {1,0}, optional) – Defines the training algorithm. If dm=1, distributed memory (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.
dbow_words (int {1,0}, optional) – If set to 1, trains word-vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training. If 0, only trains doc-vectors (faster).
dm_concat (int {1,0}, optional) – If 1, use concatenation of context vectors rather than sum/average. Note concatenation results in a much-larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.
dm_tag_count (int, optional) – Expected constant number of document tags per document, when using dm_concat mode.
dv (KeyedVectors) – A mapping from a string or int tag to its vector representation.
dv_mapfile (str, optional) – Path to a file containing the docvecs mapping. If dv is None, this file will be used to create it.
comment (str, optional) – A model descriptive comment, used for logging and debugging purposes.
trim_rule (function ((str, int, int) -> int), optional) – Vocabulary trimming rule that accepts (word, count, min_count). Specifies whether certain words should remain in the vocabulary (gensim.utils.RULE_KEEP), be trimmed away (gensim.utils.RULE_DISCARD), or handled using the default (gensim.utils.RULE_DEFAULT). If None, then gensim.utils.keep_vocab_item() will be used.
vector_size (int, optional) – Dimensionality of the feature vectors.
alpha (float, optional) – The initial learning rate.
window (int, optional) – The maximum distance between the current and predicted word within a sentence.
min_count (int, optional) – Ignores all words with total frequency lower than this.
max_vocab_size (int, optional) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
sample (float, optional) – The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).
seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (`workers=1`), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.
workers (int, optional) – Use this many worker threads to train the model. Will yield a speedup when training with multicore machines.
min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
hs (int {1,0}, optional) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int, optional) – If > 0, negative sampling will be used; the value specifies how many “noise words” should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.
cbow_mean (int, optional) – Same as dm_mean, unused.
hashfxn (function (object -> int), optional) – A hashing function. Used to create an initial random reproducible vector by hashing the random seed.
epochs (int, optional) – Number of epochs to iterate through the corpus.
sorted_vocab (bool, optional) – Whether the vocabulary should be sorted internally.
batch_words (int, optional) – Number of words to be handled by each job.
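The trim_rule parameter above takes a callable with the signature (word, count, min_count). The following is a minimal sketch of such a rule, not a recommended configuration: the name example_trim_rule and the two word sets are hypothetical, and the fallback constant values are assumed stand-ins used only when gensim cannot be imported.

```python
try:
    from gensim.utils import RULE_DEFAULT, RULE_DISCARD, RULE_KEEP
except ImportError:
    # Assumed stand-in values so the sketch still runs without gensim installed.
    RULE_DEFAULT, RULE_DISCARD, RULE_KEEP = 0, 1, 2

# Hypothetical word lists, chosen only for illustration.
ALWAYS_DROP = {"the", "a", "of"}
ALWAYS_KEEP = {"doc2vec"}


def example_trim_rule(word, count, min_count):
    """Decide the fate of one vocabulary entry.

    Drops listed stopwords, keeps listed domain terms regardless of
    frequency, and defers to the default handling for everything else.
    """
    if word in ALWAYS_DROP:
        return RULE_DISCARD
    if word in ALWAYS_KEEP:
        return RULE_KEEP
    return RULE_DEFAULT


# Apply the rule to a few (word, count) pairs with min_count=5.
decisions = {
    word: example_trim_rule(word, count, 5)
    for word, count in [("the", 100), ("doc2vec", 1), ("graph", 3)]
}
```

Under these assumptions, a rule like this would be supplied as trim_rule=example_trim_rule when constructing the transformer.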
-
fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X ({iterable of TaggedDocument, iterable of list of str}) – A collection of tagged documents used for training the model.
- Returns
The trained model.
- Return type
D2VTransformer
-
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
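fit_transform is inherited from sklearn.base.TransformerMixin, so its behaviour can be illustrated without Doc2Vec. The sketch below uses a hypothetical toy class, TokenCountTransformer, that is not part of gensim; it only shows that calling fit_transform gives the same result as calling fit and then transform on the same data.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class TokenCountTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer used only to illustrate fit_transform.

    Each tokenized document becomes [token count, count of tokens seen in fit].
    """

    def fit(self, X, y=None):
        # Remember every token observed in the training documents.
        self.vocabulary_ = {token for doc in X for token in doc}
        return self

    def transform(self, X):
        return [
            [len(doc), sum(token in self.vocabulary_ for token in doc)]
            for doc in X
        ]


docs = [["human", "interface"], ["graph", "minors", "survey"]]

# One call that fits and transforms ...
combined = TokenCountTransformer().fit_transform(docs)
# ... versus the two explicit steps.
separate = TokenCountTransformer().fit(docs).transform(docs)
```

Both variables hold the same rows here, one per input document.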
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
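The <component>__<parameter> naming used by get_params and set_params can be seen on any scikit-learn pipeline. The sketch below uses two standard scikit-learn estimators rather than D2VTransformer, and the step names "scale" and "clf" are arbitrary choices made for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names ("scale", "clf") are illustrative; any valid names would do.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update parameters of the nested steps through the pipeline itself.
pipeline.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also reports the parameters of the contained estimators.
params = pipeline.get_params(deep=True)

# deep=False reports only the pipeline's own constructor parameters.
shallow = pipeline.get_params(deep=False)
```

A transformer placed in a pipeline step would presumably be addressed the same way, with its step name as the prefix.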
-
transform(docs)
Infer the vector representations for the input documents.
- Parameters
docs ({iterable of list of str, list of str}) – Input document or sequence of documents.
- Returns
The vector representation of the docs.
- Return type
numpy.ndarray of shape [len(docs), vector_size]