sklearn_api.w2vmodel – Scikit learn wrapper for word2vec model¶
Scikit learn interface for Word2Vec.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_texts
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> # Create a model to represent each word by a 10 dimensional vector.
>>> model = W2VTransformer(vector_size=10, min_count=1, seed=1)
>>>
>>> # What is the vector representation of the word 'graph'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)
-
class gensim.sklearn_api.w2vmodel.W2VTransformer(vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)¶
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base Word2Vec module, wraps Word2Vec.
For more information, please have a look at Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean: “Efficient Estimation of Word Representations in Vector Space”.
- Parameters
vector_size (int) – Dimensionality of the feature vectors.
alpha (float) – The initial learning rate.
window (int) – The maximum distance between the current and predicted word within a sentence.
min_count (int) – Ignores all words with total frequency lower than this.
max_vocab_size (int) – Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.
sample (float) – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).
seed (int) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).
workers (int) – Use this many worker threads to train the model (=faster training with multicore machines).
min_alpha (float) – Learning rate will linearly drop to min_alpha as training progresses.
sg (int {1, 0}) – Defines the training algorithm. If 1, skip-gram is employed; otherwise, CBOW is used.
hs (int {1,0}) – If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.
negative (int) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
cbow_mean (int {1,0}) – If 0, use the sum of the context word vectors; if 1, use the mean. Only applies when CBOW is used.
hashfxn (callable (object -> int), optional) – A hashing function. Used to create an initial random reproducible vector by hashing the random seed.
epochs (int) – Number of iterations (epochs) over the corpus.
null_word (int {1, 0}) – If 1, a null pseudo-word will be created for padding when using concatenative L1 (run-of-words).
trim_rule (function) – Vocabulary trimming rule, specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used, look to keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model. A callable of this form is sketched just below this parameter list.
sorted_vocab (int {1,0}) – If 1, sort the vocabulary by descending frequency before assigning word indexes.
batch_words (int) – Target size (in words) for batches of examples passed to worker threads (and thus cython routines). (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
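The trim_rule argument accepts any callable with the (word, count, min_count) signature. Below is a minimal sketch of such a callable, assuming gensim.utils exposes the RULE_DISCARD, RULE_KEEP and RULE_DEFAULT constants described above; the stop-word set is purely illustrative and not part of gensim.
>>> from gensim import utils
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> ILLUSTRATIVE_STOPWORDS = {'the', 'a', 'of'}  # hypothetical stop list, not part of gensim
>>>
>>> def my_trim_rule(word, count, min_count):
...     # Discard the illustrative stopwords, keep clearly frequent words,
...     # and fall back to the default min_count behaviour for everything else.
...     if word in ILLUSTRATIVE_STOPWORDS:
...         return utils.RULE_DISCARD
...     if count >= 2 * min_count:
...         return utils.RULE_KEEP
...     return utils.RULE_DEFAULT
>>>
>>> model = W2VTransformer(vector_size=10, min_count=1, trim_rule=my_trim_rule)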
-
fit
(X, y=None)¶ Fit the model according to the given training data.
- Parameters
X (iterable of iterables of str) – The input corpus. X can be simply a list of lists of tokens, but for larger corpora, consider an iterable that streams the sentences directly from disk/network. See BrownCorpus, Text8Corpus or LineSentence in the word2vec module for such examples.
- Returns
The trained model.
- Return type
W2VTransformer
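As a sketch of streaming a larger corpus from disk rather than passing an in-memory list, assuming a hypothetical file corpus.txt with one pre-tokenized, whitespace-separated sentence per line:
>>> from gensim.models.word2vec import LineSentence
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> # 'corpus.txt' is a hypothetical file with one pre-tokenized sentence per line.
>>> sentences = LineSentence('corpus.txt')
>>> model = W2VTransformer(vector_size=10, min_count=1).fit(sentences)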
-
fit_transform
(X, y=None, **fit_params)¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) – Input samples.
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
-
get_params
(deep=True)¶ Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
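For example, assuming the constructor parameters listed above are stored under attributes of the same name (as scikit-learn's BaseEstimator convention requires), a quick sketch:
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> model = W2VTransformer(vector_size=10, min_count=1)
>>> params = model.get_params()
>>> assert params['vector_size'] == 10
>>> assert params['min_count'] == 1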
-
partial_fit
(X)¶
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
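A minimal sketch of the nested form, assuming W2VTransformer is combined with a scikit-learn estimator in a Pipeline (the step names 'w2v' and 'clf' are arbitrary):
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> pipe = Pipeline([
...     ('w2v', W2VTransformer(vector_size=10, min_count=1)),
...     ('clf', LogisticRegression()),
... ])
>>> # Nested parameters use the <component>__<parameter> form.
>>> pipe = pipe.set_params(w2v__window=2, clf__C=0.5)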
-
transform
(words)¶ Get the word vectors for the input words.
- Parameters
words ({iterable of str, str}) – Word or a collection of words to be transformed.
- Returns
A 2D array where each row is the vector of one word.
- Return type
np.ndarray of shape [len(words), vector_size]
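A short sketch of both input forms, assuming a model fitted on common_texts as in the example at the top of this page and the single-string behaviour described above:
>>> from gensim.test.utils import common_texts
>>> from gensim.sklearn_api import W2VTransformer
>>>
>>> model = W2VTransformer(vector_size=10, min_count=1).fit(common_texts)
>>>
>>> # A single word yields one row; a collection of words yields one row per word.
>>> single = model.transform('graph')
>>> several = model.transform(['graph', 'trees', 'system'])
>>> assert single.shape == (1, 10)
>>> assert several.shape == (3, 10)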