sklearn_api.tfidf – Scikit-learn wrapper for TF-IDF model
Scikit-learn interface for TfidfModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import TfIdfTransformer
>>>
>>> # Transform the word counts inversely to their global frequency using the sklearn interface.
>>> model = TfIdfTransformer(dictionary=common_dictionary)
>>> tfidf_corpus = model.fit_transform(common_corpus)
-
class gensim.sklearn_api.tfidf.TfIdfTransformer(id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs='nfc', pivot=None, slope=0.65)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base TfIdf module, wraps TfidfModel. For more information see tf-idf.
- Parameters
id2word ({dict, Dictionary}, optional) – Mapping from int id to word token that was used to convert the input data to bag-of-words format.
dictionary (Dictionary, optional) – If specified, it is used to directly construct the inverse document frequency mapping.
wlocal (function, optional) – Function for local weighting. The default is identity(), which does nothing. Other options include math.sqrt(), math.log1p(), etc.
wglobal (function, optional) – Function for global weighting. The default is df2idf().
normalize (bool, optional) – Controls how the final transformed vectors are normalized. True (the default) scales each vector to unit length; False disables normalization. You can also set normalize to your own function that accepts and returns a sparse vector.
smartirs (str, optional) – SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’ or ‘bpn’, where the three letters denote, in order, the term frequency weighting, the document frequency weighting, and the normalization of the document vector.
- local_letter (str)
- Term frequency weighting, one of:
b - binary,
t or n - raw,
a - augmented,
l - logarithm,
d - double logarithm,
L - log average.
- global_letter (str)
- Document frequency weighting, one of:
x or n - none,
f - idf,
t - zero-corrected idf,
p - probabilistic idf.
- normalization_letter (str)
- Document normalization, one of:
x or n - none,
c - cosine,
u - pivoted unique,
b - pivoted character length.
Default is ‘nfc’. For more information, see the Wikipedia article on the SMART Information Retrieval System.
pivot (float, optional) – The point around which the regular normalization curve is tilted to obtain the pivoted normalization curve. In the paper “Pivoted Document Length Normalization” by Amit Singhal, Chris Buckley, and Mandar Mitra, it is the point where the retrieval and relevance curves intersect. This parameter is used together with slope for pivoted document length normalization. If pivot is None, smartirs specifies the pivoted unique document normalization scheme, and either corpus or dictionary is specified, the pivot is determined automatically. Otherwise, no pivoted document length normalization is applied.
slope (float, optional) – The parameter of pivoted document length normalization that determines how far the original normalization curve is tilted. It only takes effect when pivot is set by the user and is not None.
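As a rough illustration of how pivot and slope interact, the sketch below computes the pivoted normalization factor from the Singhal, Buckley, and Mitra paper, (1 - slope) * pivot + slope * old_norm, in plain Python. The function name `pivoted_norm` and the sample numbers are hypothetical, and this is not gensim's implementation.

```python
def pivoted_norm(old_norm, pivot, slope=0.65):
    """Return the pivoted normalization factor for one document.

    old_norm is the document's original normalization factor;
    pivot and slope follow the parameter descriptions above.
    """
    return (1.0 - slope) * pivot + slope * old_norm


pivot = 10.0  # hypothetical pivot value
short_doc = pivoted_norm(4.0, pivot)   # original norm below the pivot
long_doc = pivoted_norm(30.0, pivot)   # original norm above the pivot
at_pivot = pivoted_norm(pivot, pivot)  # original norm exactly at the pivot
```

With a slope below 1, documents whose original norm lies above the pivot are divided by a smaller factor than before, documents below the pivot by a larger one, and a document exactly at the pivot is unaffected.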
See also
gensim.models.tfidfmodel.TfidfModel: Class that also uses the SMART scheme.
gensim.models.tfidfmodel.resolve_weights: Function that also uses the SMART scheme.
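To make the default ‘nfc’ mnemonic concrete, here is a minimal pure-Python sketch of raw term frequency (n), idf (f), and cosine normalization (c) applied to a tiny bag-of-words corpus. It assumes idf is computed as log2(N / df); the helper `tfidf_nfc` and the sample corpus are hypothetical, and the result is not guaranteed to match gensim's output in every detail.

```python
import math
from collections import Counter


def tfidf_nfc(corpus):
    """Weight a bag-of-words corpus with raw tf, log2 idf, and cosine normalization."""
    num_docs = len(corpus)
    # Document frequency: the number of documents each term id occurs in
    # (assumes each term id is listed at most once per document).
    doc_freq = Counter(term_id for doc in corpus for term_id, _ in doc)
    # Assumed idf formula: log2(total documents / document frequency).
    idf = {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}

    weighted = []
    for doc in corpus:
        vec = [(term_id, count * idf[term_id]) for term_id, count in doc]
        norm = math.sqrt(sum(w * w for _, w in vec))
        if norm > 0:  # skip normalization for all-zero vectors
            vec = [(term_id, w / norm) for term_id, w in vec]
        weighted.append(vec)
    return weighted


corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1), (2, 1)],
]
weighted = tfidf_nfc(corpus)
```

Term 0 occurs in every document, so its idf is zero and it contributes nothing to any vector; the remaining weights are scaled so that each document vector has unit length.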
-
fit(X, y=None)
Fit the model from the given training data.
- Parameters
X (iterable of iterable of (int, int)) – Input corpus
y (None) – Ignored. TF-IDF is an unsupervised model.
- Returns
The trained model.
- Return type
TfIdfTransformer
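The input X is a corpus in bag-of-words format, that is, one list of (token id, count) pairs per document. The sketch below builds such a corpus from tokenized text in plain Python; the `token2id` mapping is a hypothetical stand-in for a gensim Dictionary, and the sample documents are made up.

```python
from collections import Counter

documents = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system"],
]

# Hypothetical stand-in for a gensim Dictionary: assign ids in first-seen order.
token2id = {}
for tokens in documents:
    for token in tokens:
        token2id.setdefault(token, len(token2id))

# Each document becomes a sorted list of (token_id, count) pairs.
corpus = [
    sorted(Counter(token2id[token] for token in tokens).items())
    for tokens in documents
]
```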
-
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
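The <component>__<parameter> naming convention can be illustrated with a small pure-Python sketch that splits a parameter name at the first double underscore. The helper `split_param` and the step name `tfidf` are hypothetical; this only mirrors the naming convention and is not scikit-learn's own routing code.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter)."""
    component, delimiter, parameter = name.partition("__")
    if not delimiter:
        # No double underscore: the parameter belongs to the estimator itself.
        return None, name
    return component, parameter


# Hypothetical parameters: one set directly, one addressed to a step named "tfidf".
params = {"smartirs": "ntc", "tfidf__slope": 0.25}
routed = {name: split_param(name) for name in params}
```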
-
transform(docs)
Get the tf-idf scores for docs in a bag-of-words representation.
- Parameters
docs ({iterable of list of (int, number)}) – Document or corpus in bag-of-words format.
- Returns
The tf-idf weighted representation of each input document, in bag-of-words format.
- Return type
iterable of list of (int, float) 2-tuples
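Because docs may be either a single document or a whole corpus, a caller-side helper can normalize the two cases before processing. The sketch below wraps a lone bag-of-words document in a list; `as_corpus` is a hypothetical helper, and gensim's own handling of the two input forms may differ.

```python
def as_corpus(docs):
    """Wrap a single bag-of-words document in a list so it can be handled as a corpus."""
    # A single document is a list whose first element is an (id, value) tuple;
    # a corpus is a list whose first element is itself a document (a list).
    if docs and isinstance(docs[0], tuple):
        return [docs]
    return docs


single_doc = [(0, 1), (1, 2)]
small_corpus = [[(0, 1)], [(1, 3), (2, 1)]]

wrapped = as_corpus(single_doc)
unchanged = as_corpus(small_corpus)
```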