`sklearn_api.tfidf` – Scikit-learn wrapper for TF-IDF model

Scikit-learn interface for TfidfModel.

Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.

Examples

>>> from gensim.test.utils import common_corpus, common_dictionary
>>> from gensim.sklearn_api import TfIdfTransformer
>>>
>>> # Transform the word counts inversely to their global frequency using the sklearn interface.
>>> model = TfIdfTransformer(dictionary=common_dictionary)
>>> tfidf_corpus = model.fit_transform(common_corpus)
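
To make "inversely to their global frequency" concrete, here is a rough pure-Python sketch of the weighting under the documented defaults (identity local weighting, df2idf global weighting, unit-length normalization). It does not call gensim. The function name `tfidf_sketch` and the toy corpus are hypothetical, and the sketch assumes the idf is log2(total_docs / doc_freq) and that zero-weight terms are dropped.

```python
import math
from collections import Counter


def tfidf_sketch(corpus):
    """Weight a bag-of-words corpus: raw count * log2(N / df), scaled to unit length.

    corpus is a list of documents, each a list of (term_id, count) pairs.
    This is an illustrative sketch, not gensim's implementation.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term id occurs.
    doc_freq = Counter(t for doc in corpus for t in {t for t, _ in doc})

    weighted_corpus = []
    for doc in corpus:
        # Local weight is the raw count; global weight is log2(N / df).
        weighted = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        # Assumption: terms whose weight is zero (present in every document) are dropped.
        weighted = [(t, w) for t, w in weighted if w != 0.0]
        # Scale the remaining weights so the vector has unit Euclidean length.
        norm = math.sqrt(sum(w * w for _, w in weighted))
        if norm > 0.0:
            weighted = [(t, w / norm) for t, w in weighted]
        weighted_corpus.append(weighted)
    return weighted_corpus


# Hypothetical toy corpus: term 0 occurs in both documents, so its idf is 0.
toy_corpus = [[(0, 1), (1, 2)], [(0, 3), (2, 1)]]
print(tfidf_sketch(toy_corpus))  # [[(1, 1.0)], [(2, 1.0)]]
```

Term 0 appears in every document, so it carries no weight, while the terms unique to one document keep all of it after normalization.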

class gensim.sklearn_api.tfidf.TfIdfTransformer(id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs='nfc', pivot=None, slope=0.65)

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Base TfIdf module, wraps TfidfModel.

For more information see tf-idf.

Parameters

id2word ({dict, Dictionary}, optional) – Mapping from integer id to word token that was used to convert the input data to bag-of-words format.
dictionary (Dictionary, optional) – If specified it will be used to directly construct the inverse document frequency mapping.
wlocal (function, optional) – Function for local weighting. The default is identity(), which leaves the term counts unchanged. Other options include math.sqrt(), math.log1p(), etc.
wglobal (function, optional) – Function for global weighting, default is df2idf().
normalize (bool, optional) – Controls how the final transformed vectors are normalized. True (default) scales each vector to unit length; False disables normalization. You can also set normalize to your own function that accepts and returns a sparse vector.
smartirs (str, optional) –
SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. A combination of weights is written as a three-letter mnemonic XYZ, for example ‘ntc’ or ‘bpn’, where the letters give, in order, the local (term frequency) weighting, the global (document frequency) weighting, and the normalization of the document vector.
local_letter (str) – Term frequency weighting, one of:
b - binary,
t or n - raw,
a - augmented,
l - logarithm,
d - double logarithm,
L - log average.
global_letter (str) – Document frequency weighting, one of:
x or n - none,
f - idf,
t - zero-corrected idf,
p - probabilistic idf.
normalization_letter (str) – Document normalization, one of:
x or n - none,
c - cosine,
u - pivoted unique,
b - pivoted character length.
Default is ‘nfc’. For more information, see the Wikipedia article on the SMART Information Retrieval System.
pivot (float, optional) – The point around which the regular normalization curve is tilted to obtain the pivoted normalization curve. In the paper Amit Singhal, Chris Buckley, Mandar Mitra: “Pivoted Document Length Normalization”, it is the point where the retrieval and relevance curves intersect. This parameter is used together with slope for pivoted document length normalization. If pivot is None, smartirs selects the pivoted unique document normalization scheme, and either a corpus or a dictionary is specified, the pivot is determined automatically. Otherwise, no pivoted document length normalization is applied.
slope (float, optional) – The parameter of pivoted document length normalization that determines how far the old normalization is tilted. This parameter only has an effect when pivot is set by the user and is not None.
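
The three-letter smartirs mnemonic described above can be decoded with plain table lookups. The sketch below is only an illustration built from the letter tables on this page; `describe_smartirs` and the three dictionaries are hypothetical names, not part of gensim.

```python
# Letter tables copied from the smartirs parameter description (illustrative only).
LOCAL_WEIGHTING = {
    "b": "binary", "t": "raw", "n": "raw", "a": "augmented",
    "l": "logarithm", "d": "double logarithm", "L": "log average",
}
GLOBAL_WEIGHTING = {
    "x": "none", "n": "none", "f": "idf",
    "t": "zero-corrected idf", "p": "probabilistic idf",
}
NORMALIZATION = {
    "x": "none", "n": "none", "c": "cosine",
    "u": "pivoted unique", "b": "pivoted character length",
}


def describe_smartirs(mnemonic):
    """Split a three-letter SMART mnemonic into its three named components."""
    if len(mnemonic) != 3:
        raise ValueError("expected a three-letter mnemonic, got %r" % (mnemonic,))
    local_letter, global_letter, normalization_letter = mnemonic
    try:
        return {
            "local": LOCAL_WEIGHTING[local_letter],
            "global": GLOBAL_WEIGHTING[global_letter],
            "normalization": NORMALIZATION[normalization_letter],
        }
    except KeyError as err:
        raise ValueError("unknown letter %s in %r" % (err, mnemonic))


print(describe_smartirs("nfc"))
# {'local': 'raw', 'global': 'idf', 'normalization': 'cosine'}
```

Read this way, the default ‘nfc’ means raw term frequency, idf document frequency weighting, and cosine normalization.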
