models.tfidfmodel – TF-IDF model
This module implements functionality related to the Term Frequency - Inverse Document Frequency (TF-IDF, https://en.wikipedia.org/wiki/Tf%E2%80%93idf) class of vector space bag-of-words models.
For a more in-depth exposition of TF-IDF and its various SMART variants (normalization, weighting schemes), see the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/
- class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=<function identity>, wglobal=<function df2idf>, normalize=True, smartirs=None, pivot=None, slope=0.25)
Bases: gensim.interfaces.TransformationABC
Objects of this class transform a word-document co-occurrence matrix (integer counts) into a locally/globally weighted TF-IDF matrix (positive floats).
Examples
>>> import gensim.downloader as api
>>> from gensim.models import TfidfModel
>>> from gensim.corpora import Dictionary
>>>
>>> dataset = api.load("text8")
>>> dct = Dictionary(dataset)  # fit dictionary
>>> corpus = [dct.doc2bow(line) for line in dataset]  # convert corpus to BoW format
>>>
>>> model = TfidfModel(corpus)  # fit model
>>> vector = model[corpus[0]]  # apply model to the first corpus document
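Compute TF-IDF by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. The formula for the non-normalized weight of term i in document j in a corpus of D documents is

$$weight_{i,j} = frequency_{i,j} \cdot \log_2 \frac{D}{document\_freq_{i}}$$

or, more generally,

$$weight_{i,j} = wlocal(frequency_{i,j}) \cdot wglobal(document\_freq_{i}, D)$$

so you can plug in your own custom wlocal and wglobal functions.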
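As a rough illustration of the computation described above, the following pure-Python sketch multiplies raw term frequency by log2(D / document frequency) and then scales the document to unit Euclidean length. The helper name `tfidf_unit_vector` and the toy counts are invented for this example; the code does not call Gensim and is not Gensim's implementation.

```python
import math


def tfidf_unit_vector(bow, dfs, num_docs):
    """Weight a bag-of-words document and scale it to unit length.

    bow      -- list of (term_id, term_frequency) pairs
    dfs      -- dict mapping term_id -> number of documents containing the term
    num_docs -- total number of documents in the corpus
    """
    # Local weight (raw term frequency) times global weight (log2 IDF).
    weighted = [(term_id, tf * math.log2(num_docs / dfs[term_id])) for term_id, tf in bow]
    # Euclidean norm of the weighted vector.
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return weighted
    return [(term_id, w / norm) for term_id, w in weighted]


# Hypothetical corpus statistics: 8 documents, three terms.
dfs = {0: 1, 1: 2, 2: 8}
vector = tfidf_unit_vector([(0, 1), (1, 2), (2, 5)], dfs, num_docs=8)
print(vector)
```

Term 2 occurs in every document, so its inverse document frequency, and therefore its weight, is zero no matter how often it appears in the document.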
- Parameters
corpus (iterable of iterable of (int, int), optional) – Input corpus.
id2word ({dict, Dictionary}, optional) – Mapping between tokens and ids that was used to convert the input data to bag-of-words format.
dictionary (Dictionary) – If dictionary is specified, it must be a corpora.Dictionary object, and it will be used to directly construct the inverse document frequency mapping (corpus, if specified, is then ignored).
wlocal (callable, optional) – Function for local weighting. Default is identity() (other options: numpy.sqrt(), lambda tf: 0.5 + (0.5 * tf / tf.max()), etc.).
wglobal (callable, optional) – Function for global weighting. Default is df2idf().
normalize ({bool, callable}, optional) – Whether to normalize document vectors to unit Euclidean length. You can also pass your own normalization function.
smartirs (str, optional) –
SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form XYZ, for example ‘ntc’, ‘bpn’ and so on, where the letters represent the term weighting of the document vector.
- Term frequency weighting:
b - binary,
t or n - raw,
a - augmented,
l - logarithm,
d - double logarithm,
L - log average.
- Document frequency weighting:
x or n - none,
f - idf,
t - zero-corrected idf,
p - probabilistic idf.
- Document normalization:
x or n - none,
c - cosine,
u - pivoted unique,
b - pivoted character length.
Default is ‘nfc’. For more information visit SMART Information Retrieval System.
pivot (float or None, optional) –
In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.
You can either set the pivot by hand, or let Gensim determine it automatically with the following two steps:
1. Set either the u or b document normalization in the smartirs parameter.
2. Set either the corpus or dictionary parameter. The pivot will be determined automatically from the properties of the corpus or dictionary.
If pivot is None and you don’t follow steps 1 and 2, then pivoted document length normalization will be disabled. Default is None.
See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
slope (float, optional) –
In information retrieval, TF-IDF is biased against long documents [1]. Pivoted document length normalization solves this problem by changing the norm of a document to slope * old_norm + (1.0 - slope) * pivot.
Setting the slope to 0.0 uses only the pivot as the norm, and setting the slope to 1.0 effectively disables pivoted document length normalization. Singhal [2] suggests setting the slope between 0.2 and 0.3 for best results. Default is 0.25.
See also the blog post at https://rare-technologies.com/pivoted-document-length-normalisation/.
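The effect of pivot and slope can be seen by evaluating the norm formula quoted above, slope * old_norm + (1.0 - slope) * pivot, for a short and a long document. This is a minimal sketch of the formula only; the function name `pivoted_norm` and the numbers are made up for illustration.

```python
def pivoted_norm(old_norm, pivot, slope=0.25):
    """Return slope * old_norm + (1.0 - slope) * pivot."""
    return slope * old_norm + (1.0 - slope) * pivot


pivot = 10.0  # hypothetical pivot value
short_doc_norm, long_doc_norm = 4.0, 40.0  # hypothetical document norms

for slope in (0.0, 0.25, 1.0):
    # slope=0.0 ignores the original norm; slope=1.0 leaves it unchanged.
    print(slope, pivoted_norm(short_doc_norm, pivot, slope), pivoted_norm(long_doc_norm, pivot, slope))
```

With slope 0.25 the two norms move from 4.0 and 40.0 to 8.5 and 17.5, so the long document is divided by a much smaller number than before and is penalized less.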
See also
gensim.sklearn_api.tfidf.TfIdfTransformer: Class that also uses the SMART scheme.
resolve_weights: Function that also uses the SMART scheme.
References
[1] Singhal, A., Buckley, C., & Mitra, M. (1996). Pivoted Document Length Normalization. SIGIR Forum, 51, 176–184.
[2] Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4), 35–43.
- __getitem__(bow, eps=1e-12)
Get the tf-idf representation of an input vector and/or corpus.
- Parameters
bow ({list of (int, int), iterable of iterable of (int, int)}) – Input document in the sparse Gensim bag-of-words format, or a streamed corpus of such documents.
eps (float) – Threshold value; all entries whose tf-idf value is less than eps are removed.
- Returns
vector (list of (int, float)) – TF-IDF vector, if bow is a single document.
TransformedCorpus – TF-IDF corpus, if bow is a corpus.
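The eps threshold can be pictured as a filter over the sparse vector. The sketch below is an assumption about how such a filter might look, not Gensim's code: the helper name `drop_small_weights` is hypothetical, and comparing the absolute value of each weight against eps is a choice made for this example.

```python
def drop_small_weights(vector, eps=1e-12):
    """Keep only (term_id, weight) pairs whose magnitude is not below eps."""
    # Assumption: the magnitude of the weight is compared with eps.
    return [(term_id, weight) for term_id, weight in vector if abs(weight) >= eps]


sparse = [(0, 0.5), (1, 0.0), (2, 1e-15)]
print(drop_small_weights(sparse))
```

Dropping near-zero entries keeps the result sparse, which matters for terms that occur in every document and therefore carry no weight.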
- initialize(corpus)
Compute inverse document weights, which will be used to modify term frequencies for documents.
- Parameters
corpus (iterable of iterable of (int, int)) – Input corpus.
- classmethod load(*args, **kwargs)
Load a previously saved TfidfModel class. Handles backwards compatibility with older TfidfModel versions that did not use pivoted document normalization.
- save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)
Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed, all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them in separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load(): Load object from file.
- gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)
Compute inverse-document-frequency for a term with the given document frequency docfreq:

$$idf = add + \log_{log\_base} \frac{totaldocs}{docfreq}$$
- Parameters
docfreq ({int, float}) – Document frequency.
totaldocs (int) – Total number of documents.
log_base (float, optional) – Base of logarithm.
add (float, optional) – Offset.
- Returns
Inverse document frequency.
- Return type
float
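A minimal sketch of this computation follows, assuming the inverse document frequency is add plus the logarithm (in base log_base) of totaldocs / docfreq, which is what the parameter names and defaults suggest. The function `df2idf_sketch` is a stand-in written for this example and is not the library function itself.

```python
import math


def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Return add + log_{log_base}(totaldocs / docfreq)."""
    return add + math.log(totaldocs / docfreq) / math.log(log_base)


# Hypothetical corpus of 1024 documents.
rare = df2idf_sketch(docfreq=1, totaldocs=1024)
common = df2idf_sketch(docfreq=512, totaldocs=1024)
everywhere = df2idf_sketch(docfreq=1024, totaldocs=1024)
print(rare, common, everywhere)
```

Rare terms receive large weights, and a term that occurs in every document receives a weight equal to add, which is zero by default.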
- gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)
Pre-compute the inverse document frequency mapping for all terms.
- Parameters
wglobal (function) – Custom function for calculating the “global” weighting. See for example the SMART alternatives under smartirs_wglobal().
dfs (dict) – Dictionary mapping each term_id to the number of documents in which that term appears.
total_docs (int) – Total number of documents.
- Returns
Inverse document frequencies in the format {term_id_1: idfs_1, term_id_2: idfs_2, …}.
- Return type
dict of (int, float)
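Conceptually, the pre-computation applies the global weighting function once per term. The sketch below illustrates that idea with a dictionary comprehension; `precompute_idfs_sketch` and `log2_idf` are hypothetical names, and the real function may differ in details.

```python
import math


def precompute_idfs_sketch(wglobal, dfs, total_docs):
    """Apply a global weighting function to every term's document frequency."""
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}


def log2_idf(docfreq, totaldocs):
    # Stand-in global weight for this example: log2(totaldocs / docfreq).
    return math.log2(totaldocs / docfreq)


# Hypothetical document frequencies for three terms in a 16-document corpus.
idfs = precompute_idfs_sketch(log2_idf, {0: 1, 1: 4, 2: 16}, total_docs=16)
print(idfs)
```

Computing the mapping once up front means that transforming a document only needs a dictionary lookup per term.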
- gensim.models.tfidfmodel.resolve_weights(smartirs)
Check the validity of smartirs parameters.
- Parameters
smartirs (str) –
smartirs, or SMART (System for the Mechanical Analysis and Retrieval of Text) Information Retrieval System, is a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd, where the letters represent the term weighting of the document vector. For more information visit SMART Information Retrieval System.
- Returns
str of (local_letter, global_letter, normalization_letter)
local_letter (str) –
- Term frequency weighting, one of:
b - binary,
t or n - raw,
a - augmented,
l - logarithm,
d - double logarithm,
L - log average.
global_letter (str) –
- Document frequency weighting, one of:
x or n - none,
f - idf,
t - zero-corrected idf,
p - probabilistic idf.
normalization_letter (str) –
- Document normalization, one of:
x or n - none,
c - cosine,
u - pivoted unique,
b - pivoted character length.
- Raises
ValueError – If smartirs is not a string of length 3, or if one of the decomposed values is not in the list of permissible values.
See also
gensim.sklearn_api.tfidf.TfIdfTransformer, TfidfModel: Classes that also use the SMART scheme.
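The validation described above can be sketched as follows, using the permissible letters listed in this section. The function `check_smartirs` is a hypothetical stand-in: it only validates and splits the mnemonic, and whether the library additionally maps aliases (such as t to n) is not assumed here.

```python
LOCAL_LETTERS = "btnaldL"        # term frequency weighting
GLOBAL_LETTERS = "xnftp"         # document frequency weighting
NORMALIZATION_LETTERS = "xncub"  # document normalization


def check_smartirs(smartirs):
    """Split a three-letter SMART mnemonic and validate each letter."""
    if not isinstance(smartirs, str) or len(smartirs) != 3:
        raise ValueError("expected a string of length 3, got %r" % (smartirs,))
    local_letter, global_letter, normalization_letter = smartirs
    if local_letter not in LOCAL_LETTERS:
        raise ValueError("invalid term frequency letter: %r" % local_letter)
    if global_letter not in GLOBAL_LETTERS:
        raise ValueError("invalid document frequency letter: %r" % global_letter)
    if normalization_letter not in NORMALIZATION_LETTERS:
        raise ValueError("invalid normalization letter: %r" % normalization_letter)
    return local_letter, global_letter, normalization_letter


print(check_smartirs("ntc"))
print(check_smartirs("bpn"))
```

A mnemonic such as "nt" or "zzz" raises ValueError in this sketch, mirroring the Raises entry above.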
- gensim.models.tfidfmodel.smartirs_normalize(x, norm_scheme, return_norm=False)
Normalize a vector using the normalization scheme specified in norm_scheme.
- Parameters
x (numpy.ndarray) – The tf-idf vector.
norm_scheme ({'n', 'c'}) – Document length normalization scheme.
return_norm (bool, optional) – Return the length of x as well?
- Returns
numpy.ndarray – Normalized array.
float (only if return_norm is set) – Norm of x.
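The two schemes can be illustrated with plain Python lists. This is a sketch under assumptions: `normalize_sketch` is a hypothetical name, it works on lists instead of numpy arrays, and it takes 'n' to mean "leave the vector unchanged" and 'c' to mean "divide by the Euclidean length".

```python
import math


def normalize_sketch(x, norm_scheme, return_norm=False):
    """Normalize x with scheme 'n' (none) or 'c' (cosine)."""
    # Euclidean length of the input vector.
    length = math.sqrt(sum(value * value for value in x))
    if norm_scheme == "n":
        normalized = list(x)
    elif norm_scheme == "c":
        # Avoid dividing by zero for an all-zero vector.
        normalized = [value / length for value in x] if length > 0.0 else list(x)
    else:
        raise ValueError("unsupported norm_scheme: %r" % (norm_scheme,))
    return (normalized, length) if return_norm else normalized


print(normalize_sketch([3.0, 4.0], "n"))
print(normalize_sketch([3.0, 4.0], "c", return_norm=True))
```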
- gensim.models.tfidfmodel.smartirs_wglobal(docfreq, totaldocs, global_scheme)
Calculate global document weight based on the weighting scheme specified in global_scheme.
- Parameters
docfreq (int) – Document frequency.
totaldocs (int) – Total number of documents.
global_scheme ({'n', 'f', 't', 'p'}) – Global transformation scheme.
- Returns
Calculated global weight.
- Return type
float
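The sketch below shows one way the four global schemes could be computed. The formulas are the usual textbook SMART definitions with a base-2 logarithm, which is an assumption made for this example; the library's exact formulas and log base may differ, and `wglobal_sketch` is a hypothetical name.

```python
import math


def wglobal_sketch(docfreq, totaldocs, global_scheme):
    """Global weight for schemes 'n', 'f', 't' and 'p' (textbook definitions)."""
    if global_scheme == "n":
        # No global weighting.
        return 1.0
    if global_scheme == "f":
        # Plain idf.
        return math.log2(totaldocs / docfreq)
    if global_scheme == "t":
        # Zero-corrected idf: stays positive even when docfreq == totaldocs.
        return math.log2((totaldocs + 1.0) / docfreq)
    if global_scheme == "p":
        # Probabilistic idf, clipped at zero.
        ratio = (totaldocs - docfreq) / docfreq
        return max(0.0, math.log2(ratio)) if ratio > 0.0 else 0.0
    raise ValueError("unsupported global_scheme: %r" % (global_scheme,))


for scheme in "nftp":
    print(scheme, wglobal_sketch(docfreq=4, totaldocs=16, global_scheme=scheme))
```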
- gensim.models.tfidfmodel.smartirs_wlocal(tf, local_scheme)
Calculate local term weight for a term using the weighting scheme specified in local_scheme.
- Parameters
tf (int) – Term frequency.
local_scheme ({'b', 'n', 'a', 'l', 'd', 'L'}) – Local transformation scheme.
- Returns
Calculated local weight.
- Return type
float
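The six local schemes can be sketched as below. The formulas are the usual textbook SMART definitions with a base-2 logarithm, which is an assumption; the library may differ in details. Because the augmented and log average schemes need document-level statistics, this hypothetical `wlocal_sketch` takes the list of all term frequencies in one document (each at least 1) instead of a single value.

```python
import math


def wlocal_sketch(tfs, local_scheme):
    """Local weights for one document's term frequencies (textbook definitions)."""
    if local_scheme == "b":  # binary
        return [1.0 if tf > 0 else 0.0 for tf in tfs]
    if local_scheme == "n":  # raw
        return [float(tf) for tf in tfs]
    if local_scheme == "a":  # augmented
        max_tf = max(tfs)
        return [0.5 + 0.5 * tf / max_tf for tf in tfs]
    if local_scheme == "l":  # logarithm
        return [1.0 + math.log2(tf) for tf in tfs]
    if local_scheme == "d":  # double logarithm
        return [1.0 + math.log2(1.0 + math.log2(tf)) for tf in tfs]
    if local_scheme == "L":  # log average
        mean_tf = sum(tfs) / len(tfs)
        return [(1.0 + math.log2(tf)) / (1.0 + math.log2(mean_tf)) for tf in tfs]
    raise ValueError("unsupported local_scheme: %r" % (local_scheme,))


tfs = [1, 2, 4]  # hypothetical term frequencies of one document
for scheme in "bnaldL":
    print(scheme, wlocal_sketch(tfs, scheme))
```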