similarities.termsim – Term similarity queries
This module provides classes that deal with term similarities.
-
class gensim.similarities.termsim.SparseTermSimilarityMatrix(source, dictionary=None, tfidf=None, symmetric=True, dominant=False, nonzero_limit=100, dtype=<class 'numpy.float32'>)
Builds a sparse term similarity matrix using a term similarity index.
Examples
>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>> from gensim.similarities.index import AnnoyIndexer
>>> from scikits.sparse.cholmod import cholesky
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
>>> annoy = AnnoyIndexer(model, num_trees=2)  # use annoy for faster word similarity lookups
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv, kwargs={'indexer': annoy})
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, symmetric=True, dominant=True)
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)]  # calculate similarity of query to each doc from bow_corpus
>>>
>>> word_embeddings = cholesky(similarity_matrix.matrix).L()  # obtain word embeddings from similarity matrix
Check out Tutorial Notebook for more examples.
- Parameters
source (TermSimilarityIndex or scipy.sparse.spmatrix) – The source of the term similarity. Either a term similarity index that will be used for building the term similarity matrix, or an existing sparse term similarity matrix that will be encapsulated and stored in the matrix attribute. When a matrix is specified as the source, any other parameters will be ignored.
dictionary (Dictionary or None, optional) – A dictionary that specifies a mapping between terms and the indices of rows and columns of the resulting term similarity matrix. The dictionary may only be None when source is a scipy.sparse.spmatrix.
tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.
symmetric (bool, optional) – Whether the symmetry of the term similarity matrix will be enforced. Symmetry is a necessary precondition for positive definiteness, which is necessary if you later wish to derive a unique change-of-basis matrix from the term similarity matrix using Cholesky factorization. Setting symmetric to False will significantly reduce memory usage during matrix construction.
dominant (bool, optional) – Whether the strict column diagonal dominance of the term similarity matrix will be enforced. Strict diagonal dominance and symmetry are sufficient preconditions for positive definiteness, which is necessary if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.
nonzero_limit (int or None, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix. If None, then no limit will be imposed.
dtype (numpy.dtype, optional) – The data type of the sparse term similarity matrix.
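The symmetric and dominant options above describe two properties that together are sufficient for positive definiteness, given a positive diagonal. As a minimal sketch, assuming a small dense matrix stored as nested lists (the helper names and the example matrices are hypothetical and are not part of gensim), the two properties can be checked like this:

```python
def is_symmetric(matrix, tol=1e-12):
    """Return True if matrix[i][j] equals matrix[j][i] for all i, j (within tol)."""
    n = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        for i in range(n)
        for j in range(i + 1, n)
    )


def is_strictly_column_dominant(matrix):
    """Return True if every diagonal entry is larger in magnitude than the
    sum of the magnitudes of the other entries in its column."""
    n = len(matrix)
    for j in range(n):
        off_diagonal = sum(abs(matrix[i][j]) for i in range(n) if i != j)
        if abs(matrix[j][j]) <= off_diagonal:
            return False
    return True


# Hypothetical 3-term similarity matrices with ones on the diagonal.
dominant_matrix = [
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
non_dominant_matrix = [
    [1.0, 0.6, 0.5],  # first column: 0.6 + 0.5 exceeds the diagonal entry
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
```

The second matrix is still symmetric, but its first column fails the dominance test, so dominance alone gives no guarantee of positive definiteness for it.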
-
matrix
¶ The encapsulated sparse term similarity matrix.
- Type
scipy.sparse.csc_matrix
- Raises
ValueError – If dictionary is empty.
-
inner_product(X, Y, normalized=(False, False))
Get the inner product(s) between real vectors / corpora X and Y.
Return the inner product(s) between real vectors / corpora X and Y expressed in a non-orthogonal normalized basis, where the dot product between the basis vectors is given by the sparse term similarity matrix.
- Parameters
X (list of (int, float) or iterable of list of (int, float)) – A query vector / corpus in the sparse bag-of-words format.
Y (list of (int, float) or iterable of list of (int, float)) – A document vector / corpus in the sparse bag-of-words format.
normalized (tuple of {True, False, 'maintain'}, optional) – First/second value specifies whether the query/document vectors in the inner product will be L2-normalized (True; corresponds to the soft cosine measure), maintain their L2-norm during change of basis (‘maintain’; corresponds to query expansion with partial membership), or kept as-is (False; corresponds to query expansion; default).
- Returns
The inner product(s) between X and Y.
- Return type
self.matrix.dtype, scipy.sparse.csr_matrix, or numpy.matrix
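To make the inner product in a non-orthogonal basis concrete, here is a plain-Python sketch for a single pair of bag-of-words vectors. It is not gensim's implementation: the function name and the dict-based similarity lookup are hypothetical, the diagonal of the similarity matrix is assumed to be 1, and only the True and False normalization values are covered ('maintain' is left out).

```python
from math import sqrt


def soft_inner_product(x, y, similarity, normalized=(False, False)):
    """Compute x^T S y for sparse bag-of-words vectors x and y.

    x, y: lists of (term_id, weight) pairs.
    similarity: dict mapping (term_id, term_id) to a similarity; missing
    off-diagonal pairs count as 0.0 and the diagonal is assumed to be 1.0.
    """
    def s(i, j):
        if i == j:
            return 1.0
        return similarity.get((i, j), similarity.get((j, i), 0.0))

    def product(a, b):
        return sum(wa * s(i, j) * wb for i, wa in a for j, wb in b)

    result = product(x, y)
    for vector, flag in zip((x, y), normalized):
        if flag is True:
            # Divide by the norm of the vector in the non-orthogonal basis.
            norm = sqrt(product(vector, vector))
            if norm > 0.0:
                result /= norm
        elif flag is not False:
            raise ValueError("this sketch only supports True and False")
    return result


# Hypothetical data: terms 0 and 1 are distinct but related.
example_similarity = {(0, 1): 0.5}
query = [(0, 1.0)]
document = [(1, 1.0)]
```

With an empty similarity dict the two vectors share no terms and the product is zero, as with the ordinary dot product. With `example_similarity` the related terms contribute 0.5, and passing `normalized=(True, True)` yields the soft cosine measure.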
References
The soft cosine measure was perhaps first described by [sidorovetal14]. Further notes on the efficient implementation of the soft cosine measure are described by [novotny18].
- sidorovetal14
Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.
- novotny18
Vít Novotný, “Implementation Notes for the Soft Cosine Measure”, 2018, http://dx.doi.org/10.1145/3269206.3269317.
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶ Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
-
class gensim.similarities.termsim.TermSimilarityIndex
Base class that defines a common interface for retrieving the most similar terms for a given term.
See also
SparseTermSimilarityMatrix
Build a term similarity matrix and compute the Soft Cosine Measure.
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
most_similar
(term, topn=10)¶ Get most similar terms for a given term.
Return the most similar terms for a given term along with their similarities.
- Parameters
term (str) – The term for which we are retrieving topn most similar terms.
topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.
- Returns
Most similar terms along with their similarities to term. Only terms distinct from term must be returned.
- Return type
iterable of (str, float)
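The most_similar contract described above can be illustrated with a small stand-in class. This is only a sketch: `LookupTermSimilarityIndex` and its precomputed table are hypothetical, the class does not inherit from gensim's TermSimilarityIndex, and sorting by decreasing similarity is an assumption about what "most similar" implies.

```python
class LookupTermSimilarityIndex:
    """Hypothetical index backed by a precomputed table of similarities."""

    def __init__(self, table):
        # table maps a term to a dict of {other_term: similarity}.
        self.table = table

    def most_similar(self, term, topn=10):
        """Return up to topn (other_term, similarity) pairs, excluding term."""
        neighbours = self.table.get(term, {})
        ranked = sorted(neighbours.items(), key=lambda item: item[1], reverse=True)
        return [(other, sim) for other, sim in ranked if other != term][:topn]


# Hypothetical similarities for a single term.
lookup_index = LookupTermSimilarityIndex(
    {"graph": {"graph": 1.0, "trees": 0.8, "minors": 0.6, "computer": 0.1}}
)
```

Calling `lookup_index.most_similar("graph", topn=2)` returns the two closest terms and never the query term itself, which is the behaviour the Returns entry requires of implementations.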
-
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶ Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
-
class gensim.similarities.termsim.UniformTermSimilarityIndex(dictionary, term_similarity=0.5)
Retrieves most similar terms for a given term under the hypothesis that the similarities between distinct terms are uniform.
- Parameters
dictionary (Dictionary) – A dictionary that specifies the considered terms.
term_similarity (float, optional) – The uniform similarity between distinct terms.
See also
SparseTermSimilarityMatrix
Build a term similarity matrix and compute the Soft Cosine Measure.
Notes
This class is mainly intended for testing SparseTermSimilarityMatrix and other classes that depend on the TermSimilarityIndex.
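The uniform hypothesis is simple enough to sketch in a few lines. The function below is a hypothetical stand-alone illustration, not the gensim class: it takes a plain list of terms in place of a Dictionary and assigns the same similarity to every term other than the query.

```python
from itertools import islice


def uniform_most_similar(terms, term, term_similarity=0.5, topn=10):
    """Return up to topn terms distinct from term, all with the same similarity."""
    others = (other for other in terms if other != term)
    return [(other, term_similarity) for other in islice(others, topn)]


# Hypothetical vocabulary standing in for a Dictionary.
vocabulary = ["graph", "trees", "computer"]
```

Because every pair of distinct terms receives the same value, the output is fully predictable, which is what makes this kind of index convenient for tests.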
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
most_similar
(t1, topn=10)¶ Get most similar terms for a given term.
Return the most similar terms for a given term along with their similarities.
- Parameters
t1 (str) – The term for which we are retrieving topn most similar terms.
topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.
- Returns
Most similar terms along with their similarities to t1. Only terms distinct from t1 are returned.
- Return type
iterable of (str, float)
-
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶ Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.
-
class gensim.similarities.termsim.WordEmbeddingSimilarityIndex(keyedvectors, threshold=0.0, exponent=2.0, kwargs=None)
Use objects of this class to:
Compute cosine similarities between word embeddings.
Retrieve the closest word embeddings (by cosine similarity) to a given word embedding.
- Parameters
keyedvectors (KeyedVectors) – The word embeddings.
threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.
exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.
kwargs (dict or None) – A dict with keyword arguments that will be passed to the keyedvectors.most_similar method when retrieving the word embeddings closest to a given word embedding.
See also
SparseTermSimilarityMatrix
Build a term similarity matrix and compute the Soft Cosine Measure.
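The effect of threshold and exponent can be sketched without gensim. The code below is a plain-Python illustration under stated assumptions: embeddings are a hypothetical dict of term to list of floats, similarities are computed by brute force instead of through keyedvectors.most_similar, and the function names are made up for this example.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_embeddings(embeddings, term, threshold=0.0, exponent=2.0, topn=10):
    """Return up to topn (other_term, similarity ** exponent) pairs whose
    cosine similarity to term is strictly greater than threshold."""
    query = embeddings[term]
    scored = []
    for other, vector in embeddings.items():
        if other == term:
            continue
        similarity = cosine(query, vector)
        if similarity > threshold:
            scored.append((other, similarity ** exponent))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Hypothetical 2-dimensional embeddings.
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.0],   # same direction as "a"
    "c": [0.0, 1.0],   # orthogonal to "a"
    "d": [-1.0, 0.0],  # opposite to "a"
    "e": [1.0, 1.0],   # 45 degrees from "a"
}
```

With the defaults, the orthogonal and opposite vectors are dropped by the threshold, and raising the surviving similarities to the power of 2 pushes weaker matches (such as "e") further towards zero than strong ones.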
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
most_similar
(t1, topn=10)¶ Get most similar terms for a given term.
Return the most similar terms for a given term along with their similarities.
- Parameters
t1 (str) – The term for which we are retrieving topn most similar terms.
topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.
- Returns
Most similar terms along with their similarities to t1. Only terms distinct from t1 are returned.
- Return type
iterable of (str, float)
-
save
(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)¶ Save the object to a file.
- Parameters
fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.
separately (list of str or None, optional) –
If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and for sharing them in RAM between multiple processes.
If list of str: store these attributes into separate files. The automated size check is not performed in this case.
sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.
ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.
pickle_protocol (int, optional) – Protocol number for pickle.
See also
load()
Load object from file.