similarities.termsim – Term similarity queries

This module provides classes that deal with term similarities.

class gensim.similarities.termsim.SparseTermSimilarityMatrix(source, dictionary=None, tfidf=None, symmetric=True, dominant=False, nonzero_limit=100, dtype=<class 'numpy.float32'>)

Builds a sparse term similarity matrix using a term similarity index.

Examples

>>> from gensim.test.utils import common_texts
>>> from gensim.corpora import Dictionary
>>> from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
>>> from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix
>>> from gensim.similarities.index import AnnoyIndexer
>>> from scikits.sparse.cholmod import cholesky
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
>>> annoy = AnnoyIndexer(model, num_trees=2)  # use annoy for faster word similarity lookups
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv, kwargs={'indexer': annoy})
>>> dictionary = Dictionary(common_texts)
>>> bow_corpus = [dictionary.doc2bow(document) for document in common_texts]
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, symmetric=True, dominant=True)
>>> docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
>>>
>>> query = 'graph trees computer'.split()  # make a query
>>> sims = docsim_index[dictionary.doc2bow(query)]  # calculate similarity of query to each doc from bow_corpus
>>>
>>> word_embeddings = cholesky(similarity_matrix.matrix).L()  # obtain word embeddings from similarity matrix

Check out Tutorial Notebook for more examples.

Parameters
  • source (TermSimilarityIndex or scipy.sparse.spmatrix) – The source of the term similarity. Either a term similarity index that will be used for building the term similarity matrix, or an existing sparse term similarity matrix that will be encapsulated and stored in the matrix attribute. When a matrix is specified as the source, any other parameters will be ignored.

  • dictionary (Dictionary or None, optional) – A dictionary that specifies a mapping between terms and the indices of rows and columns of the resulting term similarity matrix. The dictionary may only be None when source is a scipy.sparse.spmatrix.

  • tfidf (gensim.models.tfidfmodel.TfidfModel or None, optional) – A model that specifies the relative importance of the terms in the dictionary. The columns of the term similarity matrix will be built in decreasing order of term importance, or in the order of term identifiers if None.

  • symmetric (bool, optional) – Whether the symmetry of the term similarity matrix will be enforced. Symmetry is a necessary precondition for positive definiteness, which is necessary if you later wish to derive a unique change-of-basis matrix from the term similarity matrix using Cholesky factorization. Setting symmetric to False will significantly reduce memory usage during matrix construction.

  • dominant (bool, optional) – Whether the strict column diagonal dominance of the term similarity matrix will be enforced. Strict diagonal dominance and symmetry are sufficient preconditions for positive definiteness, which is necessary if you later wish to derive a change-of-basis matrix from the term similarity matrix using Cholesky factorization.

  • nonzero_limit (int or None, optional) – The maximum number of non-zero elements outside the diagonal in a single column of the sparse term similarity matrix. If None, then no limit will be imposed.

  • dtype (numpy.dtype, optional) – The data type of the sparse term similarity matrix.
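
The symmetric and dominant conditions described above can be checked by hand on a small dense matrix. The following is a minimal sketch for illustration only, not gensim's implementation; the helper names is_symmetric and is_strictly_column_dominant and the toy matrices are assumptions made for this example.

```python
def is_symmetric(matrix, tol=1e-9):
    """Return True if matrix[i][j] equals matrix[j][i] for all i, j (within tol)."""
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


def is_strictly_column_dominant(matrix):
    """Return True if each diagonal element is larger in absolute value than
    the sum of absolute values of the other elements in its column."""
    n = len(matrix)
    for j in range(n):
        off_diagonal = sum(abs(matrix[i][j]) for i in range(n) if i != j)
        if abs(matrix[j][j]) <= off_diagonal:
            return False
    return True


# Toy term similarity matrices with ones on the diagonal (made-up values).
dominant_matrix = [
    [1.0, 0.3, 0.2],
    [0.3, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
non_dominant_matrix = [
    [1.0, 0.6, 0.5],  # the first column sums to 1.1 off the diagonal
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

print(is_symmetric(dominant_matrix), is_strictly_column_dominant(dominant_matrix))
print(is_symmetric(non_dominant_matrix), is_strictly_column_dominant(non_dominant_matrix))
```

Both toy matrices are symmetric, but only the first one is strictly column diagonally dominant, so only the first one is guaranteed to be positive definite by the argument given for the dominant parameter.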

matrix

The encapsulated sparse term similarity matrix.

Type

scipy.sparse.csc_matrix

Raises

ValueError – If dictionary is empty.

inner_product(X, Y, normalized=(False, False))

Get the inner product(s) between real vectors / corpora X and Y.

Return the inner product(s) between real vectors / corpora X and Y expressed in a non-orthogonal normalized basis, where the dot product between the basis vectors is given by the sparse term similarity matrix.

Parameters
  • X (list of (int, float) or iterable of list of (int, float)) – A query vector / corpus in the sparse bag-of-words format.

  • Y (list of (int, float) or iterable of list of (int, float)) – A document vector / corpus in the sparse bag-of-words format.

  • normalized (tuple of {True, False, 'maintain'}, optional) – First/second value specifies whether the query/document vectors in the inner product will be L2-normalized (True; corresponds to the soft cosine measure), maintain their L2-norm during change of basis (‘maintain’; corresponds to query expansion with partial membership), or kept as-is (False; corresponds to query expansion; default).

Returns

The inner product(s) between X and Y.

Return type

self.matrix.dtype, scipy.sparse.csr_matrix, or numpy.matrix
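
To make the change of basis concrete, the sketch below computes the inner product of two bag-of-words vectors as x^T S y, where S is a small dense term similarity matrix, and optionally divides by sqrt(x^T S x) and sqrt(y^T S y) to obtain the soft cosine measure. It is an illustration under stated assumptions, not gensim's implementation: the names bilinear and soft_inner_product are hypothetical, the matrix is a plain list of lists, only single vectors are supported, and the 'maintain' option is left out.

```python
import math


def bilinear(x, y, similarity):
    """Compute x^T S y for sparse bag-of-words vectors x and y."""
    return sum(x_weight * similarity[i][j] * y_weight
               for i, x_weight in x
               for j, y_weight in y)


def soft_inner_product(x, y, similarity, normalized=(False, False)):
    """Inner product of x and y in the basis described by similarity.

    Only True and False are handled; 'maintain' is not part of this sketch.
    Zero vectors are not handled when normalization is requested.
    """
    result = bilinear(x, y, similarity)
    if normalized[0]:
        result /= math.sqrt(bilinear(x, x, similarity))
    if normalized[1]:
        result /= math.sqrt(bilinear(y, y, similarity))
    return result


# Made-up similarities: terms 0 and 1 are related, term 2 is unrelated to both.
similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [(0, 1.0)]                # contains only term 0
document = [(1, 2.0), (2, 1.0)]   # shares no term with the query

plain = soft_inner_product(query, document, similarity)
soft_cosine = soft_inner_product(query, document, similarity, normalized=(True, True))
print(plain, soft_cosine)
```

The query and the document have no term in common, so their ordinary dot product would be zero; the non-zero similarity between terms 0 and 1 is what makes both results positive here.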

References

The soft cosine measure was perhaps first described by [sidorovetal14]. Further notes on the efficient implementation of the soft cosine measure are described by [novotny18].

sidorovetal14

Grigori Sidorov et al., “Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model”, 2014, http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/2043/1921.

novotny18

Vít Novotný, “Implementation Notes for the Soft Cosine Measure”, 2018, http://dx.doi.org/10.1145/3269206.3269317.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.TermSimilarityIndex

Base class and common interface for retrieving the most similar terms for a given term.

See also

SparseTermSimilarityMatrix

Build a term similarity matrix and compute the Soft Cosine Measure.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(term, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • term (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to term that will be retrieved.

Returns

Most similar terms along with their similarities to term. Only terms distinct from term must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.UniformTermSimilarityIndex(dictionary, term_similarity=0.5)

Retrieves most similar terms for a given term under the hypothesis that the similarities between distinct terms are uniform.

Parameters
  • dictionary (Dictionary) – A dictionary that specifies the considered terms.

  • term_similarity (float, optional) – The uniform similarity between distinct terms.

See also

SparseTermSimilarityMatrix

Build a term similarity matrix and compute the Soft Cosine Measure.

Notes

This class is mainly intended for testing SparseTermSimilarityMatrix and other classes that depend on the TermSimilarityIndex.
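
The behaviour described above, where every term distinct from the queried one receives the same similarity, can be sketched without gensim. The class below is a hypothetical stand-in written for illustration: the name UniformIndexSketch, the plain list of terms in place of a Dictionary, and the alphabetical ordering of results are all assumptions of this example rather than documented behaviour.

```python
class UniformIndexSketch:
    """Hypothetical stand-in that mimics a uniform term similarity index."""

    def __init__(self, terms, term_similarity=0.5):
        self.terms = sorted(terms)  # assumed ordering, chosen for determinism
        self.term_similarity = term_similarity

    def most_similar(self, t1, topn=10):
        """Yield up to topn (term, similarity) pairs for terms other than t1."""
        yielded = 0
        for t2 in self.terms:
            if yielded >= topn:
                break
            if t2 != t1:
                yield t2, self.term_similarity
                yielded += 1


index = UniformIndexSketch(["computer", "graph", "human", "trees"])
neighbours = list(index.most_similar("graph", topn=2))
print(neighbours)
```

Because the queried term is skipped and every other term shares one similarity value, a matrix built from such an index has a single constant value in all of its stored off-diagonal entries, which makes the expected results easy to reason about in tests.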

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • t1 (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.

Returns

Most similar terms along with their similarities to t1. Only terms distinct from t1 must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

class gensim.similarities.termsim.WordEmbeddingSimilarityIndex(keyedvectors, threshold=0.0, exponent=2.0, kwargs=None)

Use objects of this class to:

  1. Compute cosine similarities between word embeddings.

  2. Retrieve the closest word embeddings (by cosine similarity) to a given word embedding.

Parameters
  • keyedvectors (KeyedVectors) – The word embeddings.

  • threshold (float, optional) – Only embeddings more similar than threshold are considered when retrieving word embeddings closest to a given word embedding.

  • exponent (float, optional) – Take the word embedding similarities larger than threshold to the power of exponent.

  • kwargs (dict or None) – A dict with keyword arguments that will be passed to the keyedvectors.most_similar method when retrieving the word embeddings closest to a given word embedding.
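
The effect of threshold and exponent can be illustrated on a hand-written list of cosine similarities. This is a minimal sketch under stated assumptions, not the class's actual code: the function name filter_and_sharpen and the similarity values are invented for the example, and the nearest-neighbour lookup that would normally produce the input is replaced by a literal list.

```python
def filter_and_sharpen(neighbours, threshold=0.0, exponent=2.0):
    """Keep similarities strictly above threshold and raise them to exponent."""
    for term, similarity in neighbours:
        if similarity > threshold:
            yield term, similarity ** exponent


# Made-up cosine similarities, in the form a nearest-neighbour lookup might return.
raw_neighbours = [("graph", 0.8), ("trees", 0.5), ("minors", -0.2)]
processed = list(filter_and_sharpen(raw_neighbours, threshold=0.0, exponent=2.0))
print(processed)
```

With an exponent above 1, similarities between 0 and 1 shrink, and weaker similarities shrink proportionally more than strong ones, so the resulting term similarity matrix puts more relative weight on closely related terms.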

See also

SparseTermSimilarityMatrix

Build a term similarity matrix and compute the Soft Cosine Measure.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

most_similar(t1, topn=10)

Get most similar terms for a given term.

Return the most similar terms for a given term along with their similarities.

Parameters
  • t1 (str) – The term for which we are retrieving topn most similar terms.

  • topn (int, optional) – The maximum number of most similar terms to t1 that will be retrieved.

Returns

Most similar terms along with their similarities to t1. Only terms distinct from t1 must be returned.

Return type

iterable of (str, float)

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling will be performed; all attributes will be saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files. This prevents memory errors for large objects, and also allows memory-mapping the large arrays for efficient loading and sharing the large arrays in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.