similarities.nmslib – Approximate Vector Search using NMSLIB¶
This module integrates NMSLIB fast similarity
search with Gensim’s Word2Vec, Doc2Vec,
FastText and KeyedVectors
vector embeddings.
Important
To use this module, you must have the external nmslib library installed.
To install it, run pip install nmslib.
To use the integration, instantiate the NmslibIndexer class
and pass the instance as the indexer parameter to your model’s model.most_similar() method.
Example usage¶
>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, iter=10, seed=2)
>>>
>>> indexer = NmslibIndexer(model)
>>> model.wv.most_similar("cat", topn=2, indexer=indexer)
[('cat', 1.0), ('meow', 0.16398882865905762)]
Load and save example¶
>>> from gensim.similarities.nmslib import NmslibIndexer
>>> from gensim.models import Word2Vec
>>> from tempfile import mkstemp
>>>
>>> sentences = [['cute', 'cat', 'say', 'meow'], ['cute', 'dog', 'say', 'woof']]
>>> model = Word2Vec(sentences, min_count=1, seed=2, iter=10)
>>>
>>> indexer = NmslibIndexer(model)
>>> _, temp_fn = mkstemp()
>>> indexer.save(temp_fn)
>>>
>>> new_indexer = NmslibIndexer.load(temp_fn)
>>> model.wv.most_similar("cat", topn=2, indexer=new_indexer)
[('cat', 1.0), ('meow', 0.5595494508743286)]
What is NMSLIB¶
Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluating similarity search methods. The core library does not have any third-party dependencies. More information about NMSLIB is available in its GitHub repository.
Why use NMSLIB?¶
Gensim’s native Similarity for finding the k nearest neighbors to a vector
uses brute force and has linear complexity, albeit with extremely low constant factors.
The retrieved results are exact, which is overkill in many applications: approximate results retrieved in sub-linear time may be enough.
NMSLIB can find approximate nearest neighbors much faster, similar to Spotify’s Annoy library.
Compared to Annoy, NMSLIB has more parameters to
control build time, query time and accuracy. NMSLIB often achieves faster and more accurate
nearest-neighbor search than Annoy.
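For reference, the exact brute-force search that an approximate index replaces can be sketched in a few lines. This is a minimal illustration in plain Python, not Gensim's implementation: the function names and the toy vectors are made up for the example. Every stored vector is scored against the query, so the cost of one lookup grows linearly with the number of vectors.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exact_most_similar(vectors, query, num_neighbors):
    """Score every stored vector against the query: O(n) work per lookup."""
    scored = ((label, cosine_similarity(vec, query)) for label, vec in vectors.items())
    return heapq.nlargest(num_neighbors, scored, key=lambda pair: pair[1])


# Toy 2-d vectors, made up for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
exact_neighbors = exact_most_similar(toy_vectors, toy_vectors["cat"], 2)
```

An approximate index trades this guarantee of exactness for lookups that avoid visiting every vector.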
-
class
gensim.similarities.nmslib.NmslibIndexer(model, index_params=None, query_time_params=None)¶ This class allows NMSLIB to be used as the indexer for the most_similar method of the
Word2Vec, Doc2Vec, FastText and Word2VecKeyedVectors classes.
- Parameters
model (BaseWordEmbeddingsModel) – Model that will be used as the source for the index.
index_params (dict, optional) – Indexing parameters passed through to NMSLIB: https://github.com/nmslib/nmslib/blob/master/manual/methods.md#graph-based-search-methods-sw-graph-and-hnsw
If not specified, defaults to {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100, 'post': 0}.
query_time_params (dict, optional) – Query-time parameters for the NMSLIB indexer. If not specified, defaults to {'efSearch': 100}.
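The sketch below shows one way to override only some of these parameters while keeping the documented defaults for the rest. It assumes that a dict passed explicitly is used as given rather than merged with the defaults, which is worth checking against your installed version. The helper with_defaults and the override values are hypothetical and are not tuning recommendations.

```python
# Defaults copied from the parameter documentation above.
DEFAULT_INDEX_PARAMS = {'M': 100, 'indexThreadQty': 1, 'efConstruction': 100, 'post': 0}
DEFAULT_QUERY_TIME_PARAMS = {'efSearch': 100}


def with_defaults(overrides, defaults):
    """Hypothetical helper: return a copy of `defaults` updated with `overrides`."""
    merged = dict(defaults)
    merged.update(overrides or {})
    return merged


# Example values only; they are not tuning recommendations.
index_params = with_defaults({'M': 32, 'efConstruction': 200}, DEFAULT_INDEX_PARAMS)
query_time_params = with_defaults({'efSearch': 50}, DEFAULT_QUERY_TIME_PARAMS)

# With gensim and nmslib installed, the dicts would be passed like this:
# indexer = NmslibIndexer(model, index_params=index_params, query_time_params=query_time_params)
```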
-
classmethod
load(fname)¶ Load an NmslibIndexer instance from a file.
- Parameters
fname (str) – Path previously used in save().
-
most_similar(vector, num_neighbors)¶ Find the approximate num_neighbors most similar items.
- Parameters
vector (numpy.array) – Vector for a word or document.
num_neighbors (int) – Number of most similar items to return.
- Returns
List of most similar items in the format [(item, cosine_similarity), … ].
- Return type
list of (str, float)
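In the example output earlier on this page, querying with the vector of an indexed word returns that word itself first, with similarity 1.0. If only the other neighbors are wanted, the result list can be post-processed. The helper below is a hypothetical sketch that works on any list in the documented (item, similarity) format; it is not part of Gensim.

```python
def drop_query_item(results, query_item, topn):
    """Hypothetical helper: remove the query item and keep at most `topn` entries."""
    filtered = [(item, sim) for item, sim in results if item != query_item]
    return filtered[:topn]


# Result list copied from the "Example usage" output earlier on this page.
raw_results = [('cat', 1.0), ('meow', 0.16398882865905762)]
neighbors_only = drop_query_item(raw_results, 'cat', topn=1)
```

When using this pattern, request one more neighbor than needed so that dropping the query item still leaves enough results.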
-
save(fname, protocol=2)¶ Save this NmslibIndexer instance to a file.
- Parameters
fname (str) – Path to the output file. Two files are produced: fname (the parameters) and fname.d (the NMSLIB index).
protocol (int, optional) – Protocol for pickle.
Notes
This method saves only the index (the model isn’t preserved).
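Because save() is documented to write two files, cleaning up a temporary index means removing both fname and fname.d. The sketch below illustrates this with stand-in files so that it runs without gensim or nmslib installed. The helper names saved_index_paths and remove_saved_index are hypothetical.

```python
import os
import tempfile


def saved_index_paths(fname):
    """Hypothetical helper: the two paths that save(fname) is documented to write."""
    return [fname, fname + '.d']


def remove_saved_index(fname):
    """Delete whichever of the two files exist; return the paths actually removed."""
    removed = []
    for path in saved_index_paths(fname):
        if os.path.exists(path):
            os.remove(path)
            removed.append(path)
    return removed


# Stand-in files so the sketch runs without gensim or nmslib installed.
fd, temp_fn = tempfile.mkstemp()
os.close(fd)  # mkstemp returns an open descriptor; close it to avoid leaking it
with open(temp_fn + '.d', 'wb') as handle:
    handle.write(b'placeholder for the NMSLIB index')

removed_paths = remove_saved_index(temp_fn)
```

Since the model is not stored with the index, it has to be saved and reloaded separately if it is needed again later.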
