models.translation_matrix – Translation Matrix model

Produce a translation matrix for translating words from one language to another, using either the standard nearest neighbour method or the globally corrected neighbour retrieval method [1].

This method can be used to augment existing phrase tables with more candidate translations, or to filter out errors from translation tables and known dictionaries [2]. It also works for any two sets of named vectors where there are some paired guideposts from which to learn the transformation.

Examples

How to translate between two sets of word vectors

Initialize the word-vector models

>>> from gensim.models import KeyedVectors
>>> from gensim.test.utils import datapath
>>>
>>> model_en = KeyedVectors.load_word2vec_format(datapath("EN.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt"))
>>> model_it = KeyedVectors.load_word2vec_format(datapath("IT.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt"))

Define word pairs (that will be used to construct the translation matrix)

>>> word_pairs = [
...     ("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque"),
...     ("seven", "sette"), ("eight", "otto"),
...     ("dog", "cane"), ("pig", "maiale"), ("fish", "cavallo"), ("birds", "uccelli"),
...     ("apple", "mela"), ("orange", "arancione"), ("grape", "acino"), ("banana", "banana")
... ]

Fit TranslationMatrix

>>> from gensim.models import TranslationMatrix
>>>
>>> trans_model = TranslationMatrix(model_en, model_it, word_pairs=word_pairs)

Apply model (translate words “dog” and “one”)

>>> trans_model.translate(["dog", "one"], topn=3)
OrderedDict([('dog', [u'cane', u'gatto', u'cavallo']), ('one', [u'uno', u'due', u'tre'])])

Save / load model

>>> from gensim.test.utils import temporary_file
>>>
>>> with temporary_file("model_file") as fname:
...     trans_model.save(fname)  # save model to file
...     loaded_trans_model = TranslationMatrix.load(fname)  # load model

How to translate between two Doc2Vec models

Prepare data and models

>>> from gensim.test.utils import datapath
>>> from gensim.test.test_translation_matrix import read_sentiment_docs
>>> from gensim.models import Doc2Vec
>>>
>>> data = read_sentiment_docs(datapath("alldata-id-10.txt"))[:5]
>>> src_model = Doc2Vec.load(datapath("small_tag_doc_5_iter50"))
>>> dst_model = Doc2Vec.load(datapath("large_tag_doc_10_iter50"))

Train backward translation

>>> from gensim.models import BackMappingTranslationMatrix
>>>
>>> model_trans = BackMappingTranslationMatrix(src_model, dst_model)
>>> trans_matrix = model_trans.train(data)

Apply model

>>> result = model_trans.infer_vector(dst_model.dv[data[3].tags])

References

[1]

Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. “Improving zero-shot learning by mitigating the hubness problem”, https://arxiv.org/abs/1412.6568

[2]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. “Distributed Representations of Words and Phrases and their Compositionality”, https://arxiv.org/abs/1310.4546

class gensim.models.translation_matrix.BackMappingTranslationMatrix(source_lang_vec, target_lang_vec, tagged_docs=None, random_state=None)

Bases: gensim.utils.SaveLoad

Realize the BackMapping translation matrix, which maps the source model’s document vectors to the target model’s document vectors (the old model).

The BackMapping translation matrix learns a mapping between two document vector spaces, which we call the source document vectors and the target document vectors. The target document vectors are trained on a superset of the corpus used for the source document vectors, so the BackMapping translation matrix lets us incrementally add vectors to the old model.

For details, see the notebook [3].
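
As a rough illustration of the idea, the sketch below fits a linear map between two toy document-vector spaces by least squares and uses it to bring a vector from the larger (target) space into the smaller (source) space. It relies only on NumPy; the array names, the dimensions and the synthetic data are assumptions made for this example and are not part of the gensim API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the vectors of the same 20 documents in two models.
# The dimensions (10 for the target space, 5 for the source space) are arbitrary.
target_vecs = rng.normal(size=(20, 10))
true_map = rng.normal(size=(10, 5))
source_vecs = target_vecs @ true_map  # an exactly linear relation, for clarity

# Fit the matrix W that minimises ||target_vecs @ W - source_vecs||.
W, *_ = np.linalg.lstsq(target_vecs, source_vecs, rcond=None)

# Map a document vector that exists only in the target space into the source space.
new_target_vec = rng.normal(size=10)
mapped = new_target_vec @ W
```

With real models the relation between the two spaces is only approximately linear, so the fitted matrix gives an approximation rather than an exact reconstruction.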

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.test.test_translation_matrix import read_sentiment_docs
>>> from gensim.models import Doc2Vec, BackMappingTranslationMatrix
>>>
>>> data = read_sentiment_docs(datapath("alldata-id-10.txt"))[:5]
>>> src_model = Doc2Vec.load(datapath("small_tag_doc_5_iter50"))
>>> dst_model = Doc2Vec.load(datapath("large_tag_doc_10_iter50"))
>>>
>>> model_trans = BackMappingTranslationMatrix(src_model, dst_model)
>>> trans_matrix = model_trans.train(data)
>>>
>>> result = model_trans.infer_vector(dst_model.dv[data[3].tags])
Parameters
  • source_lang_vec (Doc2Vec) – Source Doc2Vec model.

  • target_lang_vec (Doc2Vec) – Target Doc2Vec model.

  • tagged_docs (list of TaggedDocument, optional) – Documents that will be used for training. Both the source and the target document vectors must have been trained on these tagged documents.

  • random_state ({None, int, array_like}, optional) – Seed for random state.

infer_vector(target_doc_vec)

Translate the target model’s document vector to the source model’s document vector.

Parameters

target_doc_vec (numpy.ndarray) – Document vector from the target model, for a document that is not in the source model.

Returns

Vector target_doc_vec in the source model.

Return type

numpy.ndarray

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(fname_or_handle, separately=None, sep_limit=10485760, ignore=frozenset({}), pickle_protocol=2)

Save the object to a file.

Parameters
  • fname_or_handle (str or file-like) – Path to output file or already opened file-like object. If the object is a file handle, no special array handling is performed and all attributes are saved to the same file.

  • separately (list of str or None, optional) –

    If None, automatically detect large numpy/scipy.sparse arrays in the object being stored and store them in separate files. This prevents memory errors for large objects and also allows the large arrays to be memory-mapped, for efficient loading and sharing in RAM between multiple processes.

    If list of str: store these attributes into separate files. The automated size check is not performed in this case.

  • sep_limit (int, optional) – Don’t store arrays smaller than this separately. In bytes.

  • ignore (frozenset of str, optional) – Attributes that shouldn’t be stored at all.

  • pickle_protocol (int, optional) – Protocol number for pickle.

See also

load()

Load object from file.

train(tagged_docs)

Build the translation matrix that maps the source model’s vectors to the target model’s vectors.

Parameters

tagged_docs (list of TaggedDocument) – Documents that will be used for training. Both the source and the target document vectors must have been trained on these tagged documents.

Returns

Translation matrix that maps the source model’s vectors to the target model’s vectors.

Return type

numpy.ndarray

class gensim.models.translation_matrix.Space(matrix, index2word)

Bases: object

An auxiliary class for storing the word space.

Parameters
  • matrix (iterable of numpy.ndarray) – Matrix that contains word-vectors.

  • index2word (list of str) – Words which correspond to the matrix.

classmethod build(lang_vec, lexicon=None)

Construct a Space object for the lexicon, if one is provided.

Parameters
  • lang_vec (KeyedVectors) – Model from which the vectors will be extracted.

  • lexicon (list of str, optional) – Words contained in lang_vec. If None, all words in lang_vec are used.

Returns

Object that stores the word vectors.

Return type

Space

normalize()

Normalize the matrix of word vectors.
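
The role of this helper can be pictured with a small stand-in: pick the vectors of the lexicon words, stack them into a matrix, and scale each row to unit length so that dot products become cosine similarities. The class below is a simplified toy that assumes row-wise L2 normalisation; it accepts a plain dict instead of KeyedVectors, and the name ToySpace is hypothetical, not gensim's Space.

```python
import numpy as np


class ToySpace:
    """Simplified stand-in for a word space: a matrix plus the matching word list."""

    def __init__(self, matrix, index2word):
        self.mat = np.asarray(matrix, dtype=float)
        self.index2word = list(index2word)
        # Reverse lookup from a word to its row index in the matrix.
        self.word2index = {word: idx for idx, word in enumerate(self.index2word)}

    @classmethod
    def build(cls, vectors, lexicon=None):
        # `vectors` is a plain dict {word: vector}; use every word if no lexicon is given.
        words = list(vectors) if lexicon is None else list(lexicon)
        return cls([vectors[word] for word in words], words)

    def normalize(self):
        # Scale every row to unit Euclidean length (assumed normalisation).
        self.mat = self.mat / np.linalg.norm(self.mat, axis=1, keepdims=True)


vectors = {"dog": [3.0, 4.0], "cat": [0.0, 2.0], "one": [1.0, 0.0]}
space = ToySpace.build(vectors, lexicon=["dog", "cat"])
space.normalize()
```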

class gensim.models.translation_matrix.TranslationMatrix(source_lang_vec, target_lang_vec, word_pairs=None, random_state=None)

Bases: gensim.utils.SaveLoad

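Objects of this class realize the translation matrix, which maps the source language to the target language. The main methods are train(), which builds the matrix from word pairs, and translate(), which translates source words into the target language.

Given the vector x of a source word, we map it to the target language space by computing z = Wx, and then return the words whose representations are closest to z.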
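
To make the computation concrete, here is a minimal NumPy sketch of the two steps: fitting W from paired vectors and retrieving the nearest target word by cosine similarity. The toy vocabulary, the 2-dimensional vectors and the helper name translate_word are invented for illustration; this is a sketch under those assumptions, not gensim's implementation.

```python
import numpy as np

# Toy vectors: the target space is the source space rotated by 90 degrees.
source = {"one": np.array([1.0, 0.0]), "two": np.array([0.0, 1.0]), "dog": np.array([1.0, 1.0])}
target = {"uno": np.array([0.0, 1.0]), "due": np.array([-1.0, 0.0]), "cane": np.array([-1.0, 1.0])}
pairs = [("one", "uno"), ("two", "due")]  # training pairs; "dog" is held out

# Stack the paired vectors row by row and solve X @ M ~= Z in the least-squares sense.
X = np.vstack([source[s] for s, _ in pairs])
Z = np.vstack([target[t] for _, t in pairs])
M, *_ = np.linalg.lstsq(X, Z, rcond=None)
W = M.T  # column-vector convention, so that z = W @ x


def translate_word(word, topn=1):
    """Return the `topn` target words whose vectors are closest (by cosine) to W @ x."""
    z = W @ source[word]

    def cosine(v):
        return float(v @ z / (np.linalg.norm(v) * np.linalg.norm(z)))

    ranked = sorted(target, key=lambda w: cosine(target[w]), reverse=True)
    return ranked[:topn]


best = translate_word("dog")
```

Here "dog" was not among the training pairs, yet its mapped vector lands on "cane" because the relation between the two toy spaces is exactly linear.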

For details, see the notebook [3].

Examples

>>> from gensim.models import KeyedVectors, TranslationMatrix
>>> from gensim.test.utils import datapath
>>> en = datapath("EN.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")
>>> it = datapath("IT.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")
>>> model_en = KeyedVectors.load_word2vec_format(en)
>>> model_it = KeyedVectors.load_word2vec_format(it)
>>>
>>> word_pairs = [
...     ("one", "uno"), ("two", "due"), ("three", "tre"), ("four", "quattro"), ("five", "cinque"),
...     ("seven", "sette"), ("eight", "otto"),
...     ("dog", "cane"), ("pig", "maiale"), ("fish", "cavallo"), ("birds", "uccelli"),
...     ("apple", "mela"), ("orange", "arancione"), ("grape", "acino"), ("banana", "banana")
... ]
>>>
>>> trans_model = TranslationMatrix(model_en, model_it)
>>> trans_model.train(word_pairs)
>>> trans_model.translate(["dog", "one"], topn=3)
OrderedDict([('dog', [u'cane', u'gatto', u'cavallo']), ('one', [u'uno', u'due', u'tre'])])

References

[3]

https://github.com/RaRe-Technologies/gensim/blob/3.2.0/docs/notebooks/translation_matrix.ipynb

Parameters
  • source_lang_vec (KeyedVectors) – Word vectors for source language.

  • target_lang_vec (KeyedVectors) – Word vectors for target language.

  • word_pairs (list of (str, str), optional) – Pairs of words that will be used for training.

  • random_state ({None, int, array_like}, optional) – Seed for random state.

apply_transmat(words_space)

Map the source word vectors to the target space using the translation matrix.

Parameters

words_space (Space) – Space object constructed for the words to be translated.

Returns

Space object constructed for the mapped words.

Return type

Space

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Save the model to a file, ignoring source_space and target_space.
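
The general pattern of leaving selected attributes out of a saved object can be sketched with the standard library alone. The class ToyModel and the function dump_attributes below are hypothetical names used for illustration only; they do not reproduce gensim's SaveLoad code.

```python
import io
import pickle


class ToyModel:
    def __init__(self):
        self.translation_matrix = [[0.0, -1.0], [1.0, 0.0]]
        self.source_space = "large cached object"
        self.target_space = "large cached object"


def dump_attributes(obj, handle, ignore=frozenset()):
    # Serialise only the attributes that are not listed in `ignore`.
    state = {name: value for name, value in vars(obj).items() if name not in ignore}
    pickle.dump(state, handle)


buffer = io.BytesIO()
dump_attributes(ToyModel(), buffer, ignore={"source_space", "target_space"})

# Read the state back to confirm that the ignored attributes were not stored.
buffer.seek(0)
restored_state = pickle.load(buffer)
```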

train(word_pairs)

Build the translation matrix that maps from the source space to the target space.

Parameters

word_pairs (list of (str, str)) – Pairs of words that will be used for training.

translate(source_words, topn=5, gc=0, sample_num=None, source_lang_vec=None, target_lang_vec=None)

Translate words from the source language to the target language.

Parameters
  • source_words ({str, list of str}) – Single word or a list of words to be translated.

  • topn (int, optional) – Number of words that will be returned as translations for each word in source_words.

  • gc (int, optional) – Translation algorithm to use. If gc == 0, use standard nearest neighbour retrieval; otherwise, use the globally corrected neighbour retrieval method (as described in [1]).

  • sample_num (int, optional) – Number of words to sample from the source lexicon. Must be provided if gc == 1.

  • source_lang_vec (KeyedVectors, optional) – New source language vectors for translation. By default, the model’s source language vectors are used.

  • target_lang_vec (KeyedVectors, optional) – New target language vectors for translation. By default, the model’s target language vectors are used.

Returns

Ordered dict where each item is word: [translated_word_1, translated_word_2, …]

Return type

collections.OrderedDict
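
The difference between the two retrieval modes can be illustrated on a small similarity matrix. In the sketch below, rows are mapped source words and columns are candidate target words; the numbers are made up so that the first target acts as a "hub" that is close to every query. The function names are hypothetical, and the scoring rule (rank minus cosine) is a simplified reading of the globally corrected method of Dinu et al., not gensim's code.

```python
import numpy as np

# Cosine similarities: rows are mapped source words (queries), columns are target words.
sim = np.array([
    [0.9, 0.3, 0.1],
    [0.8, 0.7, 0.2],
    [0.6, 0.1, 0.5],
])


def nearest_neighbour(sim):
    # Standard retrieval: each query picks its most similar target.
    return sim.argmax(axis=1)


def globally_corrected(sim):
    # Rank of each query in every target's neighbour list (0 = closest query).
    ranks = np.argsort(np.argsort(-sim, axis=0), axis=0)
    # Each query picks the target in whose list it ranks best; cosine breaks ties.
    return (ranks - sim).argmin(axis=1)


nn_choice = nearest_neighbour(sim)   # every query falls onto the hub (column 0)
gc_choice = globally_corrected(sim)  # each query gets a different target
```

In practice the ranking would be computed over more source words than just the queries, which is presumably why sample_num is required when gc == 1.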