corpora.bleicorpus – Corpus in Blei’s LDA-C format

Corpus in Blei’s LDA-C format.

class gensim.corpora.bleicorpus.BleiCorpus(fname, fname_vocab=None)

Bases: gensim.corpora.indexedcorpus.IndexedCorpus

Corpus in Blei’s LDA-C format.

The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.

Each document is one line:

N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN

The vocabulary is a file with words, one word per line; word at line K has an implicit id=K.
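
The line layout above can be illustrated with a minimal, standalone sketch. It does not use gensim; `parse_ldac_line` and the three-word vocabulary are hypothetical names and data introduced only for this example, and ids are assumed to be counted from zero.

```python
def parse_ldac_line(line):
    """Parse one LDA-C line into a list of (field_id, value) pairs."""
    parts = line.split()
    declared = int(parts[0])  # N: number of id:value pairs that follow
    pairs = [p.split(":") for p in parts[1:]]
    if declared != len(pairs):
        raise ValueError("expected %d pairs, found %d" % (declared, len(pairs)))
    return [(int(field_id), float(value)) for field_id, value in pairs]


# Hypothetical vocabulary file content, one word per line.
vocab_lines = ["human", "interface", "computer"]
# The word on line K gets the implicit id K (zero-based in this sketch).
id2word = dict(enumerate(vocab_lines))

doc = parse_ldac_line("2 0:1 2:3")
words = [(id2word[field_id], value) for field_id, value in doc]
print(doc)
print(words)
```

The leading count is redundant with the number of pairs, so the sketch uses it as a consistency check on each line.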

Parameters
  • fname (str) – Path to corpus.

  • fname_vocab (str, optional) –

    Vocabulary file. If fname_vocab is None, the following variants are searched:

    • fname.vocab

    • fname/vocab.txt

    • fname_without_ext.vocab

    • fname_folder/vocab.txt

Raises

IOError – If vocabulary file doesn’t exist.
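
The lookup order listed above could be sketched as follows. This is a standalone illustration under the assumption that the variants are tried in the order shown and the first existing file is used; `candidate_vocab_paths` and `find_vocab` are hypothetical helpers, not part of the gensim API.

```python
import os


def candidate_vocab_paths(fname):
    """Return the vocabulary locations to try, in the order listed above."""
    base, _ = os.path.splitext(fname)  # fname without its extension
    folder = os.path.dirname(fname)    # directory that contains fname
    return [
        fname + ".vocab",
        fname + "/vocab.txt",
        base + ".vocab",
        folder + "/vocab.txt",
    ]


def find_vocab(fname):
    """Return the first candidate that exists, or raise IOError."""
    for candidate in candidate_vocab_paths(fname):
        if os.path.exists(candidate):
            return candidate
    raise IOError("could not find a vocabulary file for %s" % fname)


candidates = candidate_vocab_paths("data/ap.dat")
print(candidates)
```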

docbyoffset(offset)

Get the document stored at the given offset. Offsets are returned by save_corpus().

Parameters

offset (int) – Position of the document in the file (in bytes).

Returns

Document in BoW format.

Return type

list of (int, float)
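
The idea behind offset-based access can be shown with a small standalone sketch. It only illustrates the mechanism (record the byte position of each line while writing, then seek to it when reading) and is not gensim's implementation; the file name and `read_line_at` are hypothetical.

```python
import os
import tempfile

# Three toy documents, already formatted as LDA-C lines.
lines = [b"2 0:1 2:3\n", b"1 1:2\n", b"3 0:1 1:1 2:1\n"]

path = os.path.join(tempfile.mkdtemp(), "toy.ldac")
offsets = []
with open(path, "wb") as fout:
    for line in lines:
        offsets.append(fout.tell())  # byte position where this document starts
        fout.write(line)


def read_line_at(path, offset):
    """Jump straight to a byte offset and read the single line stored there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8")


second = read_line_at(path, offsets[1])
print(offsets, repr(second))
```

Reading by offset avoids scanning the file from the start, which is what makes random access to individual documents cheap.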

line2doc(line)

Convert a line in Blei’s LDA-C format to a document (BoW representation).

Parameters

line (str) – Line in Blei’s LDA-C format.

Returns

Document’s BoW representation.

Return type

list of (int, float)

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap=’r’. If the file being loaded is compressed (either ‘.gz’ or ‘.bz2’), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save a corpus in the LDA-C format.

Notes

There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Input corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word for corpus.

  • metadata (bool, optional) – This parameter is ignored.

Returns

Offsets for each line in file (in bytes).

Return type

list of int
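
As a rough illustration of what one written line looks like, the sketch below formats BoW documents as LDA-C lines. It is standalone and hypothetical: `doc_to_ldac_line` is not a gensim function, and the `%g` number formatting is an assumption made for this example.

```python
def doc_to_ldac_line(doc):
    """Format one BoW document as 'N id1:value1 ... idN:valueN'."""
    parts = ["%i:%g" % (word_id, weight) for word_id, weight in doc]
    return "%i %s" % (len(parts), " ".join(parts))


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 3.0)]]
ldac_lines = [doc_to_ldac_line(doc) for doc in corpus]
for line in ldac_lines:
    print(line)
```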

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize the corpus together with offset metadata, which allows direct indexing of documents after loading.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word.

  • index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.

  • progress_cnt (int, optional) – Number of documents after which progress info is printed.

  • labels (bool, optional) – If True, ignore the first column (class labels).

  • metadata (bool, optional) – If True, serialize will also write article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1 (the second document)
[(1, 0.1)]