corpora.bleicorpus – Corpus in Blei’s LDA-C format
Corpus in Blei’s LDA-C format.
-
class gensim.corpora.bleicorpus.BleiCorpus(fname, fname_vocab=None)
Bases: gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in Blei’s LDA-C format.
The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.
Each document is one line:
N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
The vocabulary is a file with words, one word per line; word at line K has an implicit id=K.
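As a rough illustration of the layout described above, the sketch below renders one bag-of-words document as an LDA-C line and resolves its ids against a vocabulary list. The helper `doc_to_ldac_line` and the sample words are made up for this example and are not part of gensim; counting vocabulary lines from zero is an assumption.

```python
def doc_to_ldac_line(doc):
    """Render one BoW document [(word_id, value), ...] as an LDA-C line."""
    # The leading number is the count of id:value pairs that follow it.
    pairs = " ".join("%d:%g" % (word_id, value) for word_id, value in doc)
    return ("%d %s" % (len(doc), pairs)).strip()


# Hypothetical vocabulary file content, one word per line.
vocab_lines = ["human", "computer", "interface"]
# The word at line K gets id K (assumed here to count from 0).
id2word = dict(enumerate(vocab_lines))

doc = [(0, 1.0), (2, 3.0)]
line = doc_to_ldac_line(doc)
words = [id2word[word_id] for word_id, _ in doc]
```

Here `line` holds the two-pair document as a single line of text, and `words` lists the vocabulary entries that the ids point to.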
- Parameters
fname (str) – Path to corpus.
fname_vocab (str, optional) –
Vocabulary file. If fname_vocab is None, the following locations are tried:
fname.vocab
fname/vocab.txt
fname_without_ext.vocab
fname_folder/vocab.txt
- Raises
IOError – If vocabulary file doesn’t exist.
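The fallback search can be pictured roughly as follows. This is a sketch only: `candidate_vocab_paths` and `find_vocab` are hypothetical helpers, and the order and exact path handling inside gensim may differ from what is assumed here.

```python
import os.path


def candidate_vocab_paths(fname):
    """List the vocabulary locations to try, in the order listed above."""
    base, _ext = os.path.splitext(fname)
    folder = os.path.dirname(fname)
    return [
        fname + ".vocab",        # fname.vocab
        fname + "/vocab.txt",    # fname/vocab.txt
        base + ".vocab",         # fname_without_ext.vocab
        folder + "/vocab.txt",   # fname_folder/vocab.txt
    ]


def find_vocab(fname):
    """Return the first candidate that exists, else raise IOError."""
    for candidate in candidate_vocab_paths(fname):
        if os.path.exists(candidate):
            return candidate
    raise IOError("no vocabulary file found for %s" % fname)


paths = candidate_vocab_paths("data/corpus.ldac")
```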
-
docbyoffset(offset)
Get the document at the given offset. Offsets are returned by save_corpus().
- Parameters
offset (int) – Position of the document in the file (in bytes).
- Returns
Document in BoW format.
- Return type
list of (int, float)
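To show why a byte offset is enough to fetch a single document, here is a small standard-library sketch that seeks into a file and reads one line. It does not call gensim; the file content, the offsets, and the helper `read_line_at` are assumptions made for the example.

```python
import os
import tempfile

# Two documents in LDA-C format; the first line is 10 bytes long including the newline.
content = b"2 0:1 2:3\n1 1:2\n"
offsets = [0, 10]  # assumed start positions, as a writer would have recorded them

with tempfile.NamedTemporaryFile(delete=False) as fout:
    fout.write(content)
    path = fout.name


def read_line_at(path, offset):
    """Jump straight to a byte offset and read one line from there."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").strip()


first = read_line_at(path, offsets[0])
second = read_line_at(path, offsets[1])
os.remove(path)
```

Because the read starts at the stored position, the second document is retrieved without scanning the first.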
-
line2doc(line)
Convert a line in Blei’s LDA-C format to a document (BoW representation).
- Parameters
line (str) – Line in Blei’s LDA-C format.
- Returns
Document’s BoW representation.
- Return type
list of (int, float)
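The conversion can be sketched in plain Python as below. `parse_ldac_line` is a hypothetical stand-in rather than the gensim method itself, and the check that the declared count matches the number of pairs is an assumption about sensible validation.

```python
def parse_ldac_line(line):
    """Parse 'N id1:val1 ... idN:valN' into a list of (int, float) pairs."""
    parts = line.split()
    declared = int(parts[0])
    found = len(parts) - 1
    if declared != found:
        raise ValueError("expected %d pairs, found %d" % (declared, found))
    doc = []
    for part in parts[1:]:
        # Split on the last colon so the id and value are separated cleanly.
        word_id, value = part.rsplit(":", 1)
        doc.append((int(word_id), float(value)))
    return doc


doc = parse_ldac_line("3 0:1 4:2 7:1.5")
```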
-
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
save(*args, **kwargs)
Save the corpus’s in-memory state.
Warning
This saves only the “state” of a corpus class, not the corpus data!
To save the data, use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
-
static save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the LDA-C format.
Notes
Two files are saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Input corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word for corpus.
metadata (bool, optional) – This parameter is ignored.
- Returns
Offsets for each line in file (in bytes).
- Return type
list of int
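The two-file output and the returned offsets can be illustrated with a short standard-library sketch. `write_ldac` is a hypothetical helper, not gensim's implementation; it assumes word ids run from 0 upward without gaps and ignores the handling of missing ids or near-zero values.

```python
import os
import tempfile


def write_ldac(fname, corpus, id2word):
    """Write documents to fname and words to fname + '.vocab'; return byte offsets."""
    offsets = []
    with open(fname, "wb") as fout:
        for doc in corpus:
            offsets.append(fout.tell())  # byte position where this document starts
            pairs = " ".join("%d:%g" % (word_id, value) for word_id, value in doc)
            fout.write(("%d %s\n" % (len(doc), pairs)).encode("utf8"))
    with open(fname + ".vocab", "wb") as fout:
        # One word per line; the line number doubles as the word id.
        for word_id in range(len(id2word)):
            fout.write((id2word[word_id] + "\n").encode("utf8"))
    return offsets


tmpdir = tempfile.mkdtemp()
fname = os.path.join(tmpdir, "corpus.ldac")
corpus = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
id2word = {0: "human", 1: "computer", 2: "interface"}

offsets = write_ldac(fname, corpus, id2word)
with open(fname + ".vocab", "rb") as fin:
    vocab = fin.read().decode("utf8").split()
```

Each returned offset marks where a document's line begins, which is what makes later random access by position possible.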
-
classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize the corpus together with offset metadata, which allows direct indexed access after loading.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index; if None, the index is stored in fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True, ignore the first column (class labels).
metadata (bool, optional) – If True, ensure that serialize writes out article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]