corpora.lowcorpus – Corpus in GibbsLda++ format

Corpus in GibbsLda++ format.

class gensim.corpora.lowcorpus.LowCorpus(fname, id2word=None, line2words=<function split_on_space>)

Bases: gensim.corpora.indexedcorpus.IndexedCorpus

Corpus handles input in GibbsLda++ format.

Format description

Data for training/estimating the model and new (previously unseen) data share the same format:

[M]
[document1]
[document2]
...
[documentM]

in which the first line is the total number of documents [M]. Each line after that is one document. [documenti] is the ith document of the dataset and consists of a list of Ni words/terms:

[documenti] = [wordi1] [wordi2] ... [wordiNi]

in which all [wordij] (i=1..M, j=1..Ni) are text strings separated by the blank character.
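As an illustration of the layout described above, the following sketch writes a tiny file in this format and reads it back with plain Python. The helper name `read_low_file` and the sample words are assumptions made for this example; they are not part of gensim, and the sketch does not claim to mirror how LowCorpus parses files internally.

```python
import os
import tempfile


def read_low_file(path):
    """Hypothetical helper: parse a GibbsLda++-style file into lists of words."""
    with open(path, encoding="utf8") as fin:
        # The first line holds M, the declared number of documents.
        num_docs = int(fin.readline().strip())
        # Each remaining line is one document; words are separated by blanks.
        docs = [[w for w in line.strip().split(" ") if w] for line in fin]
    if len(docs) != num_docs:
        raise ValueError("header says %d documents, found %d" % (num_docs, len(docs)))
    return docs


# Sample data invented for this sketch: M = 2, followed by two documents.
content = "2\nhuman interface computer\ngraph trees graph\n"
with tempfile.NamedTemporaryFile("w", suffix=".low", delete=False, encoding="utf8") as tmp:
    tmp.write(content)

docs = read_low_file(tmp.name)
os.remove(tmp.name)
print(docs)
```

The explicit check of the header against the number of lines read is a choice of this sketch; it makes a truncated or mislabeled file fail loudly.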

Examples

>>> from gensim.test.utils import get_tmpfile, common_texts
>>> from gensim.corpora import LowCorpus
>>> from gensim.corpora import Dictionary
>>>
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>>
>>> # Write corpus in GibbsLda++ format to disk
>>> output_fname = get_tmpfile("corpus.low")
>>> LowCorpus.serialize(output_fname, corpus, dictionary)
>>>
>>> # Read corpus
>>> loaded_corpus = LowCorpus(output_fname)
Parameters
  • fname (str) – Path to file in GibbsLda++ format.

  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from fname.

  • line2words (callable, optional) – Function that converts a line (str) into tokens (list of str); defaults to split_on_space().

docbyoffset(offset)

Get the document stored in file by offset position.

Parameters

offset (int) – Offset (in bytes) to the beginning of the document.

Returns

Document in BoW format.

Return type

list of (int, int)

Examples

>>> from gensim.test.utils import datapath
>>> from gensim.corpora import LowCorpus
>>>
>>> data = LowCorpus(datapath("testcorpus.low"))
>>> data.docbyoffset(1)  # end of first line
[]
>>> data.docbyoffset(2)  # start of second line
[(0, 1), (3, 1), (4, 1)]
property id2word

Get mapping between words and their ids.

line2doc(line)

Convert line into document in BoW format.

Parameters

line (str) – Line from input file.

Returns

Document in BoW format

Return type

list of (int, int)
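To make the BoW conversion concrete, here is a minimal pure-Python sketch of turning one line into (word_id, count) pairs. The function `line_to_bow` and the toy `word2id` mapping are hypothetical; the ordering of the pairs and the decision to skip unknown words are choices of this sketch and may differ from what line2doc() does.

```python
from collections import Counter


def line_to_bow(line, word2id):
    """Hypothetical stand-in for the conversion: tokens -> sorted (id, count) pairs."""
    tokens = [w for w in line.strip().split(" ") if w]
    # Count only words present in the mapping; unknown words are skipped here.
    counts = Counter(word2id[w] for w in tokens if w in word2id)
    return sorted(counts.items())


word2id = {"computer": 0, "graph": 1, "trees": 2}  # assumed toy mapping
bow = line_to_bow("graph trees graph unknown", word2id)
print(bow)  # [(1, 2), (2, 1)]
```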

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

save(*args, **kwargs)

Saves corpus in-memory state.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save a corpus in the GibbsLda++ format.

Warning

This function is automatically called by gensim.corpora.lowcorpus.LowCorpus.serialize(). Don’t call it directly; call gensim.corpora.lowcorpus.LowCorpus.serialize() instead.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, int)) – Corpus in BoW format.

  • id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from corpus.

  • metadata (bool, optional) – THIS PARAMETER WILL BE IGNORED.

Returns

List of byte offsets, one per document, in the resulting file; can be used with docbyoffset().

Return type

list of int
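The idea of per-document byte offsets can be sketched without gensim: record the file position before writing each document, then seek to a stored offset to read that document directly. The data and variable names below are assumptions for illustration, not the library's implementation.

```python
import os
import tempfile

docs = [["human", "interface"], ["graph", "trees", "graph"]]  # toy data for this sketch

offsets = []
with tempfile.NamedTemporaryFile("wb", suffix=".low", delete=False) as fout:
    fout.write(("%d\n" % len(docs)).encode("utf8"))  # header line: number of documents
    for words in docs:
        offsets.append(fout.tell())  # byte position where this document starts
        fout.write((" ".join(words) + "\n").encode("utf8"))

# Random access: seek straight to the second document and read one line.
with open(fout.name, "rb") as fin:
    fin.seek(offsets[1])
    second = fin.readline().decode("utf8").split()

os.remove(fout.name)
print(offsets, second)
```

Here the header `2\n` occupies two bytes, so the first document starts at offset 2, which is consistent with the docbyoffset() example above, where offset 2 is the start of the second line.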

classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)

Serialize corpus with offset metadata, which allows direct indexing after loading.

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of iterable of (int, float)) – Corpus in BoW format.

  • id2word (dict of (int, str), optional) – Mapping id -> word.

  • index_fname (str, optional) – Where to save resulting index, if None - store index to fname.index.

  • progress_cnt (int, optional) – Number of documents after which progress info is printed.

  • labels (bool, optional) – If True - ignore first column (class labels).

  • metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.

Examples

>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]
gensim.corpora.lowcorpus.split_on_space(s)

Split line by spaces, used in gensim.corpora.lowcorpus.LowCorpus.

Parameters

s (str) – Some line.

Returns

List of tokens from s.

Return type

list of str
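As a rough illustration of the tokenizer contract (a str in, a list of str out), the sketch below shows an approximation of splitting on spaces alongside a hypothetical custom tokenizer of the kind that could be passed as line2words. Both function names are invented for this example, and the first is only assumed to behave like split_on_space() for simple input.

```python
def split_on_space_sketch(s):
    """Approximation for illustration: strip the line, split on blanks, drop empties."""
    return [word for word in s.strip().split(" ") if word]


def split_and_lowercase(s):
    """Hypothetical custom tokenizer with the same (str) -> list of str contract."""
    return [word.lower() for word in s.strip().split(" ") if word]


tokens = split_on_space_sketch("  Human  machine interface\n")
lowered = split_and_lowercase("  Human  machine interface\n")
print(tokens)
print(lowered)
```

Dropping empty strings keeps runs of consecutive blanks from producing empty tokens.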