corpora.mmcorpus – Corpus in Matrix Market format
Corpus in the Matrix Market format.
class gensim.corpora.mmcorpus.MmCorpus(fname)
Bases: gensim.corpora._mmreader.MmReader, gensim.corpora.indexedcorpus.IndexedCorpus
Corpus serialized using the sparse coordinate Matrix Market format.
Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the matrix rows (~documents).
Notable instance attributes:
num_docs
Number of documents in the Matrix Market file.
- Type
int
num_terms
Number of features (terms, topics).
- Type
int
num_nnz
Number of non-zero elements in the sparse MM matrix.
- Type
int
Notes
The file is read into memory one document at a time, not the whole matrix at once, unlike e.g. scipy.io.mmread and other implementations. This allows you to process corpora which are larger than the available RAM, in a streamed manner.
Example
>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>> for document in corpus:
...     pass
- Parameters
fname ({str, file-like object}) – Path to file in MM format or a file-like object that supports seek() (e.g. a compressed file opened by smart_open).
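The streamed reading described in the Notes can be illustrated without gensim. The following is a minimal sketch, not gensim's actual reader: `stream_mm_documents` is a hypothetical helper, and it assumes a coordinate-format file whose entries are sorted by document id with 1-based indices. It holds only one document in memory at a time.

```python
import io


def stream_mm_documents(lines):
    """Yield one sparse document (list of (term_id, value)) at a time.

    Hypothetical helper for illustration; assumes entries are sorted by
    document id and use 1-based indices, as in the Matrix Market format.
    """
    lines = iter(lines)
    # Skip the banner and '%' comment lines, then read "rows cols nnz".
    for line in lines:
        if not line.startswith("%"):
            num_docs, num_terms, num_nnz = (int(x) for x in line.split())
            break
    else:
        return  # no size line found, nothing to yield

    current_id, document = 0, []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        docid, termid, value = int(parts[0]) - 1, int(parts[1]) - 1, float(parts[2])
        # A new document id means the previous document is complete;
        # documents with no entries are emitted as empty lists.
        while current_id < docid:
            yield document
            current_id, document = current_id + 1, []
        document.append((termid, value))
    # Flush the last document and any trailing empty ones.
    while current_id < num_docs:
        yield document
        current_id, document = current_id + 1, []


SAMPLE = """%%MatrixMarket matrix coordinate real general
3 3 3
1 2 0.3
1 3 0.1
3 3 0.3
"""

docs = list(stream_mm_documents(io.StringIO(SAMPLE)))
print(docs)
```

In this sample the second document has no entries, so it comes back as an empty list. A real reader would also need to handle transposed input and malformed lines.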
docbyoffset(self, offset)
Get the document at file offset `offset` (in bytes).
- Parameters
offset (int) – File offset, in bytes, of the desired document.
- Returns
Document in sparse bag-of-words format.
- Return type
list of (int, number)
input
- Type
object
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to the file that contains the needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
num_docs
- Type
long long
num_nnz
- Type
long long
num_terms
- Type
long long
save(*args, **kwargs)
Save the corpus's in-memory state.
Warning
This saves only the "state" of a corpus class, not the corpus data!
To save the data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
static save_corpus(fname, corpus, id2word=None, progress_cnt=1000, metadata=False)
Save a corpus to disk in the sparse coordinate Matrix Market format.
- Parameters
fname (str) – Path to file.
corpus (iterable of list of (int, number)) – Corpus in Bow format.
id2word (dict of (int, str), optional) – Mapping between word_id -> word. Used to retrieve the total vocabulary size if provided. Otherwise, the total vocabulary size is estimated based on the highest feature id encountered in corpus.
progress_cnt (int, optional) – How often to report (log) progress.
metadata (bool, optional) – If True, also write out additional metadata.
Warning
This function is called automatically by serialize; don't call it directly, call serialize instead.
Example
>>> from gensim.corpora.mmcorpus import MmCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = MmCorpus(datapath('test_mmcorpus_with_index.mm'))
>>>
>>> MmCorpus.save_corpus("random", corpus)  # Do not do it, use `serialize` instead.
[97, 121, 169, 201, 225, 249, 258, 276, 303]
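The id2word rule described in the parameters above (use the mapping's size if given, otherwise derive the vocabulary size from the highest feature id seen) can be sketched as follows. `estimate_num_terms` is a hypothetical helper written for illustration only; it is not part of gensim, and the library's internal logic may differ in detail.

```python
def estimate_num_terms(corpus, id2word=None):
    """Return the vocabulary size for a BoW corpus.

    Hypothetical helper: uses len(id2word) when a mapping is supplied,
    otherwise 1 + the highest feature id encountered in the corpus.
    """
    if id2word is not None:
        return len(id2word)
    max_id = -1  # stays -1 for an empty corpus, giving 0 terms
    for document in corpus:
        for term_id, _value in document:
            max_id = max(max_id, term_id)
    return max_id + 1


bow_corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)]]
estimated = estimate_num_terms(bow_corpus)
from_mapping = estimate_num_terms(bow_corpus, id2word={0: "a", 1: "b", 2: "c", 3: "d"})
print(estimated, from_mapping)
```

Without a mapping, the estimate misses any terms whose ids are higher than those present in the corpus, which is why supplying id2word gives a more reliable size.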
classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize the corpus with offset metadata, which allows direct indexing of documents after loading.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True, ignore the first column (class labels).
metadata (bool, optional) – If True, ensure that serialize will write out article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve document no. 1 (indexing is zero-based)
[(1, 0.1)]
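To show why storing byte offsets enables random access, here is a minimal sketch of the idea using only the standard library. It is an illustration under simplifying assumptions, not gensim's implementation: `write_with_offsets` and `doc_by_offset` are hypothetical helpers, the file has no Matrix Market header, and an empty document is marked with an offset of -1.

```python
import os
import tempfile


def write_with_offsets(path, corpus):
    """Write coordinate lines and return the byte offset of each document.

    Hypothetical helper; empty documents get offset -1 (an assumption
    of this sketch), and no header is written.
    """
    offsets = []
    with open(path, "wb") as fout:
        for docno, document in enumerate(corpus, start=1):
            offsets.append(fout.tell() if document else -1)
            for term_id, value in document:
                # Matrix Market indices are 1-based.
                fout.write(f"{docno} {term_id + 1} {value}\n".encode("utf8"))
    return offsets


def doc_by_offset(path, offset):
    """Seek to `offset` and read lines until the document id changes."""
    if offset == -1:
        return []
    document, first_id = [], None
    with open(path, "rb") as fin:
        fin.seek(offset)
        for raw in fin:
            docid, term_id, value = raw.decode("utf8").split()
            if first_id is None:
                first_id = docid
            elif docid != first_id:
                break  # reached the next document
            document.append((int(term_id) - 1, float(value)))
    return document


corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sketch.mm")
    offsets = write_with_offsets(path, corpus)
    second = doc_by_offset(path, offsets[1])
print(second)
```

Because each lookup is a single seek followed by a short sequential read, retrieving one document does not require scanning the documents before it.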
skip_headers(self, input_file)
Skip file headers that appear before the first document.
- Parameters
input_file (iterable of str) – Iterable taken from file in MM format.
transposed
- Type
bool