`corpora._mmreader` – Read corpus in the Matrix Market format

Reader for corpus in the Matrix Market format.

class gensim.corpora._mmreader.MmReader(input, transposed=True)

Bases: object

Matrix market file reader (fast Cython version), used internally in MmCorpus.

Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the rows (~documents).

num_docs

Number of documents in the Matrix Market file.

Type: int

num_terms

Number of terms.

Type: int

num_nnz

Number of non-zero entries in the matrix.

Type: int

Notes

Note that the file is read into memory one document at a time, not the whole matrix at once (unlike e.g. scipy.io.mmread and other implementations). This allows us to process corpora which are larger than the available RAM.
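The streaming behaviour described in this note can be illustrated with a small pure-Python sketch. It is not the Cython implementation: `iter_documents` and `SAMPLE` are hypothetical names, and the sketch assumes the transposed layout (doc_id, term_id, value), 1-based ids in the file, and entries already grouped by document. Unlike a full reader, it does not emit empty documents for ids missing from the file.

```python
import io
from itertools import groupby

# Hypothetical sample file: 2 documents, 3 terms, 3 non-zero entries.
SAMPLE = """%%MatrixMarket matrix coordinate real general
% a comment line
2 3 3
1 1 1.0
1 3 2.0
2 2 0.5
"""


def iter_documents(lines):
    """Yield (doc_id, bag_of_words) pairs, holding one document in memory at a time."""
    # Drop blank lines and '%' comment lines; split the rest lazily.
    entries = (line.split() for line in lines
               if line.strip() and not line.startswith('%'))
    next(entries, None)  # discard the size line (rows, columns, non-zeros)
    # Matrix Market ids are 1-based; convert to 0-based ids.
    keyed = ((int(d) - 1, int(t) - 1, float(v)) for d, t, v in entries)
    for doc_id, group in groupby(keyed, key=lambda entry: entry[0]):
        yield doc_id, [(term_id, value) for _, term_id, value in group]


docs = list(iter_documents(io.StringIO(SAMPLE)))
print(docs)  # [(0, [(0, 1.0), (2, 2.0)]), (1, [(1, 0.5)])]
```

Because everything is built from generators, only the lines of the current document are materialised, which is the property that makes corpora larger than RAM workable.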

Parameters

input ({str, file-like object}) – Path to the input file in MM format, or a file-like object that supports seek() (e.g. smart_open objects).
transposed (bool, optional) – If True, each line is read as doc_id, term_id, value; if False, as term_id, doc_id, value.
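As an illustration of what the transposed flag changes, here is a minimal sketch of parsing a single coordinate line. `parse_entry` is a hypothetical helper, not part of gensim, and the conversion from 1-based to 0-based ids is an assumption of the sketch.

```python
def parse_entry(line, transposed=True):
    """Return (doc_id, term_id, value) for one Matrix Market coordinate line."""
    first, second, value = line.split()
    # Matrix Market ids are 1-based; the sketch converts them to 0-based.
    first, second = int(first) - 1, int(second) - 1
    if transposed:
        doc_id, term_id = first, second   # line is: doc_id term_id value
    else:
        term_id, doc_id = first, second   # line is: term_id doc_id value
    return doc_id, term_id, float(value)


as_doc_first = parse_entry("3 7 2.5")                     # (2, 6, 2.5)
as_term_first = parse_entry("3 7 2.5", transposed=False)  # (6, 2, 2.5)
```

The same line therefore describes a different matrix cell depending on the flag, so it has to match how the file was written.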

docbyoffset(self, offset)

Get the document stored at the given byte offset in the file.

Parameters: offset (int) – File offset, in bytes, of the desired document.
Returns: Document in sparse bag-of-words format.
Return type: list of (int, float)
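Offset-based access can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's code: `index_offsets`, `doc_at_offset`, and `DATA` are hypothetical names, the file is assumed to use the transposed layout with entries grouped by document, and the offsets are assumed to have been recorded in an earlier pass over the same file.

```python
import io

# Hypothetical in-memory file: 2 documents, 3 terms, 3 non-zero entries.
DATA = (b"%%MatrixMarket matrix coordinate real general\n"
        b"2 3 3\n"
        b"1 1 1.0\n"
        b"1 3 2.0\n"
        b"2 2 0.5\n")


def index_offsets(fileobj):
    """Record the byte offset at which each document's first entry starts."""
    fileobj.seek(0)
    offsets, last_doc, seen_size_line = [], None, False
    while True:
        pos = fileobj.tell()
        line = fileobj.readline()
        if not line:
            break
        if not line.strip() or line.startswith(b'%'):
            continue  # blank or comment line
        if not seen_size_line:
            seen_size_line = True  # first non-comment line holds the sizes
            continue
        doc_id = int(line.split()[0])
        if doc_id != last_doc:
            offsets.append(pos)
            last_doc = doc_id
    return offsets


def doc_at_offset(fileobj, offset):
    """Seek to `offset` and read entries until the document id changes."""
    fileobj.seek(offset)
    doc, current = [], None
    for line in fileobj:
        if not line.strip():
            continue
        doc_id, term_id, value = line.split()
        if current is None:
            current = doc_id
        elif doc_id != current:
            break
        doc.append((int(term_id) - 1, float(value)))  # 0-based term ids
    return doc


f = io.BytesIO(DATA)
offsets = index_offsets(f)
second_doc = doc_at_offset(f, offsets[1])
```

Seeking directly to a stored offset avoids rescanning the file from the start, which is what makes random access by document practical on large corpora.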

input

Type: object

num_docs

Type: long long

num_nnz

Type: long long

num_terms

Type: long long

skip_headers(self, input_file)

Skip file headers that appear before the first document.

Parameters: input_file (iterable of str) – Iterable over the lines of a file in MM format.
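A minimal sketch of header skipping is shown below. It assumes that the headers consist of the `%`-prefixed lines plus the single size line that follows them; `skip_to_first_entry` and `HEADER_SAMPLE` are hypothetical names, and returning the parsed sizes is a convenience of the sketch rather than documented behaviour.

```python
HEADER_SAMPLE = """%%MatrixMarket matrix coordinate real general
% generated for illustration
2 3 3
1 1 1.0
1 3 2.0
2 2 0.5"""


def skip_to_first_entry(lines):
    """Advance the iterator past comments and the size line; return the sizes."""
    for line in lines:
        if line.startswith('%'):
            continue  # banner or comment line
        # First non-comment line: three integers describing the matrix.
        return tuple(int(field) for field in line.split())
    raise ValueError("no size line found before end of input")


lines = iter(HEADER_SAMPLE.splitlines())
sizes = skip_to_first_entry(lines)   # (2, 3, 3)
first_entry = next(lines)            # "1 1 1.0"
```

After the call, the iterator is positioned on the first data line, so a caller can continue reading entries from the same iterator.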

transposed

Type: bool
