corpora._mmreader – Read corpus in the Matrix Market format
Reader for corpora stored in the Matrix Market format.
-
class gensim.corpora._mmreader.MmReader(input, transposed=True)
Bases: object
Matrix Market file reader (fast Cython version), used internally in MmCorpus.
Wraps a term-document matrix on disk (in Matrix Market format) and presents it as an object that supports iteration over its rows (~documents).
-
num_docs
Number of documents in the Matrix Market file.
- Type
int
-
num_terms
Number of terms.
- Type
int
-
num_nnz
Number of non-zero entries in the matrix.
- Type
int
Notes
Note that the file is read into memory one document at a time, not the whole matrix at once (unlike e.g. scipy.io.mmread and other implementations). This allows us to process corpora which are larger than the available RAM.
- Parameters
input ({str, file-like object}) – Path to the input file in MM format or a file-like object that supports seek() (e.g. smart_open objects).
transposed (bool, optional) – If True, each line is read as doc_id, term_id, value; if False, as term_id, doc_id, value.
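As a rough illustration of the one-document-at-a-time reading described in the notes, the following pure-Python sketch streams a small Matrix Market file held in memory. It is not the Cython implementation: the function name iter_mm_documents and the sample data are hypothetical, and the sketch assumes that entries are already grouped by document and does not emit empty documents.

```python
import io


def iter_mm_documents(lines, transposed=True):
    """Yield (doc_id, [(term_id, value), ...]), keeping one document in memory at a time."""
    current_id, current_doc = None, []
    seen_size_line = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # banner or comment line
        if not seen_size_line:
            seen_size_line = True  # the size line ("rows cols nnz") carries no entries
            continue
        first, second, value = line.split()
        if not transposed:
            first, second = second, first  # line was term_id, doc_id, value
        doc_id, term_id = int(first) - 1, int(second) - 1  # file indices are 1-based
        if doc_id != current_id:
            if current_id is not None:
                yield current_id, current_doc  # previous document is complete
            current_id, current_doc = doc_id, []
        current_doc.append((term_id, float(value)))
    if current_id is not None:
        yield current_id, current_doc  # last document in the file


# Hypothetical sample: 2 documents, 3 terms, 3 non-zero entries.
sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "2 3 3\n"
    "1 1 1.0\n"
    "1 3 2.0\n"
    "2 2 0.5\n"
)
docs = list(iter_mm_documents(sample))
```

Because the function is a generator, only the document currently being assembled is held in memory, which is the property that lets a reader of this kind handle corpora larger than RAM.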
-
docbyoffset(self, offset)
Get the document stored at the given file offset (in bytes).
- Parameters
offset (int) – File offset, in bytes, of the desired document.
- Returns
Document in sparse bag-of-words format.
- Return type
list of (int, float)
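The idea of fetching a document by byte offset can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: doc_at_offset is a hypothetical helper, the offset is assumed to point at the first entry line of a document, and the file is assumed to contain no blank lines.

```python
import io


def doc_at_offset(fileobj, offset, transposed=True):
    """Return the document whose first entry line starts at byte `offset`."""
    fileobj.seek(offset)  # requires a seekable binary file-like object
    doc_id, document = None, []
    for raw in fileobj:
        first, second, value = raw.decode("utf8").split()
        if not transposed:
            first, second = second, first  # line was term_id, doc_id, value
        this_id = int(first) - 1  # file indices are 1-based
        if doc_id is None:
            doc_id = this_id
        elif this_id != doc_id:
            break  # reached the first line of the next document
        document.append((int(second) - 1, float(value)))
    return document


# Hypothetical in-memory file, assembled in pieces so the offsets are known.
header = b"%%MatrixMarket matrix coordinate real general\n2 3 3\n"
doc0 = b"1 1 1.0\n1 3 2.0\n"
doc1 = b"2 2 0.5\n"
mm_file = io.BytesIO(header + doc0 + doc1)

first_doc = doc_at_offset(mm_file, len(header))
second_doc = doc_at_offset(mm_file, len(header) + len(doc0))
```

Storing one such offset per document is what makes random access possible without scanning the file from the start.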
-
input
- Type
object
-
num_docs
- Type
long long
-
num_nnz
- Type
long long
-
num_terms
- Type
long long
-
skip_headers(self, input_file)
Skip file headers that appear before the first document.
- Parameters
input_file (iterable of str) – Iterable taken from file in MM format.
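A minimal sketch of header skipping, assuming the usual Matrix Market layout in which comment lines start with % and are followed by a single size line. The name skip_mm_headers and the returned tuple are illustrative choices and are not taken from the library.

```python
def skip_mm_headers(lines):
    """Advance the iterator `lines` past the comments and the size line.

    Returns the three integers on the size line, or None if there is none.
    """
    for line in lines:
        if line.startswith('%'):
            continue  # banner or comment line
        return tuple(int(x) for x in line.split())
    return None


# Pass an iterator (not a list) so its position is kept after the call.
mm_lines = iter([
    "%%MatrixMarket matrix coordinate real general\n",
    "% a comment line\n",
    "2 3 3\n",
    "1 1 1.0\n",
])
size = skip_mm_headers(mm_lines)
first_entry = next(mm_lines)  # the iterator now points at the first document
```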
-
transposed
- Type
bool
-