corpora.ucicorpus – Corpus in UCI format
Corpus in UCI format.
-
class
gensim.corpora.ucicorpus.
UciCorpus
(fname, fname_vocab=None)¶ Bases:
gensim.corpora.ucicorpus.UciReader
,gensim.corpora.indexedcorpus.IndexedCorpus
Corpus in the UCI bag-of-words format.
- Parameters
fname (str) – Path to corpus in UCI format.
fname_vocab (str, optional) – Path to the vocabulary file.
Examples
>>> from gensim.corpora import UciCorpus
>>> from gensim.test.utils import datapath
>>>
>>> corpus = UciCorpus(datapath('testcorpus.uci'))
>>> for document in corpus:
...     pass
-
create_dictionary
()¶ Generate
gensim.corpora.dictionary.Dictionary
directly from the corpus and vocabulary data.
- Returns
Dictionary, based on corpus.
- Return type
gensim.corpora.dictionary.Dictionary
Examples
>>> from gensim.corpora.ucicorpus import UciCorpus
>>> from gensim.test.utils import datapath
>>> ucc = UciCorpus(datapath('testcorpus.uci'))
>>> dictionary = ucc.create_dictionary()
-
docbyoffset
(self, offset)¶ Get the document at file offset offset (in bytes).
- Parameters
offset (int) – File offset, in bytes, of the desired document.
- Returns
Document in sparse bag-of-words format.
- Return type
list of (int, float)
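The idea behind offset-based access can be sketched with the standard library alone. This is an illustrative sketch, not gensim's implementation: `build_index` and `doc_by_offset` are hypothetical helpers, the three header lines are left out for brevity, and the body is assumed to hold `doc_id term_id count` lines grouped by document with 1-based ids.

```python
import io


def build_index(fin):
    """Record the byte offset at which each document's first line starts."""
    offsets = []
    previous_doc = None
    while True:
        position = fin.tell()
        line = fin.readline()
        if not line:
            break
        doc_id = int(line.decode("ascii").split()[0])
        if doc_id != previous_doc:
            offsets.append(position)
            previous_doc = doc_id
    return offsets


def doc_by_offset(fin, offset):
    """Seek to `offset` and read lines until the document id changes."""
    fin.seek(offset)
    document = []
    current_doc = None
    for line in fin:
        doc_id, term_id, count = line.decode("ascii").split()
        if current_doc is None:
            current_doc = doc_id
        elif doc_id != current_doc:
            break
        # Ids in the file are 1-based; the returned term ids are 0-based.
        document.append((int(term_id) - 1, float(count)))
    return document


# Body of a tiny corpus with three documents (headers omitted in this sketch).
body = io.BytesIO(b"1 1 2\n1 3 1\n2 2 1\n3 1 1\n3 4 3\n")
offsets = build_index(body)
second = doc_by_offset(body, offsets[1])
third = doc_by_offset(body, offsets[2])
```

Storing one offset per document is what makes random access cheap: reading document `i` costs one seek plus the lines of that document, instead of a scan from the start of the file.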
-
input
¶ object
- Type
input
-
classmethod
load
(fname, mmap=None)¶ Load an object previously saved using
save()
from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
-
num_docs
¶ ‘long long’
- Type
num_docs
-
num_nnz
¶ ‘long long’
- Type
num_nnz
-
num_terms
¶ ‘long long’
- Type
num_terms
-
save
(*args, **kwargs)¶ Saves corpus in-memory state.
Warning
This saves only the “state” of a corpus class, not the corpus data!
For saving data use the serialize method of the output format you’d like to use (e.g.
gensim.corpora.mmcorpus.MmCorpus.serialize()
).
-
static
save_corpus
(fname, corpus, id2word=None, progress_cnt=10000, metadata=False)¶ Save a corpus in the UCI Bag-of-Words format.
Warning
This function is called automatically by gensim.corpora.ucicorpus.UciCorpus.serialize(). Don’t call it directly; call gensim.corpora.ucicorpus.UciCorpus.serialize() instead.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
id2word ({dict of (int, str),
gensim.corpora.dictionary.Dictionary
}, optional) – Mapping between words and their ids. If None, it will be inferred from corpus.
progress_cnt (int, optional) – Progress counter; a log message is written every progress_cnt documents.
metadata (bool, optional) – THIS PARAMETER WILL BE IGNORED.
Notes
There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
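The layout of the vocabulary file can be illustrated as follows. This is a sketch under stated assumptions, not the library's code: `write_vocab` is a hypothetical helper, the file is assumed to hold one word per line with line `n` (0-based) belonging to word id `n`, as in the UCI bag-of-words vocabulary files, and the `"---"` filler for ids without a word is an assumption made for this example.

```python
import os
import tempfile


def write_vocab(fname, id2word, missing="---"):
    """Write `fname + '.vocab'` with one word per line, ordered by word id.

    The `missing` filler keeps line numbers aligned with ids when an id
    has no word; its value is an assumption made for this sketch.
    """
    num_terms = max(id2word) + 1 if id2word else 0
    vocab_fname = fname + ".vocab"
    with open(vocab_fname, "w", encoding="utf-8") as fout:
        for term_id in range(num_terms):
            fout.write(id2word.get(term_id, missing) + "\n")
    return vocab_fname


id2word = {0: "human", 1: "computer", 3: "graph"}  # id 2 is deliberately absent
with tempfile.TemporaryDirectory() as tmpdir:
    vocab_path = write_vocab(os.path.join(tmpdir, "corpus.uci"), id2word)
    with open(vocab_path, encoding="utf-8") as fin:
        vocab_lines = fin.read().splitlines()
```

Keeping a line for every id, including unused ones, means a reader can recover the id of a word from its line number without storing the id explicitly.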
-
classmethod
serialize
(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)¶ Serialize the corpus with offset metadata, which allows direct indexed access after loading.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save resulting index, if None - store index to fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True - ignore first column (class labels).
metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]
-
skip_headers
(input_file)¶ Skip headers in input_file.
- Parameters
input_file (file) – File object.
-
transposed
¶ ‘bool’
- Type
transposed
-
class
gensim.corpora.ucicorpus.
UciReader
(input)¶ Bases:
gensim.corpora._mmreader.MmReader
Reader of UCI format for
gensim.corpora.ucicorpus.UciCorpus.
- Parameters
input (str) – Path to file in UCI format.
-
docbyoffset
(self, offset)¶ Get the document at file offset offset (in bytes).
- Parameters
offset (int) – File offset, in bytes, of the desired document.
- Returns
Document in sparse bag-of-words format.
- Return type
list of (int, float)
-
input
¶ object
- Type
input
-
num_docs
¶ ‘long long’
- Type
num_docs
-
num_nnz
¶ ‘long long’
- Type
num_nnz
-
num_terms
¶ ‘long long’
- Type
num_terms
-
skip_headers
(input_file)¶ Skip headers in input_file.
- Parameters
input_file (file) – File object.
-
transposed
¶ ‘bool’
- Type
transposed
-
class
gensim.corpora.ucicorpus.
UciWriter
(fname)¶ Bases:
gensim.matutils.MmWriter
Writer of UCI format for
gensim.corpora.ucicorpus.UciCorpus.
Notes
This corpus format is identical to the Matrix Market format (http://math.nist.gov/MatrixMarket/formats.html), except for the file headers. There is no format line, and the first three lines of the file contain num_docs, num_terms, and num_nnz, one value per line.
- Parameters
fname (str) – Path to output file.
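The header layout described in the Notes can be illustrated with a small parser. This is a minimal sketch that does not use gensim: `parse_uci` is a hypothetical helper, and the body lines are assumed to be `doc_id term_id count` triples with 1-based ids, as in the UCI bag-of-words data sets.

```python
import io


def parse_uci(stream):
    """Parse UCI bag-of-words text into (num_docs, num_terms, num_nnz, docs)."""
    # The first three lines are the header: one integer per line.
    num_docs = int(stream.readline())
    num_terms = int(stream.readline())
    num_nnz = int(stream.readline())

    docs = [[] for _ in range(num_docs)]
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines
        doc_id, term_id, count = line.split()
        # Convert the 1-based ids in the file to 0-based ids.
        docs[int(doc_id) - 1].append((int(term_id) - 1, int(count)))
    return num_docs, num_terms, num_nnz, docs


sample = "3\n4\n5\n1 1 2\n1 3 1\n2 2 1\n3 1 1\n3 4 3\n"
num_docs, num_terms, num_nnz, docs = parse_uci(io.StringIO(sample))
```

Here the header announces 3 documents, 4 terms, and 5 non-zero entries, and the five body lines that follow account for exactly those entries.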
-
FAKE_HEADER
= b' \n'¶
-
HEADER_LINE
= b'%%MatrixMarket matrix coordinate real general\n'¶
-
MAX_HEADER_LENGTH
= 20¶
-
close
()¶ Close self.fout file.
-
fake_headers
(num_docs, num_terms, num_nnz)¶ Write “fake” headers to file, to be rewritten once we’ve scanned the entire corpus.
- Parameters
num_docs (int) – Number of documents in corpus.
num_terms (int) – Number of terms in corpus.
num_nnz (int) – Number of non-zero elements in corpus.
-
update_headers
(num_docs, num_terms, num_nnz)¶ Update headers with actual values.
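The placeholder-then-rewrite pattern behind fake_headers and update_headers can be sketched as below. This is an illustration rather than the library's code: it assumes each placeholder line is padded to MAX_HEADER_LENGTH bytes plus a newline, and the names `PLACEHOLDER`, `write_placeholders`, and this standalone `update_headers` are hypothetical.

```python
import io

MAX_HEADER_LENGTH = 20  # same value as the documented class attribute
# Assumed fixed-width blank line reserved for each header value.
PLACEHOLDER = b" " * MAX_HEADER_LENGTH + b"\n"


def write_placeholders(fout, count=3):
    """Reserve space for the header values before the totals are known."""
    for _ in range(count):
        fout.write(PLACEHOLDER)


def update_headers(fout, num_docs, num_terms, num_nnz):
    """Overwrite the reserved lines in place with the final values."""
    offset = 0
    for value in (num_docs, num_terms, num_nnz):
        encoded = str(value).encode("ascii")
        if len(encoded) > MAX_HEADER_LENGTH:
            raise ValueError("header value too large for the reserved space")
        fout.seek(offset)
        fout.write(encoded)  # the rest of the line stays blank padding
        offset += len(PLACEHOLDER)


buf = io.BytesIO()
write_placeholders(buf)
buf.write(b"1 1 2\n2 2 1\n")  # body is written in a single pass
update_headers(buf, num_docs=2, num_terms=2, num_nnz=2)

lines = buf.getvalue().split(b"\n")
header = [int(line.decode("ascii").strip()) for line in lines[:3]]
```

Reserving fixed-width lines lets the writer stream the corpus once and fill in the totals afterwards, without shifting the body or buffering it in memory.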
-
static
write_corpus
(fname, corpus, progress_cnt=1000, index=False)¶ Write the corpus to a file.
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, int)) – Corpus in BoW format.
progress_cnt (int, optional) – Progress counter, write log message each progress_cnt documents.
index (bool, optional) – If True, return the offsets; otherwise return nothing.
- Returns
Sequence of offsets to documents (in bytes), only if index=True.
- Return type
list of int
-
write_headers
()¶ Write blank header lines. Will be updated later, once corpus stats are known.
-
write_vector
(docno, vector)¶ Write a single sparse vector to the file.
- Parameters
docno (int) – Number of document.
vector (list of (int, number)) – Document in BoW format.
- Returns
Max word index in vector and len of vector. If vector is empty, return (-1, 0).
- Return type
(int, int)
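The return contract of write_vector can be illustrated with a standalone function. This is a sketch under assumptions, not the library's method: it takes the output stream as an explicit argument, sorts the vector by term id, and writes `doc_id term_id weight` lines with 1-based ids.

```python
import io


def write_vector(fout, docno, vector):
    """Write one sparse document and return (max term id, number of entries)."""
    vector = sorted(vector)  # order entries by term id
    for term_id, weight in vector:
        # Ids are written 1-based; `docno` and `term_id` are 0-based.
        fout.write(f"{docno + 1} {term_id + 1} {weight}\n")
    if not vector:
        return -1, 0
    return vector[-1][0], len(vector)


out = io.StringIO()
result = write_vector(out, 0, [(3, 1), (0, 2)])
empty_result = write_vector(out, 1, [])
written = out.getvalue()
```

A caller can fold these return values into running totals: the largest term id seen plus one gives the number of terms, and the summed lengths give the number of non-zero entries for the header.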