corpora.malletcorpus – Corpus in Mallet format
Corpus in Mallet format.
class gensim.corpora.malletcorpus.MalletCorpus(fname, id2word=None, metadata=False)
Bases: gensim.corpora.lowcorpus.LowCorpus
Corpus handles input in Mallet format.
Format description
One file, one instance per line. The data is assumed to be in the following format:
[URL] [language] [text of the page...]
Or, more generally:
[document #1 id] [label] [text of the document...]
[document #2 id] [label] [text of the document...]
...
[document #N id] [label] [text of the document...]
Note that Gensim ignores the language/label field and uses __unknown__ as its default value.
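The line layout described above can be illustrated with a small standalone parser. This is a minimal sketch in plain Python, not Gensim's implementation: the function name `parse_mallet_line` is hypothetical, and it assumes the three fields are separated by whitespace.

```python
def parse_mallet_line(line):
    """Split one Mallet-format line into (doc_id, label, tokens)."""
    # Split on whitespace at most twice: id, label, and the rest of the text.
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError("expected at least '[id] [label]', got: %r" % line)
    doc_id, label = parts[0], parts[1]
    # The text field may be missing for a document with no words.
    tokens = parts[2].split() if len(parts) == 3 else []
    return doc_id, label, tokens


print(parse_mallet_line("doc1 en human computer interface"))
```

Splitting at most twice keeps the whole document text in the third field, however many words it contains.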
Examples
>>> from gensim.test.utils import get_tmpfile, common_texts
>>> from gensim.corpora import MalletCorpus
>>> from gensim.corpora import Dictionary
>>>
>>> # Prepare needed data
>>> dictionary = Dictionary(common_texts)
>>> corpus = [dictionary.doc2bow(doc) for doc in common_texts]
>>>
>>> # Write corpus in Mallet format to disk
>>> output_fname = get_tmpfile("corpus.mallet")
>>> MalletCorpus.serialize(output_fname, corpus, dictionary)
>>>
>>> # Read corpus
>>> loaded_corpus = MalletCorpus(output_fname)
- Parameters
fname (str) – Path to file in Mallet format.
id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from fname.
metadata (bool, optional) – If True, return additional information ("document id" and "lang") when you call line2doc(), __iter__() or docbyoffset().
docbyoffset(offset)
Get the document stored in file by offset position.
- Parameters
offset (int) – Offset (in bytes) to the beginning of the document.
- Returns
Document in BoW format (plus "document_id" and "lang" if metadata=True).
- Return type
list of (int, int)
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> data = MalletCorpus(datapath("testcorpus.mallet"))
>>> data.docbyoffset(1)  # end of first line
[(3, 1), (4, 1)]
>>> data.docbyoffset(4)  # start of second line
[(4, 1)]
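The idea behind offset-based access can be sketched without Gensim: record the byte position where each line starts while writing, then seek to that position to read a single document. The helper `read_line_at` and the sample lines below are hypothetical and only illustrate the technique.

```python
import os
import tempfile


def read_line_at(path, offset):
    """Return the line that starts at byte `offset` in the file at `path`."""
    with open(path, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").rstrip("\n")


# Write two documents and remember the byte offset where each one starts.
lines = ["0 __unknown__ human interface", "1 __unknown__ graph trees"]
offsets = []
with tempfile.NamedTemporaryFile("wb", suffix=".mallet", delete=False) as fout:
    for line in lines:
        offsets.append(fout.tell())
        fout.write((line + "\n").encode("utf8"))
    path = fout.name

# Jump straight to the second document without reading the first one.
second = read_line_at(path, offsets[1])
os.remove(path)
print(second)
```

Opening the file in binary mode keeps the offsets in bytes, which matches the unit the `offset` parameter is documented to use.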
property id2word
Get mapping between words and their ids.
line2doc(line)
Convert line into document in BoW format.
- Parameters
line (str) – Line from input file.
- Returns
Document in BoW format (plus "document_id" and "lang" if metadata=True).
- Return type
list of (int, int)
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import MalletCorpus
>>>
>>> corpus = MalletCorpus(datapath("testcorpus.mallet"))
>>> corpus.line2doc("en computer human interface")
[(3, 1), (4, 1)]
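A rough illustration of this conversion, outside Gensim, might look like the following. It is a sketch under assumptions: `line_to_bow` and `word2id` are hypothetical names, words missing from the vocabulary are simply dropped, and with `metadata=True` the result is assumed to be a `(bow, (doc_id, lang))` pair.

```python
from collections import Counter


def line_to_bow(line, word2id, metadata=False):
    """Turn '[id] [lang] [text...]' into a sorted list of (word_id, count)."""
    # Fields: document id, label/language, then the text.
    parts = line.strip().split(None, 2)
    doc_id, lang = parts[0], parts[1]
    words = parts[2].split() if len(parts) == 3 else []
    # Count only words present in the vocabulary; unknown words are dropped.
    counts = Counter(word2id[w] for w in words if w in word2id)
    bow = sorted(counts.items())
    return (bow, (doc_id, lang)) if metadata else bow


word2id = {"human": 0, "interface": 1, "graph": 2}
print(line_to_bow("doc1 en human interface human minors", word2id))
# [(0, 2), (1, 1)]
print(line_to_bow("doc1 en human interface", word2id, metadata=True))
# ([(0, 1), (1, 1)], ('doc1', 'en'))
```

Note that the first two tokens of the line are always treated as the id and the language, which is why the documented example above yields only two BoW entries for a four-token line.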
classmethod load(fname, mmap=None)
Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save()
Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
save(*args, **kwargs)
Saves corpus in-memory state.
Warning
This saves only the "state" of a corpus class, not the corpus data!
For saving data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
static save_corpus(fname, corpus, id2word=None, metadata=False)
Save a corpus in the Mallet format.
Warning
This function is automatically called by gensim.corpora.malletcorpus.MalletCorpus.serialize(). Don't call it directly; call gensim.corpora.malletcorpus.MalletCorpus.serialize() instead.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, int)) – Corpus in BoW format.
id2word ({dict of (int, str), Dictionary}, optional) – Mapping between word_ids (integers) and words (strings). If not provided, the mapping is constructed directly from corpus.
metadata (bool, optional) – If True - ????
- Returns
List of offsets in resulting file for each document (in bytes), can be used for docbyoffset().
- Return type
list of int
Notes
The document id will be generated by enumerating the corpus. That is, it will range from 0 to the number of documents in the corpus minus one.
Since Mallet has a language field in the format, this defaults to the string ‘__unknown__’. If the language needs to be saved, post-processing will be required.
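The behaviour described in these notes can be sketched in plain Python. This is an illustration only, not Gensim's code: `write_mallet` is a hypothetical helper that writes to any binary stream, numbers documents by enumeration, fills the language field with `__unknown__`, and collects the byte offset of each document.

```python
import io


def write_mallet(stream, corpus, id2word):
    """Write a BoW corpus to a binary stream; return each document's byte offset."""
    offsets = []
    for doc_id, doc in enumerate(corpus):
        words = []
        for word_id, count in doc:
            # Repeat each word as many times as it occurs in the document.
            words.extend([id2word[word_id]] * int(count))
        offsets.append(stream.tell())
        line = "%s %s %s\n" % (doc_id, "__unknown__", " ".join(words))
        stream.write(line.encode("utf8"))
    return offsets


id2word = {0: "human", 1: "interface", 2: "graph"}
corpus = [[(0, 2), (1, 1)], [(2, 1)]]
buffer = io.BytesIO()
offsets = write_mallet(buffer, corpus, id2word)
print(buffer.getvalue().decode("utf8"), end="")
print(offsets)
```

Because the text field holds repeated words rather than counts, fractional weights cannot be represented; this sketch truncates them with `int(count)`.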
classmethod serialize(fname, corpus, id2word=None, index_fname=None, progress_cnt=None, labels=None, metadata=False)
Serialize corpus with offset metadata, which allows direct indexing after loading.
- Parameters
fname (str) – Path to output file.
corpus (iterable of iterable of (int, float)) – Corpus in BoW format.
id2word (dict of (int, str), optional) – Mapping id -> word.
index_fname (str, optional) – Where to save the resulting index. If None, the index is stored in fname.index.
progress_cnt (int, optional) – Number of documents after which progress info is printed.
labels (bool, optional) – If True - ignore first column (class labels).
metadata (bool, optional) – If True - ensure that serialize will write out article titles to a pickle file.
Examples
>>> from gensim.corpora import MmCorpus
>>> from gensim.test.utils import get_tmpfile
>>>
>>> corpus = [[(1, 0.3), (2, 0.1)], [(1, 0.1)], [(2, 0.3)]]
>>> output_fname = get_tmpfile("test.mm")
>>>
>>> MmCorpus.serialize(output_fname, corpus)
>>> mm = MmCorpus(output_fname)  # `mm` document stream now has random access
>>> print(mm[1])  # retrieve the document at index 1
[(1, 0.1)]