corpora.textcorpus – Tools for building corpora with dictionaries

This module provides code scaffolding that simplifies using a built dictionary to construct bag-of-words (BoW) vectors.

Notes

Text corpora usually reside on disk, as text files in one format or another. In a common scenario, we need to build a dictionary (a word->integer id mapping), which is then used to construct sparse bag-of-words vectors (= iterable of (word_id, word_weight)).
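
To make these terms concrete, here is a minimal plain-Python sketch of a word->id mapping and the sparse vectors built from it. The helpers build_word_ids and to_bow are hypothetical names used only for illustration; they are not part of gensim, whose Dictionary class plays this role.

```python
from collections import Counter


def build_word_ids(texts):
    """Assign a consecutive integer id to each distinct word, in order of first appearance."""
    word_ids = {}
    for tokens in texts:
        for token in tokens:
            word_ids.setdefault(token, len(word_ids))
    return word_ids


def to_bow(tokens, word_ids):
    """Return a sorted list of (word_id, count) pairs, skipping words without an id."""
    counts = Counter(word_ids[t] for t in tokens if t in word_ids)
    return sorted(counts.items())


texts = [["human", "computer", "interface"], ["computer", "survey", "computer"]]
word_ids = build_word_ids(texts)
bow = [to_bow(tokens, word_ids) for tokens in texts]
print(bow)
```

Only words that occur in a document appear in its vector, which is what makes the representation sparse.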

This module provides some code scaffolding to simplify this pipeline. For example, given a corpus where each document is a separate line in a file on disk, you would override gensim.corpora.textcorpus.TextCorpus.get_texts() to read one line (= one document) at a time, process it (lowercase, tokenize, and so on) and yield it as a sequence of words.

Overriding gensim.corpora.textcorpus.TextCorpus.get_texts() is enough: you can then initialize the corpus with e.g. MyTextCorpus("mycorpus.txt.bz2") and it will behave like a corpus of sparse vectors. The __iter__() method is set up automatically, and the dictionary is automatically populated with all word->id mappings.

The resulting object can be used as input to some of the gensim models (TfidfModel, LsiModel, LdaModel, …) and serialized in any supported format (Matrix Market, SvmLight, Blei’s LDA-C format, etc.).

See also

gensim.test.test_miislita.CorpusMiislita

Good simple example.

class gensim.corpora.textcorpus.TextCorpus(input=None, dictionary=None, metadata=False, character_filters=None, tokenizer=None, token_filters=None)

Bases: gensim.interfaces.CorpusABC

Helper class to simplify the pipeline of getting BoW vectors from plain text.

Notes

This is an abstract base class: override the get_texts() and __len__() methods to match your particular input.

Given a filename (or a file-like object) in the constructor, the corpus object will be automatically initialized with a dictionary in self.dictionary and will support the __iter__() corpus method. You can use this class either by subclassing it or by constructing it with different preprocessing arguments.

The __iter__() method converts the lists of tokens produced by get_texts() to BoW format using gensim.corpora.dictionary.Dictionary.doc2bow().

get_texts() does the following:

  1. Calls getstream() to get a generator over the texts. It yields each document in turn from the underlying text file or files.

  2. For each document from the stream, calls preprocess_text() to produce a list of tokens. If metadata=True, it yields a 2-tuple with the document number as the second element.

Preprocessing consists of 0+ character_filters, a tokenizer, and 0+ token_filters.

The preprocessing consists of calling each filter in character_filters with the document text. Unicode is not guaranteed, and if desired, the first filter should convert to unicode. The output of each character filter should be another string. The output from the final filter is fed to the tokenizer, which should split the string into a list of tokens (strings). Afterwards, the list of tokens is fed through each filter in token_filters. The final output returned from preprocess_text() is the output from the final token filter.
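
The chain described above can be sketched in plain Python as follows. This is an illustration of the data flow only, not gensim's implementation; the function preprocess and the filter drop_digits are hypothetical names chosen for this example.

```python
def preprocess(text, character_filters, tokenizer, token_filters):
    """Run text through character filters, a tokenizer, then token filters."""
    for char_filter in character_filters:
        text = char_filter(text)  # str -> str
    tokens = tokenizer(text)  # str -> list of str
    for token_filter in token_filters:
        tokens = token_filter(tokens)  # list of str -> list of str
    return tokens


def drop_digits(tokens):
    """Hypothetical token filter: remove tokens made up only of digits."""
    return [t for t in tokens if not t.isdigit()]


tokens = preprocess(
    "  Topic Models in 2008  ",
    character_filters=[str.strip, str.lower],
    tokenizer=str.split,
    token_filters=[drop_digits],
)
print(tokens)
```

Because each stage only depends on the type of its input and output, filters can be added, removed, or reordered without touching the others.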

So to use this class, you can either pass in different preprocessing functions using the character_filters, tokenizer, and token_filters arguments, or you can subclass it.

If subclassing: override getstream() to take text from different input sources in different formats. Override preprocess_text() if you must provide different initial preprocessing, then call the parent TextCorpus.preprocess_text() method to apply the normal preprocessing. You can also override get_texts() in order to tag the documents (token lists) with different metadata.

The default preprocessing consists of:

  1. lower_to_unicode() - lowercase and convert to unicode (assumes utf8 encoding)

  2. deaccent() - deaccent (asciifolding)

  3. strip_multiple_whitespaces() - collapse multiple whitespaces into a single one

  4. simple_tokenize() - tokenize by splitting on whitespace

  5. remove_short() - remove words less than 3 characters long

  6. remove_stopwords() - remove stopwords
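
As a rough illustration of what the six default steps do to a document, the sketch below uses standard-library stand-ins. None of these functions is gensim's code: strip_accents, collapse_whitespace, drop_short, and drop_stopwords are hypothetical approximations, and SAMPLE_STOPWORDS is a tiny made-up list rather than gensim's stopword set.

```python
import re
import unicodedata

SAMPLE_STOPWORDS = frozenset({"the", "and", "for"})  # hypothetical, for illustration only


def strip_accents(text):
    """Drop combining marks after canonical decomposition (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def collapse_whitespace(text):
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


def drop_short(tokens, minsize=3):
    """Keep tokens that are at least minsize characters long."""
    return [t for t in tokens if len(t) >= minsize]


def drop_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


text = "The  Café\tin Zürich is OPEN"
text = collapse_whitespace(strip_accents(text.lower()))  # steps 1-3
tokens = drop_stopwords(drop_short(text.split()))        # steps 4-6
print(tokens)
```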

Parameters
  • input (str, optional) – Path to the top-level directory (or file) to traverse for corpus documents.

  • dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.

  • metadata (bool, optional) – If True, yield metadata with each document.

  • character_filters (iterable of callable, optional) – Each will be applied to the text of each document in order, and should return a single string with the modified text. For Python 2, the original text will not be unicode, so it may be useful to convert to unicode as the first character filter. If None, lower_to_unicode(), deaccent() and strip_multiple_whitespaces() are used.

  • tokenizer (callable, optional) – Tokenizer for a document. If None, simple_tokenize() is used.

  • token_filters (iterable of callable, optional) – Each will be applied to the iterable of tokens in order, and should return another iterable of tokens. These filters can add, remove, or replace tokens, or do nothing at all. If None, remove_short() and remove_stopwords() are used.

Examples

>>> from gensim.corpora.textcorpus import TextCorpus
>>> from gensim.test.utils import datapath
>>> from gensim import utils
>>>
>>>
>>> class CorpusMiislita(TextCorpus):
...     stopwords = set('for a of the and to in on'.split())
...
...     def get_texts(self):
...         for doc in self.getstream():
...             yield [word for word in utils.to_unicode(doc).lower().split() if word not in self.stopwords]
...
...     def __len__(self):
...         self.length = sum(1 for _ in self.get_texts())
...         return self.length
>>>
>>>
>>> corpus = CorpusMiislita(datapath('head500.noblanks.cor.bz2'))
>>> len(corpus)
250
>>> document = next(iter(corpus.get_texts()))

get_texts()

Generate documents from corpus.

Yields

list of str – Document as sequence of tokens (+ lineno if self.metadata)

getstream()

Generate documents from the underlying plain text collection (of one or more files).

Yields

str – Document read from plain-text file.

Notes

After the generator is exhausted, the self.length attribute is initialized.

init_dictionary(dictionary)

Initialize/update dictionary.

Parameters

dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.

Notes

If self.input is None, this method does nothing.

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

preprocess_text(text)

Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.

Parameters

text (str) – Document read from plain-text file.

Returns

List of tokens extracted from text.

Return type

list of str

sample_texts(n, seed=None, length=None)

Generate n random documents from the corpus without replacement.

Parameters
  • n (int) – Number of documents we want to sample.

  • seed (int, optional) – If specified, use it as a seed for local random generator.

  • length (int, optional) – Value to use as the corpus length (because calculating the length of a corpus can be a costly operation). If not specified, __len__() is called.

Raises

ValueError – If n is less than zero or greater than the corpus size.

Notes

Given the number of remaining documents in a corpus, we need to choose n elements. The probability for the current element to be chosen is n / remaining. If we choose it, we decrease n and move to the next element.

Yields

list of str – Sampled document as sequence of tokens.
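
The selection rule from the Notes above can be sketched in plain Python. This is a minimal illustration written for this page, not gensim's source; sample_without_replacement is a hypothetical name, and the use of random.Random(seed) as the local generator is an assumption.

```python
import random


def sample_without_replacement(stream, n, length, seed=None):
    """Yield n items from the first `length` items of stream, in stream order."""
    # The check runs lazily, on the first iteration of the generator.
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and length")
    rng = random.Random(seed)  # local generator, so global random state is untouched
    for i, item in enumerate(stream):
        if i == length or n == 0:
            break
        remaining = length - i
        # Choose the current item with probability n / remaining.
        if rng.random() * remaining < n:
            n -= 1
            yield item


sampled = list(sample_without_replacement(iter(range(100)), n=5, length=100, seed=42))
print(sampled)
```

When the number of remaining items equals the number still needed, the probability becomes 1, so exactly n items are produced in a single pass and without storing the stream in memory.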

save(*args, **kwargs)

Save the in-memory state of the corpus.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save corpus to disk.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).

In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of list of (int, number)) – Corpus in BoW format.

  • id2word (Dictionary, optional) – Dictionary of corpus.

  • metadata (bool, optional) – Write additional metadata to a separate file too?

step_through_preprocess(text)

Apply each preprocessor in turn and yield its result.

Warning

This is useful for debugging issues with the corpus preprocessing pipeline.

Parameters

text (str) – Document text read from plain-text file.

Yields

(callable, object) – Pre-processor, output from pre-processor (based on text)
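
A stepping generator of this kind can be sketched as below. This is a hypothetical illustration, not gensim's implementation: step_through and drop_short are made-up names, and the sketch assumes that every step receives the output of the previous one.

```python
def step_through(text, character_filters, tokenizer, token_filters):
    """Yield (callable, output) after each preprocessing step."""
    for char_filter in character_filters:
        text = char_filter(text)
        yield char_filter, text
    tokens = tokenizer(text)
    yield tokenizer, tokens
    for token_filter in token_filters:
        tokens = token_filter(tokens)
        yield token_filter, tokens


def drop_short(tokens):
    """Hypothetical token filter: keep tokens of at least 3 characters."""
    return [t for t in tokens if len(t) >= 3]


steps = list(step_through("An  Example Text", [str.lower], str.split, [drop_short]))
for func, output in steps:
    print(func.__name__, "->", output)
```

Printing the intermediate outputs makes it easy to see which filter removed or altered a given token.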

class gensim.corpora.textcorpus.TextDirectoryCorpus(input, dictionary=None, metadata=False, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None, lines_are_documents=False, **kwargs)

Bases: gensim.corpora.textcorpus.TextCorpus

Read documents recursively from a directory. Each file/line (depends on lines_are_documents) is interpreted as a plain text document.

Parameters
  • input (str) – Path to input file/folder.

  • dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus. If input is None, the dictionary will remain uninitialized.

  • metadata (bool, optional) – If True, yield metadata with each document.

  • min_depth (int, optional) – Minimum depth in directory tree at which to begin searching for files.

  • max_depth (int, optional) – Max depth in directory tree at which files will no longer be considered. If None - not limited.

  • pattern (str, optional) – Regex to use for file name inclusion; all files not matching this pattern will be ignored.

  • exclude_pattern (str, optional) – Regex to use for file name exclusion; all files matching this pattern will be ignored.

  • lines_are_documents (bool, optional) – If True, each line is considered a document; otherwise, each file is one document.

  • kwargs – Keyword arguments passed through to the TextCorpus constructor. See the gensim.corpora.textcorpus.TextCorpus.__init__() docstring for more details on these.

property exclude_pattern

get_texts()

Generate documents from corpus.

Yields

list of str – Document as sequence of tokens (+ lineno if self.metadata)

getstream()

Generate documents from the underlying plain text collection (of one or more files).

Yields

str – A single line if lines_are_documents is True, otherwise the entire content of one file.
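
The effect of lines_are_documents can be illustrated with a small standard-library sketch. The function stream_documents is a hypothetical stand-in rather than gensim's getstream(); stripping surrounding whitespace and reading files as UTF-8 are assumptions made for this example.

```python
import os
import tempfile


def stream_documents(filepaths, lines_are_documents=False):
    """Yield one document per line, or one document per file."""
    for path in filepaths:
        with open(path, encoding="utf8") as handle:
            if lines_are_documents:
                for line in handle:
                    yield line.strip()
            else:
                yield handle.read().strip()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "a.txt")
    with open(path, "w", encoding="utf8") as handle:
        handle.write("first document\nsecond document\n")
    per_line = list(stream_documents([path], lines_are_documents=True))
    per_file = list(stream_documents([path], lines_are_documents=False))

print(per_line)
print(per_file)
```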

init_dictionary(dictionary)

Initialize/update dictionary.

Parameters

dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.

Notes

If self.input is None, this method does nothing.

iter_filepaths()

Generate (lazily) paths to each file in the directory structure within the specified range of depths. If a filename pattern to match was given, further filter to only those filenames that match.

Yields

str – Path to file
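
The depth and pattern filtering described above might look like the following sketch. It is not gensim's code: select_filenames is a hypothetical helper, the (depth, dirpath, filenames) entries are hand-written sample data, and the sketch assumes that both depth bounds are inclusive and that patterns are matched from the start of the file name with re.match.

```python
import re


def select_filenames(entries, min_depth=0, max_depth=None, pattern=None, exclude_pattern=None):
    """Yield paths of files within the depth range that pass both regex filters."""
    include = re.compile(pattern) if pattern is not None else None
    exclude = re.compile(exclude_pattern) if exclude_pattern is not None else None
    for depth, dirpath, filenames in entries:
        if depth < min_depth or (max_depth is not None and depth > max_depth):
            continue
        for name in filenames:
            if include is not None and include.match(name) is None:
                continue  # does not match the inclusion pattern
            if exclude is not None and exclude.match(name) is not None:
                continue  # matches the exclusion pattern
            yield dirpath + "/" + name


entries = [
    (0, "corpus", ["readme.md", "a.txt"]),
    (1, "corpus/news", ["b.txt", "draft_c.txt"]),
    (2, "corpus/news/old", ["d.txt"]),
]
selected = list(
    select_filenames(entries, max_depth=1, pattern=r".*\.txt$", exclude_pattern=r"draft_")
)
print(selected)
```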

property lines_are_documents

classmethod load(fname, mmap=None)

Load an object previously saved using save() from a file.

Parameters
  • fname (str) – Path to file that contains needed object.

  • mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.

See also

save()

Save object to file.

Returns

Object loaded from fname.

Return type

object

Raises

AttributeError – When called on an object instance instead of class (this is a class method).

property max_depth

property min_depth

property pattern

preprocess_text(text)

Apply self.character_filters, self.tokenizer, self.token_filters to a single text document.

Parameters

text (str) – Document read from plain-text file.

Returns

List of tokens extracted from text.

Return type

list of str

sample_texts(n, seed=None, length=None)

Generate n random documents from the corpus without replacement.

Parameters
  • n (int) – Number of documents we want to sample.

  • seed (int, optional) – If specified, use it as a seed for local random generator.

  • length (int, optional) – Value to use as the corpus length (because calculating the length of a corpus can be a costly operation). If not specified, __len__() is called.

Raises

ValueError – If n is less than zero or greater than the corpus size.

Notes

Given the number of remaining documents in a corpus, we need to choose n elements. The probability for the current element to be chosen is n / remaining. If we choose it, we decrease n and move to the next element.

Yields

list of str – Sampled document as sequence of tokens.

save(*args, **kwargs)

Save the in-memory state of the corpus.

Warning

This saves only the “state” of a corpus class, not the corpus data!

For saving data use the serialize method of the output format you’d like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).

static save_corpus(fname, corpus, id2word=None, metadata=False)

Save corpus to disk.

Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.

Notes

Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class).

In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time.

Calling serialize() is preferred to calling gensim.interfaces.CorpusABC.save_corpus().

Parameters
  • fname (str) – Path to output file.

  • corpus (iterable of list of (int, number)) – Corpus in BoW format.

  • id2word (Dictionary, optional) – Dictionary of corpus.

  • metadata (bool, optional) – Write additional metadata to a separate file too?

step_through_preprocess(text)

Apply each preprocessor in turn and yield its result.

Warning

This is useful for debugging issues with the corpus preprocessing pipeline.

Parameters

text (str) – Document text read from plain-text file.

Yields

(callable, object) – Pre-processor, output from pre-processor (based on text)

gensim.corpora.textcorpus.lower_to_unicode(text, encoding='utf8', errors='strict')

Lowercase text and convert to unicode, using gensim.utils.any2unicode().

Parameters
  • text (str) – Input text.

  • encoding (str, optional) – Encoding that will be used for conversion.

  • errors (str, optional) – Error handling behaviour, used as a parameter for the unicode function (Python 2 only).

Returns

Unicode version of text.

Return type

str

See also

gensim.utils.any2unicode()

Convert any string to unicode-string.

gensim.corpora.textcorpus.remove_short(tokens, minsize=3)

Remove tokens shorter than minsize chars.

Parameters
  • tokens (iterable of str) – Sequence of tokens.

  • minsize (int, optional) – Minimum token length (inclusive).

Returns

List of tokens without short tokens.

Return type

list of str

gensim.corpora.textcorpus.remove_stopwords(tokens, stopwords=frozenset({'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'did', 'didn', 'do', 'does', 'doesn', 'doing', 'don', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'kg', 'km', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 
're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'sincere', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'system', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'thick', 'thin', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'un', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves'}))

Remove stopwords using list from gensim.parsing.preprocessing.STOPWORDS.

Parameters
  • tokens (iterable of str) – Sequence of tokens.

  • stopwords (iterable of str, optional) – Sequence of stopwords

Returns

List of tokens without stopwords.

Return type

list of str

gensim.corpora.textcorpus.strip_multiple_whitespaces(s)

Collapse multiple whitespace characters into a single space.

Parameters

s (str) – Input string

Returns

String with collapsed whitespaces.

Return type

str

gensim.corpora.textcorpus.walk(top, topdown=True, onerror=None, followlinks=False, depth=0)

Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 4-tuple (depth, dirpath, dirnames, filenames).

Parameters
  • top (str) – Root directory.

  • topdown (bool, optional) – If True - you can modify dirnames in-place.

  • onerror (function, optional) – Function called with one argument, an OSError instance. It can report the error to continue with the walk, or raise the exception to abort the walk. Note that the filename is available as the filename attribute of the exception object.

  • followlinks (bool, optional) – If True, visit directories pointed to by symlinks, on systems that support them.

  • depth (int, optional) – Current depth in the file tree. Do not pass it manually (it is used as an accumulator for the recursion).

Notes

This is a mostly copied version of os.walk from the Python 2 source code. The only difference is that it returns the depth in the directory tree structure at which each yield is taking place.

Yields

(int, str, list of str, list of str) – Depth, current path, visited directories, visited non-directories.
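
For illustration, a comparable 4-tuple can be produced by wrapping the standard os.walk and deriving the depth from each directory's path relative to the root. This is a sketch under that assumption, not the recursive implementation described above; walk_with_depth is a hypothetical name.

```python
import os
import tempfile


def walk_with_depth(top):
    """Yield (depth, dirpath, dirnames, filenames), with depth 0 for top itself."""
    top = os.path.abspath(top)
    for dirpath, dirnames, filenames in os.walk(top):
        relative = os.path.relpath(dirpath, top)
        # "." means we are at the root; otherwise count the path components.
        depth = 0 if relative == os.curdir else relative.count(os.sep) + 1
        yield depth, dirpath, dirnames, filenames


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    depths = {
        os.path.relpath(dirpath, root): depth
        for depth, dirpath, _, _ in walk_with_depth(root)
    }

print(depths)
```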