corpora.wikicorpus – Corpus from a Wikipedia dump
Construct a corpus from a Wikipedia (or other MediaWiki-based) database dump.
Uses multiprocessing internally to parallelize the work and process the dump more quickly.
Notes
If you have the pattern package installed, this module will use lemmatization to obtain the lemma of each token (instead of the plain alphabetic tokenizer).
See gensim.scripts.make_wiki for a canned (example) command-line script based on this module.
- gensim.corpora.wikicorpus.ARTICLE_MIN_WORDS = 50
  Ignore shorter articles (after full preprocessing).
- gensim.corpora.wikicorpus.IGNORED_NAMESPACES = ['Wikipedia', 'Category', 'File', 'Portal', 'Template', 'MediaWiki', 'User', 'Help', 'Book', 'Draft', 'WikiProject', 'Special', 'Talk']
  MediaWiki namespaces that ought to be ignored.
- gensim.corpora.wikicorpus.RE_P0 = re.compile('<!--.*?-->', re.DOTALL)
  Comments.
- gensim.corpora.wikicorpus.RE_P1 = re.compile('<ref([> ].*?)(</ref>|/>)', re.DOTALL)
  Footnotes.
- gensim.corpora.wikicorpus.RE_P10 = re.compile('<math([> ].*?)(</math>|/>)', re.DOTALL)
  Math content.
- gensim.corpora.wikicorpus.RE_P11 = re.compile('<(.*?)>', re.DOTALL)
  All other tags.
- gensim.corpora.wikicorpus.RE_P12 = re.compile('(({\\|)|(\\|-(?!\\d))|(\\|}))(.*?)(?=\\n)')
  Table formatting.
- gensim.corpora.wikicorpus.RE_P13 = re.compile('(?<=(\\n[ ])|(\\n\\n)|([ ]{2})|(.\\n)|(.\\t))(\\||\\!)([^[\\]\\n]*?\\|)*')
  Table cell formatting.
- gensim.corpora.wikicorpus.RE_P14 = re.compile('\\[\\[Category:[^][]*\\]\\]')
  Categories.
- gensim.corpora.wikicorpus.RE_P15 = re.compile('\\[\\[([fF]ile:|[iI]mage)[^]]*(\\]\\])')
  Remove File and Image templates.
- gensim.corpora.wikicorpus.RE_P16 = re.compile('\\[{2}(.*?)\\]{2}')
  Capture interlink text and the linked article.
- gensim.corpora.wikicorpus.RE_P17 = re.compile('(\\n.{0,4}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=)|(scope=))(.*))|(^.{0,2}((bgcolor)|(\\d{0,1}[ ]?colspan)|(rowspan)|(style=)|(class=)|(align=))(.*))')
  Table markup.
- gensim.corpora.wikicorpus.RE_P2 = re.compile('(\\n\\[\\[[a-z][a-z][\\w-]*:[^:\\]]+\\]\\])+$')
  Links to languages.
- gensim.corpora.wikicorpus.RE_P3 = re.compile('{{([^}{]*)}}', re.DOTALL)
  Template.
- gensim.corpora.wikicorpus.RE_P4 = re.compile('{{([^}]*)}}', re.DOTALL)
  Template.
- gensim.corpora.wikicorpus.RE_P5 = re.compile('\\[(\\w+):\\/\\/(.*?)(( (.*?))|())\\]')
  Remove URL, keep description.
- gensim.corpora.wikicorpus.RE_P6 = re.compile('\\[([^][]*)\\|([^][]*)\\]', re.DOTALL)
  Simplify links, keep description.
- gensim.corpora.wikicorpus.RE_P7 = re.compile('\\n\\[\\[[iI]mage(.*?)(\\|.*?)*\\|(.*?)\\]\\]')
  Keep description of images.
- gensim.corpora.wikicorpus.RE_P8 = re.compile('\\n\\[\\[[fF]ile(.*?)(\\|.*?)*\\|(.*?)\\]\\]')
  Keep description of files.
- gensim.corpora.wikicorpus.RE_P9 = re.compile('<nowiki([> ].*?)(</nowiki>|/>)', re.DOTALL)
  External links.
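These module-level patterns are ordinary compiled regular expressions, so they can also be applied outside the corpus pipeline. Below is a minimal standalone sketch that re-declares two of the patterns listed above (RE_P0 for comments, RE_P14 for categories) so it runs without importing gensim:

```python
import re

# Same patterns as the module-level RE_P0 (comments) and RE_P14 (categories).
RE_P0 = re.compile('<!--.*?-->', re.DOTALL)
RE_P14 = re.compile('\\[\\[Category:[^][]*\\]\\]')

markup = "Hello <!-- hidden note --> world [[Category:Examples]]"
no_comments = RE_P0.sub('', markup)   # strip HTML comments
clean = RE_P14.sub('', no_comments)   # strip category links
print(clean)
```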
-
class gensim.corpora.wikicorpus.WikiCorpus(fname, processes=None, lemmatize=False, dictionary=None, filter_namespaces=('0',), tokenizer_func=<function tokenize>, article_min_tokens=50, token_min_len=2, token_max_len=15, lower=True, filter_articles=None)
Bases: gensim.corpora.textcorpus.TextCorpus
Treat a Wikipedia articles dump as a read-only, streamed, memory-efficient corpus.
Supported dump formats:
- <LANG>wiki-<YYYYMMDD>-pages-articles.xml.bz2
- <LANG>wiki-latest-pages-articles.xml.bz2
The documents are extracted on the fly, so that the whole (massive) dump can stay compressed on disk.
Notes
Dumps for the English Wikipedia can be found at https://dumps.wikimedia.org/enwiki/.
- metadata
  Whether to write article titles to the serialized corpus.
  - Type
    bool
Warning
“Multistream” archives are not supported in Python 2 due to limitations in the core bz2 library.
Examples
>>> from gensim.test.utils import datapath, get_tmpfile
>>> from gensim.corpora import WikiCorpus, MmCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>> corpus_path = get_tmpfile("wiki-corpus.mm")
>>>
>>> wiki = WikiCorpus(path_to_wiki_dump)  # create word->word_id mapping, ~8h on full wiki
>>> MmCorpus.serialize(corpus_path, wiki)  # another 8h, creates a file in MatrixMarket format and mapping
Initialize the corpus.
Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.
- Parameters
  fname (str) – Path to the Wikipedia dump file.
  processes (int, optional) – Number of processes to run; defaults to max(1, number of CPUs - 1).
  lemmatize (bool) – Use lemmatization instead of simple regexp tokenization. Defaults to True if you have the pattern package installed.
  dictionary (Dictionary, optional) – Dictionary to use. If not provided, the corpus is scanned once to determine its vocabulary. IMPORTANT: this takes a very long time.
  filter_namespaces (tuple of str, optional) – Namespaces to consider.
  tokenizer_func (function, optional) – Function used for tokenization. By default, tokenize() is used. If you inject your own tokenizer, it must conform to this interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
  article_min_tokens (int, optional) – Minimum number of tokens in an article. Articles with fewer tokens are ignored.
  token_min_len (int, optional) – Minimal token length.
  token_max_len (int, optional) – Maximal token length.
  lower (bool, optional) – If True, convert all text to lower case.
  filter_articles (callable or None, optional) – If set, each XML article element is passed to this callable before being processed. Only articles for which the callable returns an XML element are processed; returning None filters out articles based on customised rules.
Warning
Unless a dictionary is provided, this scans the corpus once, to determine its vocabulary.
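As a sketch of the tokenizer_func interface described above, here is a hypothetical regexp tokenizer (an illustration, not gensim's built-in tokenize()) that conforms to the required signature:

```python
import re

def simple_tokenizer(text, token_min_len, token_max_len, lower):
    # Conforms to the required interface:
    # tokenizer_func(text, token_min_len, token_max_len, lower) -> list of str
    if lower:
        text = text.lower()
    tokens = re.findall(r'\w+', text, re.UNICODE)
    # Keep only tokens within the configured character-length bounds.
    return [t for t in tokens if token_min_len <= len(t) <= token_max_len]

print(simple_tokenizer("The Quick brown FOX", 2, 15, True))
```

Such a function could then be passed as WikiCorpus(..., tokenizer_func=simple_tokenizer).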
- get_texts()
  Iterate over the dump, yielding a list of tokens for each article that passed the length and namespace filtering.
Uses multiprocessing internally to parallelize the work and process the dump more quickly.
Notes
This iterates over the texts. If you want vectors, just use the standard corpus interface instead of this method:
Examples
>>> from gensim.test.utils import datapath
>>> from gensim.corpora import WikiCorpus
>>>
>>> path_to_wiki_dump = datapath("enwiki-latest-pages-articles1.xml-p000000010p000030302-shortened.bz2")
>>>
>>> for vec in WikiCorpus(path_to_wiki_dump):
...     pass
- Yields
  list of str – If metadata is False, yields only the list of tokens extracted from the article.
  (list of str, (int, str)) – Otherwise, the list of tokens extracted from the article, plus the page id and article title.
- getstream()
  Generate documents from the underlying plain text collection (of one or more files).
- Yields
str – Document read from plain-text file.
Notes
After the generator is exhausted, the self.length attribute is initialized.
- init_dictionary(dictionary)
  Initialize/update the dictionary.
- Parameters
  dictionary (Dictionary, optional) – If a dictionary is provided, it will not be updated with the given corpus on initialization. If None, a new dictionary will be built for the given corpus.
Notes
If self.input is None, nothing is done.
- property input
- classmethod load(fname, mmap=None)
  Load an object previously saved using save() from a file.
- Parameters
fname (str) – Path to file that contains needed object.
mmap (str, optional) – Memory-map option. If the object was saved with large arrays stored separately, you can load these arrays via mmap (shared memory) using mmap='r'. If the file being loaded is compressed (either '.gz' or '.bz2'), then mmap=None must be set.
See also
save() – Save object to file.
- Returns
Object loaded from fname.
- Return type
object
- Raises
AttributeError – When called on an object instance instead of class (this is a class method).
- preprocess_text(text)
  Apply self.character_filters, self.tokenizer and self.token_filters to a single text document.
- Parameters
text (str) – Document read from plain-text file.
- Returns
List of tokens extracted from text.
- Return type
list of str
- sample_texts(n, seed=None, length=None)
  Generate n random documents from the corpus without replacement.
- Parameters
n (int) – Number of documents we want to sample.
seed (int, optional) – If specified, use it as a seed for local random generator.
length (int, optional) – Value to use as the corpus length (because calculating the length of the corpus can be a costly operation). If not specified, the corpus length is determined by calling len() on it.
- Raises
ValueError – If n is less than zero or greater than the corpus size.
Notes
Given the number of remaining documents in the corpus, we need to choose n elements. The probability of choosing the current element is n / remaining. If it is chosen, n is decreased, and we move to the next element.
- Yields
list of str – Sampled document as sequence of tokens.
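The selection rule described in the notes can be sketched as a standalone generator (a simplified illustration with a hypothetical function name, not gensim's actual implementation):

```python
import random

def sample_without_replacement(corpus, n, length, seed=None):
    # Selection sampling: keep each document with probability
    # n_remaining / docs_remaining. When remaining == n the probability
    # reaches 1, so exactly n documents are always yielded.
    rng = random.Random(seed)
    remaining = length
    for doc in corpus:
        if n <= 0:
            break
        if rng.random() < n / remaining:
            yield doc
            n -= 1
        remaining -= 1

docs = [["alpha"], ["beta"], ["gamma"], ["delta"]]
sampled = list(sample_without_replacement(docs, 2, len(docs), seed=42))
```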
- save(*args, **kwargs)
  Save the corpus's in-memory state.
Warning
This saves only the "state" of the corpus class, not the corpus data!
For saving data, use the serialize method of the output format you'd like to use (e.g. gensim.corpora.mmcorpus.MmCorpus.serialize()).
- static save_corpus(fname, corpus, id2word=None, metadata=False)
  Save corpus to disk.
Some formats support saving the dictionary (feature_id -> word mapping), which can be provided by the optional id2word parameter.
Notes
Some corpora also support random access via document indexing, so that the documents on disk can be accessed in O(1) time (see the gensim.corpora.indexedcorpus.IndexedCorpus base class). In this case, save_corpus() is automatically called internally by serialize(), which does save_corpus() plus saves the index at the same time. Calling serialize() is preferred to calling save_corpus() directly.
- Parameters
fname (str) – Path to output file.
corpus (iterable of list of (int, number)) – Corpus in BoW format.
id2word (Dictionary, optional) – Dictionary of the corpus.
metadata (bool, optional) – Write additional metadata to a separate file too?
- step_through_preprocess(text)
  Apply the preprocessors one by one and generate the result.
Warning
This is useful for debugging issues with the corpus preprocessing pipeline.
- Parameters
text (str) – Document text read from plain-text file.
- Yields
(callable, object) – The preprocessor applied, and its output for text.
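The behaviour can be pictured with a hypothetical standalone sketch that applies a list of preprocessors in order, yielding each intermediate result for inspection:

```python
def step_through(text, steps):
    # Apply each preprocessor in order, yielding (step, intermediate result)
    # so every stage of the pipeline can be inspected while debugging.
    for step in steps:
        text = step(text)
        yield step, text

steps = [str.lower, str.strip]
trace = list(step_through("  HELLO World  ", steps))
for func, intermediate in trace:
    print(func.__name__, repr(intermediate))
```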
- gensim.corpora.wikicorpus.extract_pages(f, filter_namespaces=False, filter_articles=None)
  Extract pages from a MediaWiki database dump.
- Parameters
f (file) – File-like object.
filter_namespaces (list of str or bool) – Namespaces that will be extracted.
- Yields
tuple of (str or None, str, str) – Title, text and page id.
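To illustrate the streaming idea behind page extraction (a hypothetical simplified sketch using the standard library, not gensim's actual implementation, which also handles namespaces and filtering):

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    # Stream (title, text) pairs from a MediaWiki-style XML file without
    # loading the whole document into memory.
    for _event, elem in ET.iterparse(xml_file, events=('end',)):
        if elem.tag.endswith('page'):
            title = elem.findtext('title')
            text = elem.findtext('revision/text')
            yield title, text
            elem.clear()  # free memory used by already-processed pages

sample = io.StringIO(
    "<mediawiki>"
    "<page><title>A</title><revision><text>alpha</text></revision></page>"
    "<page><title>B</title><revision><text>beta</text></revision></page>"
    "</mediawiki>"
)
pages = list(iter_pages(sample))
```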
- gensim.corpora.wikicorpus.filter_example(elem, text, *args, **kwargs)
  Example function for filtering arbitrary documents from a Wikipedia dump.
  The custom filter function is called _before_ tokenisation and should work on the raw text and/or XML element information.
  The filter function gets the entire context of the XML element passed into it, but you can of course choose not to use some or all parts of the context. Please refer to gensim.corpora.wikicorpus.extract_pages() for the exact details of the page context.
- Parameters
elem (etree.Element) – XML etree element
text (str) – The text of the XML node
namespace (str) – XML namespace of the XML element
title (str) – Page title
page_tag (str) – XPath expression for page.
text_path (str) – XPath expression for text.
title_path (str) – XPath expression for title.
ns_path (str) – XPath expression for namespace.
pageid_path (str) – XPath expression for page id.
Example
>>> import gensim.corpora
>>> filter_func = gensim.corpora.wikicorpus.filter_example
>>> dewiki = gensim.corpora.WikiCorpus(
...     './dewiki-20180520-pages-articles-multistream.xml.bz2',
...     filter_articles=filter_func)
- gensim.corpora.wikicorpus.filter_wiki(raw, promote_remaining=True, simplify_links=True)
  Filter out wiki markup from raw, leaving only text.
- Parameters
raw (str) – Unicode or utf-8 encoded string.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.
- Returns
raw without markup.
- Return type
str
- gensim.corpora.wikicorpus.find_interlinks(raw)
  Find all interlinks to other articles in the dump.
- Parameters
raw (str) – Unicode or utf-8 encoded string.
- Returns
List of tuples in format [(linked article, the actual text found), …].
- Return type
list
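A standalone sketch of the idea, re-declaring the RE_P16 pattern listed above so it runs without gensim (the helper name is hypothetical):

```python
import re

# Same pattern as the module-level RE_P16: matches [[target|label]] and [[target]].
RE_P16 = re.compile('\\[{2}(.*?)\\]{2}')

def find_links(raw):
    # Return (linked article, anchor text) pairs; a bare [[target]] link
    # uses the target itself as its anchor text.
    pairs = []
    for match in RE_P16.findall(raw):
        if '|' in match:
            target, text = match.split('|', 1)
        else:
            target = text = match
        pairs.append((target, text))
    return pairs

links = find_links("See [[Python (language)|Python]] and [[NumPy]].")
```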
- gensim.corpora.wikicorpus.get_namespace(tag)
  Get the namespace of tag.
- Parameters
tag (str) – Namespace or tag.
- Returns
Matched namespace or tag.
- Return type
str
- gensim.corpora.wikicorpus.init_to_ignore_interrupt()
  Enable ignoring of interrupt signals.
Warning
Should only be used when the master process is prepared to handle termination of child processes.
- gensim.corpora.wikicorpus.process_article(args, tokenizer_func=<function tokenize>, token_min_len=2, token_max_len=15, lower=True)
  Parse a Wikipedia article, extracting all tokens.
Notes
Set the tokenizer_func parameter (default is tokenize()) for languages like Japanese or Thai to perform better tokenization. The tokenizer_func needs to take 4 parameters: (text: str, token_min_len: int, token_max_len: int, lower: bool).
- Parameters
  args ((str, bool, str, int)) – Article text, lemmatize flag (if True, lemmatize() will be used), article title, page identifier.
  tokenizer_func (function) – Function for tokenization (default is tokenize()). Needs to have the interface: tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list of str.
  token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert article text to lower case?
- Returns
List of tokens from article, title and page id.
- Return type
(list of str, str, int)
- gensim.corpora.wikicorpus.remove_file(s)
  Remove the 'File:' and 'Image:' markup, keeping the file caption.
- Parameters
s (str) – String containing ‘File:’ and ‘Image:’ markup.
- Returns
Copy of s with all the 'File:' and 'Image:' markup replaced by their corresponding captions.
- Return type
str
- gensim.corpora.wikicorpus.remove_markup(text, promote_remaining=True, simplify_links=True)
  Filter out wiki markup from text, leaving only text.
- Parameters
text (str) – String containing markup.
promote_remaining (bool) – Whether uncaught markup should be promoted to plain text.
simplify_links (bool) – Whether links should be simplified keeping only their description text.
- Returns
text without markup.
- Return type
str
- gensim.corpora.wikicorpus.remove_template(s)
  Remove template wikimedia markup.
- Parameters
s (str) – String containing template markup.
- Returns
Copy of s with all the wikimedia template markup removed.
- Return type
str
Notes
Since templates can be nested, it is difficult to remove them using regular expressions.
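One common workaround, sketched here with the RE_P3 pattern listed above, is to strip innermost templates repeatedly until the text stops changing (a hypothetical illustration, not necessarily how remove_template is implemented):

```python
import re

# Same pattern as the module-level RE_P3: matches innermost {{...}} templates
# only, since the character class [^}{] cannot cross a nested brace.
RE_P3 = re.compile('{{([^}{]*)}}', re.DOTALL)

def strip_templates(s):
    # Repeatedly remove innermost templates; each pass peels one nesting level.
    prev = None
    while prev != s:
        prev = s
        s = RE_P3.sub('', s)
    return s

result = strip_templates("x {{outer {{inner}} tail}} y")
```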
- gensim.corpora.wikicorpus.tokenize(content, token_min_len=2, token_max_len=15, lower=True)
  Tokenize a piece of text from Wikipedia.
Set token_min_len and token_max_len as character-length (not byte!) thresholds for individual tokens.
- Parameters
  content (str) – String without markup (see filter_wiki()).
  token_min_len (int) – Minimal token length.
token_max_len (int) – Maximal token length.
lower (bool) – Convert content to lower case?
- Returns
List of tokens from content.
- Return type
list of str