summarization.textcleaner – Preprocessing for TextRank summarization
This module contains functions and processors used for processing text, extracting sentences from text, and working with acronyms and abbreviations.
-
gensim.summarization.textcleaner.clean_text_by_sentences(text)
Tokenize a given text into sentences, then apply filters and lemmatize them.
- Parameters
text (str) – Given text.
- Returns
Sentences of the given text.
- Return type
list of SyntacticUnit
-
gensim.summarization.textcleaner.clean_text_by_word(text, deacc=True)
Tokenize a given text into words, then apply filters and lemmatize them.
- Parameters
text (str) – Given text.
deacc (bool, optional) – Remove accentuation if True.
- Returns
Words as keys, SyntacticUnit as values.
- Return type
dict
Example
>>> from gensim.summarization.textcleaner import clean_text_by_word
>>> clean_text_by_word("God helps those who help themselves")
{'god': Original unit: 'god' *-*-*-* Processed unit: 'god',
'help': Original unit: 'help' *-*-*-* Processed unit: 'help',
'helps': Original unit: 'helps' *-*-*-* Processed unit: 'help'}
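The word-to-unit mapping shown above can be imitated with the standard library alone. The following is a minimal sketch, not gensim's implementation: the stopword set, `toy_stem`, and `clean_by_word_sketch` are hypothetical names, and the "stemmer" only strips a trailing "s" so that the example stays self-contained.

```python
import re

# Hypothetical stand-ins: a tiny stopword list and a naive "stemmer".
# gensim's real filters and stemming rules are more elaborate.
STOPWORDS = {"those", "who", "themselves"}


def toy_stem(word):
    """Strip a trailing 's' from words longer than three letters (illustrative only)."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def clean_by_word_sketch(text):
    """Map each kept lowercase word to an (original, processed) pair."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w: (w, toy_stem(w)) for w in words if w not in STOPWORDS}


units = clean_by_word_sketch("God helps those who help themselves")
print(units)
```

Both 'help' and 'helps' keep their own key while sharing the processed form 'help', which mirrors the structure of the documented output.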
-
gensim.summarization.textcleaner.get_sentences(text)
Sentence generator from provided text. The sentence pattern is set in RE_SENTENCE.
- Parameters
text (str) – Input text.
- Yields
str – Single sentence extracted from text.
Example
>>> from gensim.summarization.textcleaner import get_sentences
>>> text = "Does this text contain two sentences? Yes, it does."
>>> for sentence in get_sentences(text):
...     print(sentence)
Does this text contain two sentences?
Yes, it does.
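A regex-driven sentence generator of this kind might look roughly like the sketch below. `SENTENCE_PATTERN` and `iter_sentences` are hypothetical names, and the pattern is a simplified assumption; it is not the RE_SENTENCE pattern used by gensim.

```python
import re

# Hypothetical pattern: a run of non-terminator characters followed by
# any sentence-ending punctuation. Not gensim's RE_SENTENCE.
SENTENCE_PATTERN = re.compile(r"[^.!?]+[.!?]*")


def iter_sentences(text):
    """Lazily yield stripped, non-empty sentences found in text."""
    for match in SENTENCE_PATTERN.finditer(text):
        sentence = match.group().strip()
        if sentence:
            yield sentence


sentences = list(iter_sentences("Does this text contain two sentences? Yes, it does."))
print(sentences)
```

Because the function is a generator, sentences are produced one at a time and the full list is only built when the caller asks for it.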
-
gensim.summarization.textcleaner.join_words(words, separator=' ')
Concatenate words with a separator between elements.
- Parameters
words (list of str) – Given words.
separator (str, optional) – The separator between elements.
- Returns
String of merged words with separator between elements.
- Return type
str
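The documented behaviour corresponds to a plain string join. A minimal equivalent, using the hypothetical name `join_words_sketch` so that it runs without gensim:

```python
def join_words_sketch(words, separator=" "):
    """Concatenate words, placing separator between consecutive elements."""
    return separator.join(words)


joined_default = join_words_sketch(["text", "rank"])
joined_custom = join_words_sketch(["text", "rank"], separator="_")
print(joined_default, joined_custom)
```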
-
gensim.summarization.textcleaner.merge_syntactic_units(original_units, filtered_units, tags=None)
Process the given sentences and their filtered (tokenized) copies into SyntacticUnit objects. Tags, if provided, are also added to the produced units.
- Parameters
original_units (list) – List of original sentences.
filtered_units (list) – List of tokenized sentences.
tags (list of str, optional) – List of strings used as tags for each unit. Defaults to None.
- Returns
List of syntactic units (sentences).
- Return type
list of gensim.summarization.syntactic_unit.SyntacticUnit
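The pairing of original and filtered units can be pictured with the sketch below. `Unit` is a hypothetical stand-in for SyntacticUnit, `merge_units_sketch` is an illustrative name, and skipping units whose filtered form is empty is an assumption rather than documented behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical stand-in for SyntacticUnit: original text, processed token, optional tag."""
    text: str
    token: str
    tag: Optional[str] = None


def merge_units_sketch(original_units, filtered_units, tags=None):
    """Pair each original unit with its filtered copy and, if given, its tag."""
    units = []
    for i, (original, filtered) in enumerate(zip(original_units, filtered_units)):
        if filtered == "":
            continue  # assumption: units with nothing left after filtering are dropped
        tag = tags[i] if tags else None
        units.append(Unit(original, filtered, tag))
    return units


merged = merge_units_sketch(
    ["Dogs bark.", "!!!", "Cats purr."],
    ["dog bark", "", "cat purr"],
    tags=["A", "B", "C"],
)
print(merged)
```

Tags are looked up by the original index, so a tag stays attached to its sentence even when an earlier unit is skipped.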
-
gensim.summarization.textcleaner.replace_abbreviations(text)
Replace the blank space between an abbreviation and the next word with the '@' separator.
- Parameters
text (str) – Input sentence.
- Returns
Sentence with changed separator.
- Return type
str
Example
>>> replace_abbreviations("God bless you, please, Mrs. Robinson")
God bless you, please, Mrs.@Robinson
-
gensim.summarization.textcleaner.replace_with_separator(text, separator, regexs)
Return the text with separators replaced wherever the provided regular expressions match.
- Parameters
text (str) – Input text.
separator (str) – The separator between words to be replaced.
regexs (list of _sre.SRE_Pattern) – Regular expressions used in processing text.
- Returns
Text with replaced separators.
- Return type
str
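One way to read this function is as a loop of regex substitutions over the text. The sketch below assumes that each pattern captures two groups (the text before and after the whitespace to be replaced); that convention, the `TITLE` pattern, and the name `replace_with_separator_sketch` are assumptions made for illustration.

```python
import re


def replace_with_separator_sketch(text, separator, regexs):
    """Rejoin the two captured groups of every match with separator."""
    for regex in regexs:
        # A function replacement avoids any escaping issues in separator.
        text = regex.sub(lambda m: m.group(1) + separator + m.group(2), text)
    return text


# Hypothetical pattern: a title abbreviation with its dot, one whitespace
# character, then the first character of the next word.
TITLE = re.compile(r"(\b(?:Mr|Mrs|Ms|Dr)\.)\s(\w)")

result = replace_with_separator_sketch(
    "God bless you, please, Mrs. Robinson", "@", [TITLE]
)
print(result)
```

Only the whitespace between the two groups is swapped out; everything the groups capture is written back unchanged.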
-
gensim.summarization.textcleaner.split_sentences(text)
Split the given text and return a list of its sentences. Abbreviations set in AB_SENIOR and AB_ACRONYM are preserved.
- Parameters
text (str) – Input text.
- Returns
Sentences of given text.
- Return type
list of str
Example
>>> from gensim.summarization.textcleaner import split_sentences
>>> text = '''Beautiful is better than ugly.
... Explicit is better than implicit. Simple is better than complex.'''
>>> split_sentences(text)
['Beautiful is better than ugly.',
'Explicit is better than implicit.',
'Simple is better than complex.']
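Why abbreviations need to be preserved becomes clearer when a naive splitter is compared with one that protects them first. The sketch below is a simplified assumption of that protect, split, restore sequence; `ABBREVIATION`, `BOUNDARY`, `naive_split`, and `split_sentences_sketch` are hypothetical and do not reproduce gensim's AB_SENIOR or AB_ACRONYM patterns.

```python
import re

# Hypothetical patterns, much simpler than the ones gensim defines.
ABBREVIATION = re.compile(r"\b(Mr|Mrs|Ms|Dr)\.\s(\w)")
BOUNDARY = re.compile(r"(?<=[.!?])\s+")  # whitespace that follows ., ! or ?
SEPARATOR = "@"


def naive_split(text):
    """Split on whitespace after sentence-ending punctuation."""
    return [s for s in BOUNDARY.split(text.strip()) if s]


def split_sentences_sketch(text):
    """Protect abbreviations, split, then restore the protected spaces."""
    protected = ABBREVIATION.sub(
        lambda m: f"{m.group(1)}.{SEPARATOR}{m.group(2)}", text
    )
    # Restoring by plain string replacement is a simplification: it would
    # also alter any literal ".@" already present in the input.
    return [s.replace("." + SEPARATOR, ". ") for s in naive_split(protected)]


text = "Please call Mrs. Robinson today. She is waiting."
naive = naive_split(text)
careful = split_sentences_sketch(text)
print(naive)
print(careful)
```

The naive splitter breaks the first sentence after "Mrs.", while the protected version keeps it whole.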
-
gensim.summarization.textcleaner.tokenize_by_word(text)
Tokenize the input text. Before tokenizing, the text is converted to lower case, and accentuation and the acronyms set in AB_ACRONYM_LETTERS are removed.
- Parameters
text (str) – Given text.
- Returns
Generator that yields the words of the given text in sequence.
- Return type
generator
Example
>>> from gensim.summarization.textcleaner import tokenize_by_word
>>> g = tokenize_by_word('Veni. Vedi. Vici.')
>>> print(next(g))
veni
>>> print(next(g))
vedi
>>> print(next(g))
vici
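The lower-casing and accent removal steps can be approximated with the standard library, as in the sketch below. `deaccent_sketch` and `tokenize_by_word_sketch` are hypothetical names, the token pattern only keeps ASCII letters, and the acronym handling tied to AB_ACRONYM_LETTERS is deliberately left out.

```python
import re
import unicodedata


def deaccent_sketch(text):
    """Drop combining marks after canonical decomposition (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tokenize_by_word_sketch(text):
    """Yield lowercase, de-accented ASCII-letter tokens one at a time."""
    for match in re.finditer(r"[a-z]+", deaccent_sketch(text.lower())):
        yield match.group()


# "Caf\u00e9" is "Café" written with an escape to keep the source ASCII-only.
gen = tokenize_by_word_sketch("Veni. Vedi. Vici. Caf\u00e9!")
first = next(gen)   # the generator produces tokens on demand
rest = list(gen)    # consume the remaining tokens
print(first, rest)
```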
-
gensim.summarization.textcleaner.undo_replacement(sentence)
Replace the '@' separator after each abbreviation with a blank space again.
- Parameters
sentence (str) – Input sentence.
- Returns
Sentence with changed separator.
- Return type
str
Example
>>> undo_replacement("God bless you, please, Mrs.@Robinson")
God bless you, please, Mrs. Robinson