summarization.mz_entropy – Keywords for the Montemurro and Zanette entropy algorithm
gensim.summarization.mz_entropy.count_freqs_by_blocks(words, vocab, blocksize)
Count word frequencies in consecutive chunks of blocksize words.
- Parameters
words (list(str)) – List of all words.
vocab (list(str)) – List of words in vocabulary.
blocksize (int) – Size of blocks to use for count.
- Returns
results – Array with one row per chunk. Each row lists the word frequencies in that chunk, in the same order as the words in vocab.
- Return type
numpy.array(list(double))
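The block-wise counting described above can be sketched in plain Python. This is a minimal illustration, not gensim's implementation: the name `count_by_blocks` is hypothetical, it returns nested lists rather than a NumPy array, and it assumes that every word appears in `vocab` and that a shorter trailing block is counted as its own row.

```python
def count_by_blocks(words, vocab, blocksize):
    """Count how often each vocabulary word occurs in each block of words.

    Hypothetical helper for illustration only; it returns one row per block,
    with columns in the same order as `vocab`.
    """
    # Map each vocabulary word to its column position.
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for start in range(0, len(words), blocksize):
        counts = [0.0] * len(vocab)
        # Assumption: a shorter final block is still counted as a row.
        for word in words[start:start + blocksize]:
            counts[index[word]] += 1.0
        rows.append(counts)
    return rows


words = "a b a c b a".split()
vocab = sorted(set(words))  # ['a', 'b', 'c']
table = count_by_blocks(words, vocab, blocksize=3)
# Blocks are "a b a" and "c b a", so table has two rows of three counts.
```

Each row corresponds to one block and each column to one vocabulary word, which matches the ordering guarantee stated in the return description.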
gensim.summarization.mz_entropy.mz_keywords(text, blocksize=1024, scores=False, split=False, weighted=True, threshold=0.0)
Extract keywords from text using the Montemurro and Zanette entropy algorithm [1].
- Parameters
text (str) – Document to extract keywords from.
blocksize (int, optional) – Size of blocks, in words, to use in the analysis.
scores (bool, optional) – Whether to return the score with each keyword.
split (bool, optional) – Whether to return the results as a list.
weighted (bool, optional) – Whether to weight scores by word frequency. Setting this to False can be useful for shorter texts and allows automatic thresholding.
threshold (float or 'auto', optional) – Minimum score for returned keywords. 'auto' calculates the threshold as n_blocks / (n_blocks + 1.0) + 1e-8; use 'auto' with weighted=False.
- Returns
results (str) – newline-separated keywords if split == False and scores == False, OR
results (list(str)) – list of keywords if split == True and scores == False, OR
results (list(tuple(str, float))) – list of (keyword, score) tuples if scores == True.
Results are returned in descending order of score regardless of the format.
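The threshold and return-format rules above can be sketched as follows. This is a hedged illustration rather than the library's code: `select_keywords` is a hypothetical helper, the input scores are made-up values, and the sketch assumes a keyword is kept only when its score is strictly greater than the threshold.

```python
def select_keywords(word_scores, n_blocks, threshold=0.0, scores=False, split=False):
    """Filter, sort, and format (word, score) pairs. Hypothetical helper."""
    if threshold == 'auto':
        # Formula quoted from the parameter description.
        threshold = n_blocks / (n_blocks + 1.0) + 1e-8
    # Assumption: strict comparison against the threshold.
    kept = [(word, score) for word, score in word_scores if score > threshold]
    # Descending order of score, regardless of output format.
    kept.sort(key=lambda pair: -pair[1])
    if scores:
        return kept
    keywords = [word for word, _ in kept]
    return keywords if split else "\n".join(keywords)


# Made-up scores, for illustration only.
example = [("whale", 0.9), ("the", 0.1), ("ship", 0.95)]

as_text = select_keywords(example, n_blocks=3, threshold='auto')
as_list = select_keywords(example, n_blocks=3, threshold='auto', split=True)
as_pairs = select_keywords(example, n_blocks=3, threshold='auto', scores=True)
```

With three blocks, the 'auto' threshold is 3 / 4 + 1e-8, so only the two higher-scoring words survive in this example; the three calls differ only in how the same ranked result is packaged.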
Note
This algorithm looks for keywords that contribute to the structure of the text on scales of blocksize words or larger. It is suitable for extracting keywords representing the major themes of long texts.
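To make the idea of "contributing to structure" concrete, the sketch below computes the Shannon entropy of a single word's occurrences across blocks. It is a simplified illustration only: `block_entropy` is a hypothetical name, the counts are invented, and the sketch omits the comparison against the entropy expected for randomly shuffled text that the cited paper uses to derive the final score.

```python
import math


def block_entropy(counts):
    """Shannon entropy, in bits, of one word's counts across blocks.

    Hypothetical helper; `counts` holds the word's frequency in each block.
    """
    total = sum(counts)
    # Blocks where the word is absent contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Invented counts over four blocks, 20 occurrences each.
spread_evenly = block_entropy([5, 5, 5, 5])  # maximal for four blocks: log2(4) bits
clustered = block_entropy([20, 0, 0, 0])     # minimal: all occurrences in one block
```

A word spread evenly over all blocks carries little information about where in the text it occurs, while a word clustered in a few blocks has low entropy and marks a local theme. That contrast is what makes the approach better suited to long texts, where there are enough blocks for such clustering to be visible.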
References
[1] Marcello A Montemurro, Damian Zanette, “Towards the quantification of the semantic information encoded in written language”. Advances in Complex Systems, Volume 13, Issue 2 (2010), pp. 135-153, DOI: 10.1142/S0219525910002530, https://arxiv.org/abs/0907.1558