summarization.mz_entropy – Keywords for the Montemurro and Zanette entropy algorithm

gensim.summarization.mz_entropy.count_freqs_by_blocks(words, vocab, blocksize)

Count word frequencies in each chunk (block) of blocksize words.

Parameters
  • words (list(str)) – List of all words.

  • vocab (list(str)) – List of words in vocabulary.

  • blocksize (int) – Size of blocks to use for count.

Returns

results – Array of lists of word frequencies, one list per chunk. Within each list, the frequencies follow the order of the words in vocab.

Return type

numpy.array(list(double))
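
As an illustration of the block counting described above, here is a minimal standalone sketch. It is not gensim's implementation: the helper name count_by_blocks_sketch is hypothetical, and dropping a trailing partial block is an assumption made for this example.

```python
from collections import Counter

import numpy as np


def count_by_blocks_sketch(words, vocab, blocksize):
    """Hypothetical helper: per-block word counts, columns ordered as in vocab."""
    index = {word: i for i, word in enumerate(vocab)}
    # Assumption: a trailing block shorter than blocksize is dropped.
    n_blocks = len(words) // blocksize
    counts = np.zeros((n_blocks, len(vocab)), dtype=float)
    for b in range(n_blocks):
        block = words[b * blocksize:(b + 1) * blocksize]
        for word, n in Counter(block).items():
            counts[b, index[word]] = n
    return counts


words = "a b a c b a d".split()
vocab = sorted(set(words))  # ['a', 'b', 'c', 'd']
counts = count_by_blocks_sketch(words, vocab, blocksize=3)
print(counts)  # one row per block, one column per vocabulary word
```

Each row corresponds to one block and each column to one vocabulary word, which matches the documented ordering of the result.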

gensim.summarization.mz_entropy.mz_keywords(text, blocksize=1024, scores=False, split=False, weighted=True, threshold=0.0)

Extract keywords from text using the Montemurro and Zanette entropy algorithm [1].

Parameters
  • text (str) – Document for summarization.

  • blocksize (int, optional) – Size of blocks to use in analysis.

  • scores (bool, optional) – Whether to return scores with keywords.

  • split (bool, optional) – Whether to return results as a list.

  • weighted (bool, optional) – Whether to weight scores by word frequency. Setting this to False can be useful for shorter texts, and allows automatic thresholding.

  • threshold (float or 'auto', optional) – Minimum score for returned keywords. 'auto' calculates the threshold as n_blocks / (n_blocks + 1.0) + 1e-8; use 'auto' with weighted=False.
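
To make the 'auto' threshold formula concrete, the following sketch evaluates it for a few block counts. The helper name auto_threshold is hypothetical and only reproduces the expression given in the parameter description.

```python
def auto_threshold(n_blocks):
    """Hypothetical helper reproducing the documented 'auto' formula."""
    return n_blocks / (n_blocks + 1.0) + 1e-8


# The threshold rises towards 1 as the number of blocks grows.
for n_blocks in (1, 10, 100):
    print(n_blocks, auto_threshold(n_blocks))
```

With a single block the threshold is just above 0.5, and it approaches 1 for texts that span many blocks.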

Returns

  • results (str) – newline-separated keywords if split == False and scores == False, OR

  • results (list(str)) – list of keywords if split == True and scores == False, OR

  • results (list(tuple(str, float))) – list of (keyword, score) tuples if scores == True.

  • Results are returned in descending order of score regardless of the format.
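
The return formats listed above can be sketched with a small standalone helper. This is an illustration rather than gensim's code: format_keywords is a hypothetical name, the scores are made-up values, and the sketch assumes that scores=True takes precedence over split.

```python
def format_keywords(weights, scores=False, split=False):
    """Hypothetical helper: shape (keyword, score) pairs like the documented results."""
    # Results are ordered by descending score in every format.
    ranked = sorted(weights, key=lambda pair: pair[1], reverse=True)
    if scores:
        # Assumption: scores=True wins over split and returns tuples.
        return ranked
    keywords = [word for word, _ in ranked]
    if split:
        return keywords
    return "\n".join(keywords)


# Illustrative scores only, not output of the real algorithm.
weights = [("whale", 0.4), ("ship", 0.9), ("sea", 0.6)]
print(format_keywords(weights))               # newline-separated string
print(format_keywords(weights, split=True))   # list of keywords
print(format_keywords(weights, scores=True))  # list of (keyword, score) tuples
```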

Note

This algorithm looks for keywords that contribute to the structure of the text on scales of blocksize words or larger. It is suitable for extracting keywords representing the major themes of long texts.
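
One ingredient of this kind of analysis is the Shannon entropy of each word's distribution across blocks. A word spread evenly over the text has high entropy, while a word concentrated in a few blocks has low entropy. The sketch below computes only that quantity from a block-count array. It is not the full Montemurro and Zanette score, which also compares against the entropy expected for a randomly shuffled text, and the name block_entropy is hypothetical.

```python
import numpy as np


def block_entropy(counts):
    """Hypothetical helper: entropy (in bits) of each word's spread over blocks.

    counts has one row per block and one column per word.
    Assumes every word occurs at least once in the counted text.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0)
    p = counts / totals  # fraction of each word's occurrences per block
    # Treat 0 * log2(0) as 0 by substituting 1 where p is zero.
    safe_p = np.where(p > 0, p, 1.0)
    return -(p * np.log2(safe_p)).sum(axis=0)


# Two blocks, two words: the first word is spread evenly,
# the second occurs only in the second block.
example_counts = [[2, 0],
                  [2, 4]]
entropies = block_entropy(example_counts)
print(entropies)
```

In this toy input the evenly spread word reaches the maximum entropy for two blocks (1 bit), while the clustered word has zero entropy.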

References

[1]

Marcello A. Montemurro, Damian Zanette, “Towards the quantification of the semantic information encoded in written language”, Advances in Complex Systems, Volume 13, Issue 2 (2010), pp. 135-153, DOI: 10.1142/S0219525910002530, https://arxiv.org/abs/0907.1558