summarization.bm25 – BM25 ranking function

This module contains functions for computing rank scores for documents in a corpus, along with the helper class BM25 used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].

[1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf

[2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
class gensim.summarization.bm25.BM25(corpus, k1=1.5, b=0.75, epsilon=0.25)

Bases: object

Implementation of the BM25 (Best Matching 25) ranking function.

corpus_size

Size of corpus (number of documents).

Type

int

avgdl

Average length of a document in the corpus.

Type

float

doc_freqs

Term frequencies for each document in the corpus, one dictionary per document. Words are used as keys and frequencies as values.

Type

list of dicts of int

idf

Dictionary with inverse document frequencies for the whole corpus. Words are used as keys and inverse document frequencies as values.

Type

dict

doc_len

List of document lengths.

Type

list of int
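As a rough illustration of how the attributes above relate to a corpus, the following sketch derives them in plain Python. The helper name corpus_statistics is hypothetical and is not part of gensim; it only mirrors the attribute descriptions given here and omits idf.

```python
from collections import Counter


def corpus_statistics(corpus):
    """Hypothetical helper: derive BM25-style corpus statistics.

    Returns (corpus_size, avgdl, doc_freqs, doc_len) for a corpus given
    as a list of token lists.
    """
    # One {word: count} dictionary per document.
    doc_freqs = [dict(Counter(document)) for document in corpus]
    # Number of tokens in each document.
    doc_len = [len(document) for document in corpus]
    corpus_size = len(corpus)
    # Average number of tokens per document.
    avgdl = sum(doc_len) / corpus_size
    return corpus_size, avgdl, doc_freqs, doc_len


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
corpus_size, avgdl, doc_freqs, doc_len = corpus_statistics(corpus)
```

For this corpus the document lengths are 4, 3 and 2 tokens, so the average length is 3.0.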

Parameters
  • corpus (list of list of str) – Given corpus.

  • k1 (float) – Constant that controls term frequency saturation. Once saturation is reached, additional occurrences of a term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • b (float) – Constant that controls how strongly document length, relative to the average document length, affects the score. The larger b is, the stronger the length normalization, so documents longer than average are penalized more. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an otherwise identical document without that term would score higher. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
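To make the role of epsilon more concrete, here is a minimal sketch of an idf table with a floor. It assumes the common Okapi form idf = log(N - n + 0.5) - log(n + 0.5) and assumes that negative values are replaced by epsilon times the average idf. Both are assumptions made for illustration, as is the helper name idf_with_floor; the actual implementation may differ.

```python
import math


def idf_with_floor(doc_counts, corpus_size, epsilon=0.25):
    """Hypothetical helper: idf per term, with negative values floored.

    doc_counts maps each term to the number of documents containing it.
    """
    # Assumed Okapi idf: negative when a term occurs in more than half
    # of the documents.
    idf = {
        term: math.log(corpus_size - n + 0.5) - math.log(n + 0.5)
        for term, n in doc_counts.items()
    }
    # Assumed floor: a fraction (epsilon) of the average idf.
    average_idf = sum(idf.values()) / len(idf)
    floor = epsilon * average_idf
    return {term: (value if value >= 0 else floor) for term, value in idf.items()}


# "cat" occurs in 2 of 3 documents; every other term occurs in 1.
doc_counts = {"black": 1, "cat": 2, "white": 1, "outer": 1,
              "space": 1, "wag": 1, "dog": 1}
weights = idf_with_floor(doc_counts, corpus_size=3)
```

In this sketch "cat" would have a negative raw idf, so it receives the small positive floor instead, while rarer terms keep their original idf.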

get_score(document, index)

Computes the BM25 score of the given document in relation to the corpus item selected by index.

Parameters
  • document (list of str) – Document to be scored.

  • index (int) – Index of the corpus document that document is scored against.

Returns

BM25 score.

Return type

float
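The effect of k1 and b on a single score can be sketched with the standard Okapi BM25 term formula. The function bm25_score and its arguments are hypothetical names chosen for this example, not the library's API; the idf values are passed in as a plain dictionary.

```python
def bm25_score(query, term_freqs, idf, doc_len, avgdl, k1=1.5, b=0.75):
    """Hypothetical helper: score `query` against one document.

    query is a list of str, term_freqs maps words of the target
    document to their counts, and idf maps words to idf values.
    """
    # Length normalization: grows with doc_len / avgdl when b > 0.
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    score = 0.0
    for word in query:
        tf = term_freqs.get(word, 0)
        if tf == 0:
            # Words absent from the target document contribute nothing.
            continue
        # Saturating term weight: bounded above by idf * (k1 + 1).
        score += idf.get(word, 0.0) * tf * (k1 + 1) / (tf + length_norm)
    return score


idf = {"cat": 1.0}
once = bm25_score(["cat"], {"cat": 1}, idf, doc_len=3, avgdl=3.0)
twice = bm25_score(["cat"], {"cat": 2}, idf, doc_len=3, avgdl=3.0)
many = bm25_score(["cat"], {"cat": 10}, idf, doc_len=3, avgdl=3.0)
longer = bm25_score(["cat"], {"cat": 1}, idf, doc_len=6, avgdl=3.0)
```

With idf fixed at 1.0 and a document of average length, one occurrence scores 1.0, and further occurrences approach but never reach k1 + 1 = 2.5. The same single occurrence in a document twice the average length scores lower, which is the length effect controlled by b.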

get_scores(document)

Computes and returns the BM25 scores of the given document in relation to every item in the corpus.

Parameters

document (list of str) – Document to be scored.

Returns

BM25 scores.

Return type

list of float

get_scores_bow(document)

Computes and returns the BM25 scores of the given document in relation to every item in the corpus.

Parameters

document (list of str) – Document to be scored.

Returns

BM25 scores.

Return type

list of float

gensim.summarization.bm25.get_bm25_weights(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)

Returns BM25 scores (weights) of the documents in the corpus. Each document is weighted against every document in the given corpus.

Parameters
  • corpus (list of list of str) – Corpus of documents.

  • n_jobs (int) – The number of processes to use for computing BM25.

  • k1 (float) – Constant that controls term frequency saturation. Once saturation is reached, additional occurrences of a term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • b (float) – Constant that controls how strongly document length, relative to the average document length, affects the score. The larger b is, the stronger the length normalization, so documents longer than average are penalized more. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an otherwise identical document without that term would score higher. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.

Returns

BM25 scores.

Return type

list of list of float

Examples

>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
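The shape of an all-pairs result like the one described above can be sketched without gensim. This example assumes that row i holds the scores of document i, used as the query, against every corpus item j. The names pairwise_weights and overlap are hypothetical, and overlap is a simple stand-in scorer, not BM25.

```python
def pairwise_weights(corpus, score):
    """Hypothetical helper: N x N list with score(corpus[i], j) at [i][j]."""
    return [
        [score(query, index) for index in range(len(corpus))]
        for query in corpus
    ]


corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]


def overlap(query, index):
    # Stand-in scorer (not BM25): counts query tokens that also occur
    # in the target document.
    target = set(corpus[index])
    return float(sum(1 for word in query if word in target))


result = pairwise_weights(corpus, overlap)
```

Here result has one row per document and one column per corpus item, so result[0] is [4.0, 2.0, 0.0]: the first document matches itself on all four tokens, matches the second document on its two occurrences of "cat", and shares nothing with the third.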
gensim.summarization.bm25.iter_bm25_bow(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)

Yields BM25 scores (weights) of the documents in the corpus. Each document is weighted against every document in the given corpus.

Parameters
  • corpus (list of list of str) – Corpus of documents.

  • n_jobs (int) – The number of processes to use for computing BM25.

  • k1 (float) – Constant that controls term frequency saturation. Once saturation is reached, additional occurrences of a term add significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • b (float) – Constant that controls how strongly document length, relative to the average document length, affects the score. The larger b is, the stronger the length normalization, so documents longer than average are penalized more. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.

  • epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an otherwise identical document without that term would score higher. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.

Yields

list of (index, float) – BM25 scores in bag of weights format.

Examples

>>> from gensim.summarization.bm25 import iter_bm25_bow
>>> corpus = [
...     ["black", "cat", "white", "cat"],
...     ["cat", "outer", "space"],
...     ["wag", "dog"]
... ]
>>> result = iter_bm25_bow(corpus, n_jobs=-1)
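The "bag of weights" format yielded above pairs each score with the index of the corpus document it refers to. The sketch below converts dense score lists into that format. It assumes that non-positive scores are simply left out; that assumption, and the names to_bow and iter_bow, are illustrative and not taken from the library.

```python
def to_bow(scores):
    """Hypothetical helper: dense scores -> sparse (index, score) pairs."""
    # Assumption: entries with a non-positive score are dropped.
    return [(index, score) for index, score in enumerate(scores) if score > 0]


def iter_bow(rows):
    """Hypothetical helper: lazily yield one bag of weights per score row."""
    for scores in rows:
        yield to_bow(scores)


dense_rows = [
    [0.0, 1.2, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
bags = list(iter_bow(dense_rows))
```

Because iter_bow is a generator, each bag is only built when it is requested, which keeps memory use low when the corpus is large and most scores are zero.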