summarization.bm25 – BM25 ranking function
This module contains a function for computing rank scores for documents in a corpus and a helper class, BM25, used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].
- [1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
- [2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
-
class gensim.summarization.bm25.BM25(corpus, k1=1.5, b=0.75, epsilon=0.25)
Bases: object
Implementation of the BM25 (Best Matching 25) ranking function.
-
corpus_size
Size of the corpus (number of documents).
- Type
int
-
avgdl
Average document length in the corpus.
- Type
float
-
doc_freqs
Term frequencies for each document in the corpus, one dictionary per document, with words as keys and frequencies as values.
- Type
list of dicts of int
-
idf
Dictionary of inverse document frequencies for the whole corpus, with words as keys and inverse document frequencies as values.
- Type
dict
-
doc_len
List of document lengths.
- Type
list of int
- Parameters
corpus (list of list of str) – Given corpus.
k1 (float) – Constant controlling term frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling the effect of document length relative to the average document length. The larger b is, the more strongly a document's length (compared to the average) affects its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it prevents negative idf values. A negative idf means that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, because an identical document would then score lower than an almost identical one from which the term has been removed. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
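The roles of k1 and b can be illustrated with the term-frequency part of the standard Okapi BM25 formula. This is a minimal sketch, not the module's code: `bm25_term_weight` is a hypothetical helper, the idf factor is left out, and the document lengths are made up.

```python
def bm25_term_weight(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Term-frequency component of Okapi BM25, without the idf factor."""
    # Length normalisation: equals 1 when doc_len == avgdl, grows for longer documents.
    norm = 1.0 - b + b * doc_len / avgdl
    return tf * (k1 + 1.0) / (tf + k1 * norm)


# Saturation: the gain shrinks even though tf doubles at every step,
# and the weight never reaches k1 + 1.
weights = [bm25_term_weight(tf, doc_len=10, avgdl=10) for tf in (1, 2, 4, 8, 16)]
gains = [later - earlier for earlier, later in zip(weights, weights[1:])]

# Length normalisation: the same tf counts for less in a longer document.
short_doc = bm25_term_weight(2, doc_len=5, avgdl=10)
long_doc = bm25_term_weight(2, doc_len=20, avgdl=10)

# With b = 0 the document length has no effect.
no_norm_short = bm25_term_weight(2, doc_len=5, avgdl=10, b=0.0)
no_norm_long = bm25_term_weight(2, doc_len=20, avgdl=10, b=0.0)
```

In this sketch a larger k1 delays saturation, while a larger b increases the penalty for documents longer than average.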
-
get_score(document, index)
Computes the BM25 score of the given document in relation to the corpus item selected by index.
- Parameters
document (list of str) – Document to be scored.
index (int) – Index of the corpus document that document is scored against.
- Returns
BM25 score.
- Return type
float
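As a rough sketch of what such a score looks like, the snippet below sums the standard Okapi BM25 contribution of every word of the scored document that also occurs in the selected corpus document. The function `bm25_score`, its signature, and the toy idf values are assumptions made for illustration; they are not the class's actual interface.

```python
from collections import Counter


def bm25_score(query, indexed_doc, idf, avgdl, k1=1.5, b=0.75):
    """Sum the BM25 contribution of every query word found in indexed_doc."""
    term_freqs = Counter(indexed_doc)
    # Length-dependent part of the denominator, shared by all terms.
    length_norm = k1 * (1.0 - b + b * len(indexed_doc) / avgdl)
    score = 0.0
    for word in query:
        tf = term_freqs.get(word, 0)
        if tf:
            score += idf.get(word, 0.0) * tf * (k1 + 1.0) / (tf + length_norm)
    return score


# Toy idf values chosen for illustration only.
idf = {"cat": 0.1, "space": 0.5, "dog": 0.5}
doc = ["cat", "outer", "space"]

score_match = bm25_score(["space", "cat"], doc, idf, avgdl=3.0)
score_miss = bm25_score(["dog"], doc, idf, avgdl=3.0)
```

Words that do not occur in the corpus document contribute nothing, so a document with no shared words scores 0.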
-
get_scores(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus.
- Parameters
document (list of str) – Document to be scored.
- Returns
BM25 scores.
- Return type
list of float
-
get_scores_bow(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus.
- Parameters
document (list of str) – Document to be scored.
- Returns
BM25 scores.
- Return type
list of float
-
-
gensim.summarization.bm25.get_bm25_weights(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)
Returns BM25 scores (weights) of the documents in the corpus. Each document is scored against every document in the given corpus.
- Parameters
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing bm25.
k1 (float) – Constant controlling term frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling the effect of document length relative to the average document length. The larger b is, the more strongly a document's length (compared to the average) affects its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it prevents negative idf values. A negative idf means that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, because an identical document would then score lower than an almost identical one from which the term has been removed. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
- Returns
BM25 scores.
- Return type
list of list of float
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
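The effect of epsilon can be traced by hand on the same example corpus, where "cat" occurs in two of the three documents. The sketch below assumes the common idf form log(N - n + 0.5) - log(n + 0.5) and assumes that negative values are replaced by epsilon times the average idf; both are assumptions for illustration, and the module's actual flooring rule may differ.

```python
import math

corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]
epsilon = 0.25

# Number of documents containing each word.
doc_count = {}
for document in corpus:
    for word in set(document):
        doc_count[word] = doc_count.get(word, 0) + 1

n_docs = len(corpus)
# Raw idf: negative when a word occurs in more than half of the documents.
raw_idf = {
    word: math.log(n_docs - n + 0.5) - math.log(n + 0.5)
    for word, n in doc_count.items()
}
average_idf = sum(raw_idf.values()) / len(raw_idf)

# Assumed flooring rule: negative values are replaced by epsilon * average idf.
floor = epsilon * average_idf
idf = {word: (value if value >= 0 else floor) for word, value in raw_idf.items()}
```

Here only "cat" has a negative raw idf, so it is the only word whose value is replaced by the small positive floor.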
-
gensim.summarization.bm25.iter_bm25_bow(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)
Yields BM25 scores (weights) of the documents in the corpus. Each document is scored against every document in the given corpus.
- Parameters
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing bm25.
k1 (float) – Constant controlling term frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling the effect of document length relative to the average document length. The larger b is, the more strongly a document's length (compared to the average) affects its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as a floor value for the idf of a term in the corpus. When epsilon is positive, it prevents negative idf values. A negative idf means that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, because an identical document would then score lower than an almost identical one from which the term has been removed. Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
- Yields
list of (index, float) – BM25 scores in bag of weights format.
Examples
>>> from gensim.summarization.bm25 import iter_bm25_bow
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = iter_bm25_bow(corpus, n_jobs=-1)
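The "bag of weights" format pairs each score with the index of the corpus document it refers to. As a rough illustration, a dense row of scores could be converted as follows. `to_bag_of_weights` is a hypothetical helper, the scores are made-up values rather than real output, and keeping only strictly positive scores is an assumption.

```python
def to_bag_of_weights(scores):
    """Convert a dense list of scores into sparse (index, score) pairs."""
    # Assumption: only strictly positive scores are kept in the sparse form.
    return [(index, score) for index, score in enumerate(scores) if score > 0]


# Made-up scores of one document against a three-document corpus.
dense_row = [1.28, 0.09, 0.0]
sparse_row = to_bag_of_weights(dense_row)
```

A sparse representation like this saves space when most documents share no words with the one being scored.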