sklearn_api.hdp – Scikit-learn wrapper for Hierarchical Dirichlet Process model
Scikit-learn interface for HdpModel.
Follows scikit-learn API conventions to facilitate using gensim along with scikit-learn.
Examples
>>> from gensim.test.utils import common_dictionary, common_corpus
>>> from gensim.sklearn_api import HdpTransformer
>>>
>>> # Let's extract the topic distribution of each document
>>> model = HdpTransformer(id2word=common_dictionary)
>>> distr = model.fit_transform(common_corpus)
-
class gensim.sklearn_api.hdp.HdpTransformer(id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None, random_state=None)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Base HDP module, wraps HdpModel. The inner workings of this class depend heavily on Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
- Parameters
id2word (Dictionary, optional) – Mapping between a word's ID and the word itself in the vocabulary.
max_chunks (int, optional) – Upper bound on how many chunks to process. If the corpus does not contain enough chunks, processing wraps around to the beginning of the corpus in another pass.
max_time (int, optional) – Upper bound on time in seconds for which model will be trained.
chunksize (int, optional) – Number of documents to be processed by the model in each mini-batch.
kappa (float, optional) – Learning rate, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
tau (float, optional) – Slow down parameter, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
K (int, optional) – Second level truncation level, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
T (int, optional) – Top level truncation level, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
alpha (int, optional) – Second level concentration, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
gamma (int, optional) – First level concentration, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
eta (float, optional) – The topic Dirichlet, see Wang, Paisley, Blei: “Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011)”.
scale (float, optional) – Weights information from the mini-chunk of corpus to calculate rhot.
var_converge (float, optional) – Lower bound on the right side of convergence. Used when updating variational parameters for a single document.
outputdir (str, optional) – Path to a directory where topic and options information will be stored.
random_state (int, optional) – Seed used to create a RandomState. Useful for obtaining reproducible results.
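As a rough illustration of how kappa, tau and scale interact, the sketch below assumes the step-size schedule described in the Wang, Paisley, Blei paper, rho_t = scale * (tau + t)^(-kappa), where t counts the mini-batches processed so far. The function name `step_size` is hypothetical and not part of gensim; the default values mirror the constructor signature.

```python
def step_size(t, kappa=1.0, tau=64.0, scale=1.0):
    """Assumed online-update weight for mini-batch number t (0-based)."""
    # Larger tau down-weights early mini-batches; larger kappa makes the
    # weight decay faster as more mini-batches are processed.
    return scale * (tau + t) ** (-kappa)


# Weights for the first few mini-batches with the default settings.
weights = [step_size(t) for t in range(3)]
```

Under this assumption, every later mini-batch contributes slightly less to the global parameters than the one before it.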
-
fit(X, y=None)
Fit the model according to the given training data.
- Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
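For readers unfamiliar with the BOW format, the sketch below builds such a collection by hand: each document becomes a list of (token_id, count) pairs. The toy vocabulary and the helper `to_bow` are made up for illustration; in practice a gensim Dictionary would normally produce these pairs.

```python
from collections import Counter

# Hypothetical toy vocabulary mapping each word to an integer ID.
token2id = {"human": 0, "computer": 1, "graph": 2}


def to_bow(tokens):
    """Convert a list of tokens to sorted (token_id, count) pairs."""
    # Words missing from the vocabulary are silently skipped.
    counts = Counter(token2id[tok] for tok in tokens if tok in token2id)
    return sorted(counts.items())


docs = [["human", "computer", "human"], ["graph", "unknown"]]
X = [to_bow(doc) for doc in docs]
```

The resulting `X` is an iterable of lists of (int, number) pairs, which is the first of the two accepted input shapes.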
- Returns
The trained model.
- Return type
-
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X ({array-like, sparse matrix, dataframe} of shape (n_samples, n_features)) –
y (ndarray of shape (n_samples,), default=None) – Target values.
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
-
get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
params – Parameter names mapped to their values.
- Return type
mapping of string to any
-
partial_fit(X)
Train model over a potentially incomplete set of documents.
Uses the parameters set in the constructor. This method can be used in two ways:
* On an unfitted model, in which case the model is initialized and trained on X.
* On an already fitted model, in which case the model is updated by X.
- Parameters
X ({iterable of list of (int, number), scipy.sparse matrix}) – A collection of documents in BOW format used for training the model.
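One way to use incremental training is to feed the model one mini-batch at a time. The sketch below shows only the batching step in plain Python; `iter_chunks` is a hypothetical helper, not a gensim function, and each yielded chunk is the kind of object that could be passed as X.

```python
from itertools import islice


def iter_chunks(corpus, chunksize):
    """Yield successive lists of at most `chunksize` documents."""
    it = iter(corpus)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


# Toy BOW corpus with five documents, split into batches of two.
corpus = [[(0, 1)], [(1, 2)], [(0, 1), (2, 1)], [(2, 3)], [(1, 1)]]
chunks = list(iter_chunks(corpus, chunksize=2))
```

The first chunk would initialize an unfitted model, and every later chunk would update it.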
- Returns
The trained model.
- Return type
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
**params (dict) – Estimator parameters.
- Returns
self – Estimator instance.
- Return type
object
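The <component>__<parameter> convention can be illustrated with a small scikit-learn pipeline. In this sketch, standard scikit-learn estimators stand in for the transformer documented here, and the step names "scale" and "clf" are arbitrary choices made for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in components; any estimator in a pipeline is addressed the same way.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# "clf__C" means: parameter C of the step named "clf".
returned = pipe.set_params(clf__C=0.5, scale__with_mean=False)

# With deep=True, nested parameters are reported under the same key format.
params = pipe.get_params(deep=True)
```

Because set_params returns the estimator itself, calls can be chained, for example when configuring a pipeline inline.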
-
transform(docs)
Infer a matrix of topic distributions for the given documents in BOW format, where a_ij indicates (topic_i, topic_probability_j).
- Parameters
docs ({iterable of list of (int, number), list of (int, number)}) – Document or sequence of documents in BOW format.
- Returns
Topic distribution for docs.
- Return type
numpy.ndarray of shape [len(docs), num_topics]
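To picture the returned array, the sketch below converts per-document sparse topic lists into a dense matrix with one row per document and one column per topic. The topic count and probabilities are made up, and the loop is only an illustration of the documented shape, not the library's implementation.

```python
import numpy as np

num_topics = 4  # hypothetical number of topics

# Made-up sparse topic distributions: one list of (topic_id, prob) per document.
sparse_docs = [[(0, 0.7), (2, 0.3)], [(3, 1.0)]]

# Topics absent from a document's list keep probability 0.
dense = np.zeros((len(sparse_docs), num_topics))
for row, doc in enumerate(sparse_docs):
    for topic_id, prob in doc:
        dense[row, topic_id] = prob
```

Here `dense` has shape (2, 4), matching [len(docs), num_topics].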