Topic modelling for humans
Gensim is a FREE Python library
from gensim.models.word2vec import Word2Vec
import gensim.downloader

# Train NLP models efficiently from your own text data.
# Gensim uses data streaming, without loading the whole corpus into RAM,
# so you can process data of any size.
corpus = gensim.downloader.load('text8')
model = Word2Vec(sentences=corpus, vector_size=100, workers=4)  # train word2vec model

# Use the trained model for common NLP tasks: similarity, input for deep learning, etc.
word_vectors = model.wv

word_vectors['dog']
# array([ 0.4946671 , …, -2.0474846 ], dtype=float32)

word_vectors.most_similar('dog', topn=5)
# [('cat', 0.8329264521598816), ('hound', 0.7972617745399475), …, ('cow', 0.7768632769584656)]

# Or, download an already pre-trained model:
glove_vectors = gensim.downloader.load('glove-twitter-25')
glove_vectors.most_similar('twitter', topn=5)
# [('facebook', 0.948005199432373), ('tweet', 0.9403423070907593), …, ('chat', 0.8964964747428894)]

Library includes

Features of the Gensim Python library

Scalability


Gensim can process large, web-scale corpora, using incremental online training algorithms. There is no need for the whole input corpus to reside fully in RAM at any one time.
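
The streaming idea can be sketched in plain Python: an iterable that reads one document at a time from disk, so only the current line is held in memory. The class name `StreamingCorpus`, the whitespace tokenizer and the one-document-per-line file layout are assumptions made for illustration; they are not part of Gensim's API.

```python
import os
import tempfile


class StreamingCorpus:
    """Iterate over a text file, yielding one tokenized document per line.

    Hypothetical helper for illustration; it is not a Gensim class.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated
        # several times (training usually needs more than one pass).
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


# Small demo file standing in for a corpus that would not fit in RAM.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Human machine interface\n\nGraph minors survey\n")
    demo_path = tmp.name

corpus = StreamingCorpus(demo_path)
first_pass = list(corpus)   # materialized here only to inspect the demo
second_pass = list(corpus)  # a second pass works because __iter__ re-opens
os.remove(demo_path)
```

An object of this kind could be passed wherever a restartable iterable of token lists is expected, for example as the `sentences` argument of `Word2Vec`.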

Platform independent


Being pure Python, gensim runs on Linux, Windows and OS X, as well as any other platform that supports Python and NumPy.

Robust


Gensim has been in use in various systems by various people and organizations for over 4 years. It's well past the initial “look mom, I published a script” stage of open-source projects.

Open source


The GNU LGPL license allows both personal and commercial use, provided any modifications to gensim itself are in turn open-sourced. Other modes (dual licensing) are also possible.

Efficient implementations


The core algorithms in gensim use highly optimized math routines. Gensim also contains a distributed version of several algorithms, intended to speed up processing and retrieval on machine clusters.

Converters & I/O formats


Gensim contains memory-efficient implementations of several popular data formats, including Matrix Market, SVMlight and Blei's LDA-C. These can be used for input, output, or to convert between one another.
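
To give a feel for what such a format looks like, here is a minimal sketch that writes and reads a single SVMlight line in plain Python. The helpers `bow_to_svmlight` and `svmlight_to_bow` are hypothetical names used only for this illustration, and the default label of 0 is an assumption. In practice one would rely on Gensim's own corpus classes rather than on code like this.

```python
def bow_to_svmlight(bow, label=0):
    """Format a sparse bag-of-words vector as one SVMlight line.

    `bow` is a list of (feature_id, weight) pairs with 0-based ids.
    SVMlight feature ids start at 1 and are listed in ascending order,
    so ids are sorted and shifted by one. Hypothetical helper.
    """
    pairs = " ".join(f"{fid + 1}:{weight}" for fid, weight in sorted(bow))
    return f"{label} {pairs}"


def svmlight_to_bow(line):
    """Parse one SVMlight line back into (label, bag-of-words)."""
    label, *pairs = line.split()
    bow = []
    for pair in pairs:
        fid, weight = pair.split(":")
        bow.append((int(fid) - 1, float(weight)))  # back to 0-based ids
    return int(label), bow


document = [(2, 1.0), (0, 2.0), (5, 0.5)]
line = bow_to_svmlight(document)
label, restored = svmlight_to_bow(line)
```

Because only non-zero features are stored, a document costs space proportional to the number of distinct terms it contains, not to the size of the vocabulary.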

Similarity queries


As a natural next step to topic modelling, gensim also contains code for fast indexing of documents in their semantic representation, and retrieval of topically similar documents.
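
The retrieval step can be illustrated with a brute-force sketch: documents are represented as sparse topic vectors and ranked against a query. Cosine similarity is assumed here as the measure because it is a common choice; the function names and the toy vectors are made up for the example. A real index would avoid the linear scan shown below.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as {id: weight}."""
    dot = sum(weight * vec_b.get(fid, 0.0) for fid, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, index, topn=2):
    """Rank indexed documents by similarity to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Toy "semantic" vectors: keys are topic ids, values are topic weights.
index = {
    "doc_a": {0: 0.9, 1: 0.1},
    "doc_b": {0: 0.1, 1: 0.9},
    "doc_c": {0: 0.5, 1: 0.5},
}
ranking = most_similar({0: 1.0}, index, topn=2)
```

Here the query is a document that is entirely about topic 0, so the document dominated by topic 0 ranks first and the evenly mixed one second.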

Support


Gensim is supported and maintained by means of community effort. See the support page for information on using the mailing list, tutorials, FAQ, code hosting and instructions for contributors.

Installation


Quick install

Run in your terminal (recommended):

pip install --upgrade gensim

or, alternatively for conda environments:

conda install -c conda-forge gensim

That's it! Congratulations, you can proceed to the tutorials.
In case that failed, make sure you're installing into a writeable location.

Code dependencies

Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 2.7 or 3.5+ and NumPy. Gensim depends on the following software:

  • Python, tested with versions 2.7, 3.5, 3.6 and 3.7.
  • NumPy for number crunching.
  • smart_open for transparently opening files on remote storage or compressed files.
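
One way to confirm that Gensim and the dependencies listed above are installed in the active environment is to query the package metadata. This is a sketch that assumes Python 3.8 or newer (for `importlib.metadata`); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as published on the Python Package Index.
report = {
    name: installed_version(name) for name in ("gensim", "numpy", "smart_open")
}
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

If any entry reports "not installed", re-running the install command inside the same environment is usually the first thing to try.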

Testing Gensim

Gensim uses continuous integration, automatically running a full test suite on each pull request with the following services:

CI service   Task
Travis       Run tests on Linux and check code style
AppVeyor     Run tests on Windows
CircleCI     Build documentation

Problems?

Use the Gensim discussion group for questions and troubleshooting. See the support page for commercial support.

Who is using Gensim?

Doing something interesting with Gensim? Ask to be featured here.

  • “Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing.”

    Waylon Flinn
    Tailwind
  • “We are using Gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works.”

    Andrius Butkus
    Issuu
  • “Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness.”

    Alan J. Salmoni
    Roistr.com
  • “I used Gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It's a reliable library that can be used beyond prototyping too.”

  • “We used Gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having Gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets.”

    Josh Hemann
    Sports Authority
  • “Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality.”

    Bruno Champion
    DynAdmic
  • “Based on our experience with Gensim on DML-CZ, we naturally opted to use it on a much bigger scale for similarity of fulltexts of scientific papers in the European Digital Mathematics Library. In evaluation with other approaches, Gensim became a clear winner, especially because of speed, scalability and ease of use.”

    Petr Sojka
    EuDML
  • “We have been using Gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA.”

 

Forever-free open-source

Gensim is licensed under the OSI-approved GNU LGPLv2.1 license
and can be downloaded either from its GitHub repository or from the Python Package Index.
