Gensim: Topic modelling for humans

Library includes

Scalable statistical semantics

Analyze plain-text documents for semantic structure

Retrieve semantically similar documents

Features of the Gensim Python library

Scalability

Gensim can process large, web-scale corpora, using incremental online training algorithms. There is no need for the whole input corpus to reside fully in RAM at any one time.
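The streaming idea can be sketched in plain Python 3: a corpus object that yields one bag-of-words document at a time, so only the current document is held in memory. This is an illustration rather than gensim's own code. The class name StreamedCorpus, the whitespace tokenizer and the one-document-per-line file layout are assumptions made for the example.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words document at a time from a text file.

    Hypothetical helper: assumes one document per line and whitespace tokens.
    """

    def __init__(self, path):
        self.path = path
        self.token2id = {}  # vocabulary grows as new tokens are seen

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:  # only the current line is held in memory
                counts = Counter(line.lower().split())
                # Sparse document: sorted list of (token_id, count) pairs.
                yield sorted(
                    (self.token2id.setdefault(token, len(self.token2id)), n)
                    for token, n in counts.items()
                )


def demo():
    """Write two tiny documents to a temporary file and stream them back."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("human machine interface machine\nmachine learning\n")
    try:
        # list() materializes the stream only because the demo is tiny.
        return list(StreamedCorpus(tmp.name))
    finally:
        os.remove(tmp.name)
```

Because the object is re-iterable, it can be passed over several times, which is what incremental training needs. The list of (id, count) pairs per document is the sparse format that gensim expects from a corpus.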

Platform independent

Being pure Python, gensim runs on Linux, Windows and OS X, as well as any other platform that supports Python and NumPy.

Robust

Gensim has been in use in various systems by various people and organizations for over 4 years. It's well past the initial “look mom, I published a script” stage of open-source projects.

Open source

The GNU LGPL license allows both personal and commercial use, provided any modifications to gensim itself are in turn open-sourced. Other modes (dual licensing) are also possible.

Efficient implementations

The core algorithms in gensim use highly optimized math routines. Gensim also contains distributed versions of several algorithms, intended to speed up processing and retrieval on machine clusters.

Converters & I/O formats

Gensim contains memory-efficient implementations of several popular data formats: Matrix Market, SVMlight, Blei's LDA-C, and others. These can be used for input, output, or to convert from one format to another.

Similarity queries

As a natural next step to topic modelling, gensim also contains code for fast indexing of documents in their semantic representation, and retrieval of topically similar documents.

Support

Gensim is supported and maintained by means of community effort. See the support page for information on using the mailing list, tutorials, FAQ, code hosting and instructions for contributors.

Installation

Quick install

Run in your terminal (recommended):

pip install --upgrade gensim

or, alternatively for conda environments:

conda install -c conda-forge gensim

That's it! Congratulations, you can proceed to the tutorials.
In case that failed, make sure you're installing into a writeable location.

Code dependencies

Gensim runs on Linux, Windows and Mac OS X, and should run on any other platform that supports Python 2.7 or 3.5+ and NumPy. Gensim depends on the following software:

Python, tested with versions 2.7, 3.5, 3.6 and 3.7.
NumPy for number crunching.
smart_open for transparently opening files on remote storage or compressed files.

Testing Gensim

Gensim uses continuous integration, automatically running a full test suite on each pull request with the following services:

CI service	Task
Travis	Run tests on Linux and check code style
AppVeyor	Run tests on Windows
CircleCI	Build documentation

Problems?

Use the Gensim discussion group for questions and troubleshooting. See the support page for commercial support.

Who is using Gensim?

Doing something interesting with Gensim? Ask to be featured here.

“Here at Tailwind, we use Gensim to help our customers post interesting and relevant content to Pinterest. No fuss, no muss. Just fast, scalable language processing.”

Waylon Flinn
Tailwind

“We are using Gensim every day. Over 15 thousand times per day to be precise. Gensim’s LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it’s all about. It simply works.”

Andrius Butkus
Issuu

“Gensim hits the sweetest spot of being a simple yet powerful way to access some incredibly complex NLP goodness.”

Alan J. Salmoni
Roistr.com

“I used Gensim at Ghent university. I found it easy to build prototypes with various models, extend it with additional features and gain empirical insights quickly. It's a reliable library that can be used beyond prototyping too.”

Dieter Plaetinck
IBCN group

“We used Gensim in several text mining projects at Sports Authority. The data were from free-form text fields in customer surveys, as well as social media sources. Having Gensim significantly sped our time to development, and it is still my go-to package for topic modeling with large retail data sets.”

Josh Hemann
Sports Authority

“Semantic analysis is a hot topic in online marketing, but there are few products on the market that are truly powerful. Gensim is undoubtedly one of the best frameworks that efficiently implement algorithms for statistical analysis. Few products, even commercial, have this level of quality.”

Bruno Champion
DynAdmic

“Based on our experience with Gensim on DML-CZ, we naturally opted to use it on a much bigger scale for similarity of fulltexts of scientific papers in the European Digital Mathematics Library. In evaluation with other approaches, Gensim became a clear winner, especially because of speed, scalability and ease of use.”

Petr Sojka
EuDML

“We have been using Gensim in several DTU courses related to digital media engineering and find it immensely useful as the tutorial material provides students an excellent introduction to quickly understand the underlying principles in topic modeling based on both LSA and LDA.”

Michael Kai Petersen
Technical University of Denmark

Forever-free open-source

Gensim is licensed under the OSI-approved GNU LGPLv2.1 license and can be downloaded either from its GitHub repository or from the Python Package Index.
