tmtoolkit – Text Mining and Topic Modeling Toolkit for Python

tmtoolkit is a set of tools for text mining and topic modeling with Python. It contains functions for text preprocessing such as lemmatization, stemming and POS tagging, especially for English and German texts. Preprocessing is done in parallel by using all available processors on your machine. The topic modeling features include topic model evaluation metrics, which allow you to calculate models with different parameters in parallel and compare them (e.g. in order to find the optimal number of topics and other parameters). Topic models can be generated in parallel for different corpora and/or parameter sets using the LDA implementations from lda, scikit-learn or gensim.

Features

Text preprocessing

Text preprocessing is built on top of NLTK and CLiPS pattern. Common features include:

tokenization
POS tagging (optimized for German and English)
lemmatization (optimized for German and English)
stemming
cleaning tokens
filtering tokens
generating n-grams
generating document-term-matrices
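
To make the n-gram feature from the list above concrete, here is a minimal sketch in plain Python. It is not tmtoolkit's implementation; the function name ngrams and the sample tokens are made up for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as a list of tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the token list; zip stops at the shortest
    # copy, so no incomplete window is produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['die', 'dokumente', 'sind', 'sehr', 'kurz']
bigrams = ngrams(tokens, 2)
print(bigrams)
```

A document with fewer than n tokens simply yields an empty list.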

Preprocessing is done in parallel by using all available processors on your machine, greatly improving processing speed as compared to sequential processing on a single processor.

Topic modeling

Topic models can be generated in parallel for different corpora and/or parameter sets using the LDA implementations from lda, scikit-learn or gensim. They can be evaluated and compared (for example in order to find the optimal number of topics) using several implemented metrics:

Pair-wise cosine distance method (Cao Juan et al. 2009)
KL divergence method (Arun et al. 2010)
Harmonic mean method (Griffiths, Steyvers 2004)
Probability of held-out documents (Wallach et al. 2009)
Model coherence (Mimno et al. 2011 or with metrics implemented in Gensim)
the log-likelihood or perplexity methods natively implemented in lda, sklearn or gensim
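
As a rough illustration of the idea behind the pair-wise cosine method, the sketch below averages the cosine similarity over all pairs of topics in a topic-word matrix. This is a simplified reading of the metric in plain Python, not tmtoolkit's code; the function names and the toy matrices are assumptions. Lower values mean the topics overlap less.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(topic_word):
    """Average cosine similarity over all pairs of topic rows (needs >= 2 topics)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# two toy models over a 4-word vocabulary (rows are topics)
distinct_topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
overlapping_topics = [[0.4, 0.3, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]]

print(mean_pairwise_cosine(distinct_topics))
print(mean_pairwise_cosine(overlapping_topics))
```

When comparing models with different numbers of topics, the model with the lower average would be preferred under this reading.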

Further features include:

plot evaluation results
export estimated document-topic and topic-word distributions to Excel
visualize topic-word distributions and document-topic distributions as word clouds (see lda_visualization Jupyter Notebook)
visualize topic-word distributions and document-topic distributions as heatmaps (see lda_visualization Jupyter Notebook)
integrate PyLDAVis to visualize results
coherence for individual topics (see model_coherence Jupyter Notebook)

Installation

The package is available on PyPI and can be installed via the Python package manager pip:

# recommended installation
pip install -U tmtoolkit[excel_export,plotting,wordclouds]

# minimal installation:
pip install -U tmtoolkit

The package is about 13 MB in size because it contains additional German language model data for POS tagging.

Upgrade notice: If upgrading from an older version to 0.6.0 or above, you will need to uninstall tmtoolkit first (run pip uninstall tmtoolkit), before re-installing (using one of the commands described above).

Optional packages

The following PyPI packages can be installed via pip:

for improved lemmatization of German texts: Pattern
for plotting/visualizations: matplotlib
for the word cloud functions: wordcloud and Pillow
for Excel export: openpyxl
for topic modeling, one of the LDA implementations: lda, scikit-learn or gensim
for additional topic model coherence metrics: gensim

For LDA evaluation metrics griffiths_2004 and held_out_documents_wallach09 it is necessary to install gmpy2 for multiple-precision arithmetic. This in turn requires installing some C header libraries for GMP, MPFR and MPC. On Debian/Ubuntu systems this is done with:

sudo apt install libgmp-dev libmpfr-dev libmpc-dev

After that, gmpy2 can be installed via pip.
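
To see why extra precision matters for a metric like the harmonic mean method, consider that it averages likelihoods whose logarithms are large negative numbers, so exponentiating them directly exceeds the range of ordinary floats. The sketch below shows the numerical issue and one common workaround (the log-sum-exp trick) in plain Python. It is an illustration only and not how tmtoolkit or gmpy2 performs the calculation; the function name and sample values are made up.

```python
import math


def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logarithms.

    HM = S / sum(exp(-ll_s)), so log HM = log(S) - logsumexp(-ll_s).
    """
    negated = [-ll for ll in log_likelihoods]
    largest = max(negated)
    # subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow
    log_sum = largest + math.log(sum(math.exp(x - largest) for x in negated))
    return math.log(len(log_likelihoods)) - log_sum


# hypothetical log-likelihoods of two samples; math.exp(100000.0) would overflow
samples = [-100000.0, -100010.0]
result = log_harmonic_mean(samples)
print(result)
```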

So for the full set of features, you should run the following (optionally adding gmpy2 if you have installed the above requirements):

pip install -U Pattern matplotlib wordcloud Pillow openpyxl lda scikit-learn gensim

Requirements

tmtoolkit works with Python 3.5 or above.

Requirements are automatically installed via pip. Additional packages can also be installed via pip for certain use cases (see optional packages).

A special note for Windows users: tmtoolkit has been tested on Windows and works well (I recommend using the Anaconda distribution for Python there). However, you will need to wrap all code that uses multi-processing (i.e. all calls to TMPreproc and the parallel topic modeling functions) in an if __name__ == '__main__' block like this:

def main():
    # code with multi-processing comes here
    # ...

if __name__ == '__main__':
    main()

See the examples directory.

Required packages

six
NumPy
SciPy
NLTK
Pandas
Pyphen

Please note: You will need to install several corpora and language models from NLTK if you have not done so yet. Run python -c 'import nltk; nltk.download()', which will open a graphical downloader interface. You will need at least the following data packages:

averaged_perceptron_tagger
punkt
stopwords
wordnet
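
If no graphical interface is available, the same packages can be fetched by name, since nltk.download() also accepts a package identifier. The helper below is a small sketch; the names REQUIRED_NLTK_PACKAGES and download_required are made up for illustration, and the optional download argument exists only so the function can be tried without network access.

```python
REQUIRED_NLTK_PACKAGES = [
    'averaged_perceptron_tagger',
    'punkt',
    'stopwords',
    'wordnet',
]


def download_required(download=None):
    """Download each required NLTK data package and report success per package."""
    if download is None:
        # import lazily so that defining this helper does not require NLTK
        import nltk
        download = nltk.download
    return {name: download(name) for name in REQUIRED_NLTK_PACKAGES}
```

Calling download_required() without arguments uses nltk.download and needs a network connection.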

Documentation

Documentation for many methods is still missing at the moment and will be added in the future. For the moment, you should have a look at the examples below and in the examples directory.

Examples

Some examples that can be run directly in an IPython session:

Preprocessing

We will process a small, self-defined toy corpus with German text. It will be tokenized, cleaned and transformed into a document-term matrix. You will need to wrap this in an if __name__ == '__main__' block if you're using Windows. See "A special note for Windows users" above.

from tmtoolkit.preprocess import TMPreproc

# a small toy corpus in German, here directly defined as a dict
# to load "real" (text) files use the methods in tmtoolkit.corpus
corpus = {
    'doc1': 'Ein einfaches Beispiel in einfachem Deutsch.',
    'doc2': 'Es enthält nur drei sehr einfache Dokumente.',
    'doc3': 'Die Dokumente sind sehr kurz.',
}

# initialize
preproc = TMPreproc(corpus, language='german')

# run the preprocessing pipeline: tokenize, POS tag, lemmatize, transform to
# lowercase and then clean the tokens (i.e. remove stopwords)
preproc.tokenize().pos_tag().lemmatize().tokens_to_lowercase().clean_tokens()

print(preproc.tokens)
# this will output: 
#  {'doc1': ('einfach', 'beispiel', 'einfach', 'deutsch'),
#   'doc2': ('enthalten', 'drei', 'einfach', 'dokument'),
#   'doc3': ('dokument', 'kurz')}

print(preproc.tokens_with_pos_tags)
# this will output: 
# {'doc1': [('einfach', 'ADJA'),
#   ('beispiel', 'NN'),
#   ('einfach', 'ADJA'),
#   ('deutsch', 'NN')],
# 'doc2': [('enthalten', 'VVFIN'),
#   ('drei', 'CARD'),
#   ('einfach', 'ADJA'),
#   ('dokument', 'NN')],
#  'doc3': [('dokument', 'NN'), ('kurz', 'ADJD')]}

# generate sparse DTM and print it as a data table
doc_labels, vocab, dtm = preproc.get_dtm()

import pandas as pd
print(pd.DataFrame(dtm.todense(), columns=vocab, index=doc_labels))
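
For readers unfamiliar with document-term matrices, the sketch below builds one by hand from the token output shown above. It only illustrates the data structure: it uses a dense list of lists rather than a sparse matrix, it is not what get_dtm() does internally, and the function name build_dtm is made up.

```python
from collections import Counter

# the preprocessed tokens as printed in the example above
tokens = {
    'doc1': ('einfach', 'beispiel', 'einfach', 'deutsch'),
    'doc2': ('enthalten', 'drei', 'einfach', 'dokument'),
    'doc3': ('dokument', 'kurz'),
}


def build_dtm(tokens):
    """Return (doc_labels, vocab, dtm) where dtm[i][j] counts vocab[j] in document i."""
    doc_labels = sorted(tokens)
    vocab = sorted({term for doc in tokens.values() for term in doc})
    counts = {label: Counter(tokens[label]) for label in doc_labels}
    # a Counter returns 0 for terms that do not occur in the document
    dtm = [[counts[label][term] for term in vocab] for label in doc_labels]
    return doc_labels, vocab, dtm


doc_labels, vocab, dtm = build_dtm(tokens)
print(vocab)
for label, row in zip(doc_labels, dtm):
    print(label, row)
```

Each row is a document and each column a vocabulary term, so 'einfach' gets a count of 2 in the row for doc1.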

Topic modeling

We will use the lda package for topic modeling. Several models for different numbers of topics and alpha values are generated and compared. The best is chosen and the results are printed.

from tmtoolkit.topicmod import tm_lda
import lda  # for the Reuters dataset

import matplotlib.pyplot as plt
plt.style.use('ggplot')

doc_labels = lda.datasets.load_reuters_titles()
vocab = lda.datasets.load_reuters_vocab()
dtm = lda.datasets.load_reuters()

# evaluate topic models with different parameters
const_params = dict(n_iter=100, random_state=1)  # low number of iter. just for showing how it works here
varying_params = [dict(n_topics=k, alpha=1.0/k) for k in range(10, 251, 10)]

# this will evaluate 25 models (with n_topics = 10, 20, .. 250) in parallel
models = tm_lda.evaluate_topic_models(dtm, varying_params, const_params,
                                      return_models=True)

# plot the results
# note that since we used a low number of iterations, the plot looks quite "unstable"
# for the given metrics.
from tmtoolkit.topicmod.visualize import plot_eval_results
from tmtoolkit.topicmod.evaluate import results_by_parameter

results_by_n_topics = results_by_parameter(models, 'n_topics')
plot_eval_results(results_by_n_topics)
plt.show()

# the peak seems to be around n_topics == 140
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words, print_ldamodel_doc_topics

best_model = dict(results_by_n_topics)[140]['model']
print_ldamodel_topic_words(best_model.topic_word_, vocab)
print_ldamodel_doc_topics(best_model.doc_topic_, doc_labels)
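
Instead of reading the peak off the plot, the best parameter value can also be picked programmatically. The sketch below assumes the results are a list of (parameter value, result dict) pairs, which is what the dict(results_by_n_topics)[140]['model'] lookup above suggests. The metric key 'score', the sample numbers and the function name best_parameter are hypothetical; check the keys your evaluation results actually contain.

```python
def best_parameter(results, metric, maximize=True):
    """Return the parameter value whose result dict has the best metric value."""
    pick = max if maximize else min
    value, _ = pick(results, key=lambda pair: pair[1][metric])
    return value


# hypothetical evaluation results: (n_topics, {metric name: value})
example_results = [
    (10, {'score': -1.5}),
    (20, {'score': -1.2}),
    (30, {'score': -1.4}),
]

best_k = best_parameter(example_results, 'score')
print(best_k)
```

Pass maximize=False for metrics where lower values are better.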

More examples can be found in the examples directory.

Example data

The provided samples in examples/data come from:

bt18_sample_1000.pickle: offenesparlament.de
gutenberg: Project Gutenberg

License

Code licensed under Apache License 2.0. See LICENSE file.
