Markus Konrad
tmtoolkit is a set of tools for text mining and topic modeling with Python. It contains functions for text preprocessing such as lemmatization, stemming and POS tagging, especially for English and German texts. Preprocessing is done in parallel by using all available processors on your machine. The topic modeling features include evaluation metrics and the ability to compute models with different parameters in parallel and compare them (e.g. in order to find the optimal number of topics and other parameters). Topic models can be generated in parallel for different corpora and/or parameter sets using the LDA implementations from lda, scikit-learn or gensim.
Text preprocessing is built on top of NLTK and CLiPS pattern. Common features include:
- tokenization
- POS tagging (optimized for German and English)
- lemmatization (optimized for German and English)
- stemming
- cleaning tokens
- filtering tokens
- generating n-grams
- generating document-term-matrices
Preprocessing is done in parallel by using all available processors on your machine, greatly improving processing speed as compared to sequential processing on a single processor.
Topic models can be generated in parallel for different corpora and/or parameter sets using the LDA implementations from lda, scikit-learn or gensim. They can be evaluated and compared (for example in order to find the optimal number of topics) using several implemented metrics:
- Pair-wise cosine distance method (Cao Juan et al. 2009)
- KL divergence method (Arun et al. 2010)
- Harmonic mean method (Griffiths, Steyvers 2004)
- Probability of held-out documents (Wallach et al. 2009)
- Model coherence (Mimno et al. 2011 or with metrics implemented in Gensim)
- the loglikelihood or perplexity methods natively implemented in lda, sklearn or gensim
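To give an intuition for the first metric in the list: as far as I understand the pair-wise cosine method, it scores a model by how similar its topics are to each other, and a lower average similarity suggests better separated topics. The following is a minimal pure-Python sketch of that idea, not tmtoolkit's implementation. The function names and the two tiny topic-word matrices are made up for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine_similarity(topic_word):
    """Average cosine similarity over all pairs of topics (rows)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# made-up topic-word distributions: 3 topics over a vocabulary of 4 words
distinct_topics = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
overlapping_topics = [
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.35, 0.20, 0.10],
    [0.40, 0.25, 0.25, 0.10],
]

score_distinct = mean_pairwise_cosine_similarity(distinct_topics)
score_overlapping = mean_pairwise_cosine_similarity(overlapping_topics)
print(score_distinct, score_overlapping)
```

The model whose topics put their probability mass on different words gets the lower score, which is the direction this kind of metric prefers.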
Further features include:
- plot evaluation results
- export estimated document-topic and topic-word distributions to Excel
- visualize topic-word distributions and document-topic distributions as word clouds (see lda_visualization Jupyter Notebook)
- visualize topic-word distributions and document-topic distributions as heatmaps (see lda_visualization Jupyter Notebook)
- integrate PyLDAVis to visualize results
- coherence for individual topics (see model_coherence Jupyter Notebook)
The package is available on PyPI and can be installed via Python package manager pip:
# recommended installation
pip install -U tmtoolkit[excel_export,plotting,wordclouds]
# minimal installation:
pip install -U tmtoolkit
The package is about 13 MB in size because it contains some additional German language model data for POS tagging.
Upgrade notice: If upgrading from an older version to 0.6.0 or above, you will need to uninstall tmtoolkit first (run pip uninstall tmtoolkit) before re-installing (using one of the commands described above).
PyPI packages which can be installed via pip are written in italics.
- for improved lemmatization of German texts: Pattern
- for plotting/visualizations: matplotlib
- for the word cloud functions: wordcloud and Pillow
- for Excel export: openpyxl
- for topic modeling, one of the LDA implementations: lda, scikit-learn or gensim
- for additional topic model coherence metrics: gensim
For the LDA evaluation metrics griffiths_2004 and held_out_documents_wallach09 it is necessary to install gmpy2 for multiple-precision arithmetic. This in turn requires installing some C header libraries for GMP, MPFR and MPC. On Debian/Ubuntu systems this is done with:
sudo apt install libgmp-dev libmpfr-dev libmpc-dev
After that, gmpy2 can be installed via pip.
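A short illustration of why multiple-precision arithmetic helps here: metrics such as the harmonic mean method work with likelihoods of whole corpora, whose logarithms are large negative numbers, and exponentiating those underflows to zero with ordinary floats. The sketch below uses Python's standard decimal module as a stand-in for gmpy2, and the log-likelihood values are made up. It is not tmtoolkit's implementation, and log_harmonic_mean is a hypothetical helper.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50  # significant digits used for Decimal arithmetic

# made-up log-likelihoods from three sampling iterations
log_likelihoods = [-10000.0, -10001.5, -9999.2]

# with ordinary floats, every likelihood underflows to 0.0
float_likelihoods = [math.exp(ll) for ll in log_likelihoods]
print(float_likelihoods)


def log_harmonic_mean(lls):
    """Log of the harmonic mean of the likelihoods exp(ll), for ll in lls."""
    # 1 / exp(ll) == exp(-ll); these values are far too large for a float
    inverse_likelihoods = [Decimal(-ll).exp() for ll in lls]
    harmonic_mean = Decimal(len(lls)) / sum(inverse_likelihoods)
    return float(harmonic_mean.ln())


result = log_harmonic_mean(log_likelihoods)
print(result)
```

The result stays finite and lies between the smallest and the largest input log-likelihood, whereas the float version cannot even represent the individual likelihoods.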
So for the full set of features, you should run the following (optionally adding gmpy2 if you have installed the above requirements):
pip install -U Pattern matplotlib wordcloud Pillow openpyxl lda scikit-learn gensim
tmtoolkit works with Python 3.5 or above.
Requirements are automatically installed via pip. Additional packages can also be installed via pip for certain use cases (see optional packages).
A special note for Windows users: tmtoolkit has been tested on Windows and works well (I recommend using the Anaconda distribution for Python there). However, you will need to wrap all code that uses multi-processing (i.e. all calls to TMPreproc and the parallel topic modeling functions) in an if __name__ == '__main__' block like this:
def main():
    # code with multi-processing comes here
    # ...
    pass

if __name__ == '__main__':
    main()
See the examples directory.
- six
- NumPy
- SciPy
- NLTK
- Pandas
- Pyphen
Please note: You will need to install several corpora and language models from NLTK if you haven't done so yet. Run python -c 'import nltk; nltk.download()' which will open a graphical downloader interface. You will need at least the following data packages:
averaged_perceptron_tagger
punkt
stopwords
wordnet
Documentation for many methods is still missing and will be added in the future. For the moment, you should have a look at the examples below and in the examples directory.
Some examples that can be run directly in an IPython session:
We will process a small, self-defined toy corpus with German text. It will be tokenized, cleaned and transformed into a document-term-matrix. You will need to wrap this in an if __name__ == '__main__' block if you're using Windows. See "Special note for Windows users" above.
from tmtoolkit.preprocess import TMPreproc
# a small toy corpus in German, here directly defined as a dict
# to load "real" (text) files use the methods in tmtoolkit.corpus
corpus = {
    'doc1': 'Ein einfaches Beispiel in einfachem Deutsch.',
    'doc2': 'Es enthält nur drei sehr einfache Dokumente.',
    'doc3': 'Die Dokumente sind sehr kurz.',
}
# initialize
preproc = TMPreproc(corpus, language='german')
# run the preprocessing pipeline: tokenize, POS tag, lemmatize, transform to
# lowercase and then clean the tokens (i.e. remove stopwords)
preproc.tokenize().pos_tag().lemmatize().tokens_to_lowercase().clean_tokens()
print(preproc.tokens)
# this will output:
# {'doc1': ('einfach', 'beispiel', 'einfach', 'deutsch'),
# 'doc2': ('enthalten', 'drei', 'einfach', 'dokument'),
# 'doc3': ('dokument', 'kurz')}
print(preproc.tokens_with_pos_tags)
# this will output:
# {'doc1': [('einfach', 'ADJA'),
# ('beispiel', 'NN'),
# ('einfach', 'ADJA'),
# ('deutsch', 'NN')],
# 'doc2': [('enthalten', 'VVFIN'),
# ('drei', 'CARD'),
# ('einfach', 'ADJA'),
# ('dokument', 'NN')],
# 'doc3': [('dokument', 'NN'), ('kurz', 'ADJD')]}
# generate sparse DTM and print it as a data table
doc_labels, vocab, dtm = preproc.get_dtm()
import pandas as pd
print(pd.DataFrame(dtm.todense(), columns=vocab, index=doc_labels))
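To make the last step more concrete: a document-term-matrix has one row per document and one column per vocabulary term, and each cell counts how often the term occurs in the document. The following pure-Python sketch builds such a matrix from the token output shown above. It only illustrates the data structure; get_dtm() returns a sparse matrix instead, and build_dtm is a hypothetical helper, not part of tmtoolkit.

```python
from collections import Counter

# tokens as printed by the preprocessing example above
tokens = {
    'doc1': ('einfach', 'beispiel', 'einfach', 'deutsch'),
    'doc2': ('enthalten', 'drei', 'einfach', 'dokument'),
    'doc3': ('dokument', 'kurz'),
}


def build_dtm(docs):
    """Return (doc_labels, vocab, dtm) with dtm as a dense list of rows."""
    doc_labels = sorted(docs)
    vocab = sorted({tok for doc_tokens in docs.values() for tok in doc_tokens})
    dtm = []
    for label in doc_labels:
        counts = Counter(docs[label])
        # Counter returns 0 for terms that do not occur in this document
        dtm.append([counts[term] for term in vocab])
    return doc_labels, vocab, dtm


doc_labels, vocab, dtm = build_dtm(tokens)
print(vocab)
for label, row in zip(doc_labels, dtm):
    print(label, row)
```

Real corpora produce matrices in which almost all cells are zero, which is why a sparse representation is the usual choice.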
We will use the lda package for topic modeling. Several models for different numbers of topics and alpha values are generated and compared. The best is chosen and the results are printed.
from tmtoolkit.topicmod import tm_lda
import lda # for the Reuters dataset
import matplotlib.pyplot as plt
plt.style.use('ggplot')
doc_labels = lda.datasets.load_reuters_titles()
vocab = lda.datasets.load_reuters_vocab()
dtm = lda.datasets.load_reuters()
# evaluate topic models with different parameters
const_params = dict(n_iter=100, random_state=1) # low number of iter. just for showing how it works here
varying_params = [dict(n_topics=k, alpha=1.0/k) for k in range(10, 251, 10)]
# this will evaluate 25 models (with n_topics = 10, 20, .. 250) in parallel
models = tm_lda.evaluate_topic_models(dtm, varying_params, const_params,
                                      return_models=True)
# plot the results
# note that since we used a low number of iterations, the plot looks quite "unstable"
# for the given metrics.
from tmtoolkit.topicmod.visualize import plot_eval_results
from tmtoolkit.topicmod.evaluate import results_by_parameter
results_by_n_topics = results_by_parameter(models, 'n_topics')
plot_eval_results(results_by_n_topics)
plt.show()
# the peak seems to be around n_topics == 140
from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words, print_ldamodel_doc_topics
best_model = dict(results_by_n_topics)[140]['model']
print_ldamodel_topic_words(best_model.topic_word_, vocab)
print_ldamodel_doc_topics(best_model.doc_topic_, doc_labels)
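Instead of reading the best number of topics off the plot, it could also be picked programmatically. The sketch below assumes that the results are a list of (n_topics, result dict) pairs, which is what the dict(results_by_n_topics)[140]['model'] lookup above suggests. The metric key 'score', the values and the placeholder model strings are all hypothetical, and whether a higher or a lower value is better depends on the metric used.

```python
# Hypothetical stand-in for results_by_n_topics: (n_topics, result dict) pairs.
# The 'score' key and all values are made up for illustration.
results_by_n_topics = [
    (130, {'model': 'model_k130', 'score': -1.31}),
    (140, {'model': 'model_k140', 'score': -1.25}),
    (150, {'model': 'model_k150', 'score': -1.28}),
]


def best_by_metric(results, metric, higher_is_better=True):
    """Return the (parameter value, result dict) pair with the best metric."""
    pick = max if higher_is_better else min
    return pick(results, key=lambda pair: pair[1][metric])


best_k, best_result = best_by_metric(results_by_n_topics, 'score')
print(best_k, best_result['model'])
```

With a low number of iterations the metric values are noisy, so it is still worth checking the plot before trusting a single best value.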
More examples can be found in the examples directory.
The provided samples in examples/data come from:
- bt18_sample_1000.pickle: offenesparlament.de
- gutenberg: Project Gutenberg
Code licensed under Apache License 2.0. See LICENSE file.