One of the great things about using Python for natural language processing (NLP) is its large ecosystem of tools and libraries. From tokenization to machine learning to data visualization, Python has something for every NLP task in your workflow. Of course, choosing the right tool isn’t always easy. Every NLP library provides slightly different functionality and a slightly different implementation. The key to finding the right tool is knowing what is out there and experimenting with each option until you know its strengths and weaknesses. To that end, below is a list of the major NLP tools in use today. We recommend you try them all out, if only to play around and see how they work.
Core NLP Tasks
Deconstructing text into a machine-interpretable form, be it a bag-of-words, a matrix, or some other representation, is a critical part of the NLP pipeline. The libraries below provide various mechanisms for these core NLP tasks.
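To make the idea of a machine-interpretable form concrete, here is a minimal sketch that turns two sentences into bags-of-words and then into a document-term matrix. It uses only the standard library; the function names (`tokenize`, `bag_of_words`, `to_matrix`) and the simple regex tokenizer are our own illustrative assumptions, not the API of any library on this list.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Map each token to the number of times it occurs in the text."""
    return Counter(tokenize(text))


def to_matrix(documents):
    """Build a document-term matrix: one row per document, one column per term."""
    bags = [bag_of_words(doc) for doc in documents]
    # The vocabulary is the sorted set of every term seen in any document.
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix


docs = ["The cat sat on the mat.", "The dog sat."]
vocabulary, matrix = to_matrix(docs)
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

The libraries below do the same kind of work, with far more robust tokenizers and more efficient data structures.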
Gensim is an open-source Python library used for a variety of tasks, including topic modeling, indexing, and document similarity. Gensim has functionality for latent semantic analysis, non-negative matrix factorization, latent Dirichlet allocation (LDA), and term frequency-inverse document frequency (TF-IDF) matrices. Gensim also comes packaged with models based on the fastText, word2vec, and doc2vec algorithms.
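If TF-IDF is new to you, the sketch below shows the idea in plain Python: a term's weight grows with how often it appears in a document and shrinks with how many documents contain it. This is a textbook variant (raw count times the natural log of N over document frequency) written for illustration only; it is not Gensim's code, and Gensim's own weighting and normalization may differ. The `tf_idf` function name is our own.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["topic", "models", "group", "documents"],
    ["word", "vectors", "embed", "documents"],
    ["topic", "models", "use", "word", "counts"],
]
weights = tf_idf(docs)

# "group" appears in one document, "documents" in two, so "group" scores higher.
print(weights[0]["group"] > weights[0]["documents"])  # True
```

A term that appears in every document gets a weight of zero under this variant, which is why TF-IDF is good at pushing down words that carry little distinguishing information.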
The Natural Language Toolkit (NLTK) is one of the oldest and most well-known NLP libraries. NLTK includes features such as tokenization, text similarity, named entity recognition, n-grams, part-of-speech tagging, treebanks, and more. NLTK is the go-to library for computational linguistics and other scientific fields, but it has wide application in business and personal projects as well.
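N-grams come up repeatedly in this list, so here is a minimal pure-Python sketch of what an n-gram extractor does: it slides a window of n tokens across a sequence. NLTK ships its own n-gram helper; the `ngrams` function below is our own stand-in for illustration and uses a plain whitespace split as the tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```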
Pattern is an NLP library that describes itself as a “web mining module for Python”, but in reality it does much more than help you mine text. Pattern provides facilities for common text manipulation tasks like tokenization, part-of-speech tagging, and chunking. In addition, Pattern has built-in classes for working with the APIs of services like Google, Twitter, and Wikipedia, among others. This makes Pattern one of the most versatile tools for any NLP practitioner.
Like many of the other tools on this list, Polyglot provides basic NLP text manipulation functionality like tokenization, named entity recognition, and built-in models. What really makes Polyglot special, though, is its language detection ability. Many NLP tools are biased towards English-language text, which makes Polyglot a useful and innovative option for non-English NLP tasks.
PyNLPl, known affectionately as ‘pineapple’, resembles other text manipulation tools in that it offers basic functionality like n-gram extraction and built-in language models. Where it really excels is in parsing less common file formats (FoLiA, Giza, Moses, ARPA, Timbl, CQL). With a built-in formats module, PyNLPl makes working with such files a breeze.
When it comes to NLP applications in production, spaCy is arguably the king. spaCy aims to provide developers with a framework for NLP tasks that integrates seamlessly with deep learning tools and works in production out of the box. Unlike some of the science and research-oriented tools on this list, spaCy is specifically built to be used in production applications. And since spaCy is built in a mix of Python and Cython, it has all the speed you could need. Whether you’re interested in basic tasks like tokenization, pre-trained word vector models, or built-in visualization capacity, spaCy has you covered.
Stanza is a new tool in the Python world, but the folks at the Stanford NLP Group are far from newcomers. Stanza brings the power of the Stanford CoreNLP Java package to the Python world through Python bindings to a CoreNLP client, and it also provides native Python implementations of basic NLP tasks. Plus, Stanza has multilingual support for 66 languages, with pre-trained neural models to support each of them.
Textacy is built on top of spaCy and delegates basic parsing tasks to spaCy’s models. Textacy differentiates itself by providing functionality for the tasks that come before and after the processing that spaCy does, such as n-gram extraction, chunking, named entity recognition, and more. Plus, Textacy has built-in datasets and a few unique functions like readability statistics that are missing in many other libraries.
Processing textual data is a critical part of every NLP job, and TextBlob can help you in just about every step of processing. With features for part-of-speech tagging, sentiment analysis, n-grams, tokenization, and classification, TextBlob is a versatile library that remains easy to use.
In the age of machine learning, processing text is often just the first step in an NLP workflow. The next step is using your data to classify documents, make predictions, or carry out any number of other artificial intelligence tasks. The number of machine learning libraries out there seems to grow with each passing day, but the libraries below are proven winners in the field.
The folks at fast.ai are widely known for their free MOOCs, but their library for training neural networks is top-notch as well. The fastai library is built on top of PyTorch and provides high-level components that make the process of building deep learning applications both approachable and productive. fastai is also optimized for GPUs to make the training process even faster.
Keras is an approachable API for deep learning that runs on top of TensorFlow. The API provides practitioners with built-in data structures (layers and models) for common deep learning tasks, and it can take advantage of a TPU or GPUs for training. Keras is particularly popular in the scientific and research world, but it can be used by NLP practitioners at any level.
PyTorch is a machine learning framework meant for everyone from research scientists to budding startups. Whether you’re carrying out deep learning tasks locally or in the cloud, on a CPU or on GPUs, PyTorch can help you along the way. Arguably the best thing about PyTorch is its massive ecosystem of add-on libraries that extend PyTorch’s functionality into specific fields, including NLP.
When it comes to the big names in machine learning, scikit-learn is near the top. scikit-learn is built on top of popular scientific computing libraries NumPy and SciPy and facilitates machine learning tasks ranging from classification to prediction. scikit-learn comes with a large variety of built-in models and algorithms to help you build machine learning applications quickly and efficiently. Plus, scikit-learn’s popularity means that there is a robust user community and plenty of learning material available online.
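As a rough illustration of how quickly a text classifier comes together, the sketch below chains a TF-IDF vectorizer and a naive Bayes model into a single scikit-learn pipeline. It assumes scikit-learn is installed; the four training sentences and their labels are toy data we made up for the example, so treat the result as a demonstration of the workflow rather than a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented purely for illustration.
train_texts = [
    "great film, loved the acting",
    "wonderful story and a great cast",
    "terrible plot, boring and slow",
    "awful acting and a boring script",
]
train_labels = ["positive", "positive", "negative", "negative"]

# The pipeline converts raw text to TF-IDF features, then fits the classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["a great and wonderful film", "boring and terrible"])
print(list(predictions))
```

Because the vectorizer and the classifier live in one pipeline object, the same text preprocessing is applied at training time and at prediction time, which removes a common source of bugs.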
TensorFlow is Google’s addition to the world of open source machine learning tools. TensorFlow lets you train neural networks for a variety of tasks and comes with a large ecosystem of community add-ons for your specific needs. And since TensorFlow is built by Google, it integrates easily with the rest of Google’s products, which makes deployment in production easy (presuming you’re using Google’s products).
Processing text data and building neural networks are all well and good, but if you can’t communicate your insights to anyone, then there isn’t much point. That’s why every NLP practitioner should have a few presentation tools in their toolbox.
One of the most common ways to communicate NLP insights is through dashboards. As you might guess from the name, Dash lets you do exactly that. Dash is built on top of Flask and React.js and lets users build dashboards out of re-usable components. And since Dash is made by the good folks at Plotly, it integrates perfectly with their open source graphing libraries, meaning that you can construct modern graphical dashboards quickly and easily. Dash is free for open source users and provides paid plans for enterprise users.
If a dashboard doesn’t quite fit your needs and you find yourself wanting a full-blown web app, then Flask may be for you. Flask is an extendable web application framework that can be used for anything from a simple blog all the way up through a production-ready application with user authentication, database connections, and more. And since Flask has been around for a while, there is an immense amount of learning material available online.
Jupyter isn’t strictly a Python library, but we would be remiss if we didn’t mention perhaps the most popular presentation mechanism out there for data science projects. Jupyter notebooks provide local and cloud-hosted ways to combine code and text when analyzing data. And with its Visual Studio Code extension, Jupyter’s IPython shell lets you write code to explore your data and view the results in one place.
A key part of effective data presentation is creating engaging visualizations. NLP may be all about text, but often the best way to communicate insights is through graphical form. Python has a large number of data visualization tools, some of which are specifically built for visualizing textual relationships.
Altair is a declarative visualization library with a simple API that lets you plot a large variety of data types. Altair is built on top of the Vega-Lite specification, which means that you can quickly render and explore interactive data in a variety of environments, including notebooks and applications.
Bokeh is a powerful open source visualization library that focuses on interactivity. As a result, you can use Bokeh to plot data in applications, dashboards, and notebooks, and your users can then manipulate the visualizations directly. This makes for a much stronger data-user relationship.
Of all the data visualization libraries, Matplotlib is perhaps the most storied. Matplotlib lets you produce production-quality visualizations in either static or interactive form. And since Matplotlib is open source, it can be customized and extended to meet your needs. Matplotlib also has a passionate community of supporters, which makes learning the basics both fast and enjoyable.
When it comes to visualizing the relationship between terms in textual data, Scattertext is unique. Unlike some of the other visualization libraries described here, Scattertext is specifically meant for textual data. As a result, it provides functionality that is unavailable elsewhere, such as mapping term frequencies onto x and y axes in order to visualize term categories.
Seaborn is a statistical data visualization library built on top of Matplotlib. Although the two libraries are similar in many respects, Seaborn shines when used in conjunction with pandas DataFrames. As a result, Seaborn can make it easier to work with your data and quickly build engaging visualizations.
pyLDAvis is a unique visualization library used for a very particular task: topic modeling. As the name implies, pyLDAvis specializes in topic models built using LDA. pyLDAvis produces an interactive graphic that can be used to explore topics in a notebook or in a browser, which makes the process of analyzing topic models easy and efficient.
Word clouds can be great ways to quickly visualize term frequencies in textual data, which is exactly what the appropriately named wordcloud package lets you do. The library takes a bit more setup than other Python packages, but the resulting visualizations are engaging and attractive.
As with other fields of data science, NLP projects require lots of data. One of the best places to find that data is on the web, which makes knowledge of a good web scraper critical for any NLP practitioner.
When it comes to web scraping in Python, there is one undisputed champion—BeautifulSoup. BeautifulSoup lets you easily walk through HTML and XML documents in order to pull out the data you need. And since web scraping is such an important task in NLP, finding resources to help you learn BeautifulSoup syntax is as easy as a quick web search.
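To show what walking through an HTML document looks like under the hood, here is a minimal sketch that pulls the text out of every paragraph tag. Note that it deliberately uses the standard library's `html.parser` module rather than BeautifulSoup, so it runs without any installs; the `ParagraphExtractor` class and the sample HTML string are our own illustrative assumptions. BeautifulSoup wraps this kind of work in a much friendlier interface.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> element of an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start a fresh buffer whenever a paragraph opens.
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        # When the paragraph closes, store its accumulated text.
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._in_paragraph:
            self._buffer.append(data)


html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>Second one.</p></body></html>"
)
parser = ParagraphExtractor()
parser.feed(html)
parser.close()
print(parser.paragraphs)  # ['First bold paragraph.', 'Second one.']
```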
Phew! That was a lot of Python libraries! The good news is that you don’t have to master all of them at once. Different tools are useful in different situations, which is why we recommend trying the above tools one by one. As you get to know them, you’ll develop an intuition for which tool is best for any given project.