How to install and set up Spark NLP in high-compliance environments with no internet connection.

Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in 192+ languages. It supports nearly all NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 5 million times and experiencing 16x growth over the last 16 months, Spark NLP is used…

Converting natural language questions to SQL queries at scale

SQL is still one of the most sought-after skills in the industry (image from edX)

The amount of data produced daily has been increasing exponentially since the start of the new millennium. Most of this data is stored in relational databases. In the past, access to this data was mostly of interest to large companies, which are able to query it using Structured Query Language (SQL). With the growth of mobile phones, more and more personal data is being stored. Thus, more and more people from different backgrounds are trying to query and use their own data. Despite the meteoric rise in the popularity of data science, most people do not have adequate…

The first end-to-end pretrained models and pipelines to detect Adverse Drug Reactions at scale with the help of Spark NLP and BioBERT.

Photo by National Cancer Institute on Unsplash

Adverse Drug Reactions (ADRs) or Adverse Drug Events (ADEs) are potentially very dangerous to patients and are among the top causes of morbidity and mortality [1]. Many ADRs are hard to discover because they occur only in certain groups of people under certain conditions, and they may take a long time to surface. Healthcare providers conduct clinical trials to discover ADRs before selling the products, but these are normally limited in numbers. …

Training a SOTA multi-class text classifier with BERT and Universal Sentence Encoders in Spark NLP with just a few lines of code in less than 10 minutes.

Photo by AbsolutVision on Unsplash

Natural language processing (NLP) is a key component in many data science systems that must understand or reason about a text. Common use cases include text classification, question answering, paraphrasing or summarising, sentiment analysis, natural language BI, language modeling, and disambiguation.

NLP is essential in a growing number of AI applications. Extracting accurate information from free text is a must if you are building a chatbot, searching through a patent database, matching patients to clinical trials, grading customer service or sales calls, extracting facts from financial reports or solving for any of these 44 use cases across 17 industries.


Training a NER model with BERT in a few lines of code in Spark NLP and getting SOTA accuracy.

Photo by Jasmin Ne on Unsplash

NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Faster inference at runtime from Spark NLP pipelines

Photo by Rui Xu on Unsplash

This is the second article in a series in which we are going to write a separate article for each annotator in the Spark NLP library. You can find all the articles at this link.

This article builds mainly on Introduction to Spark NLP: Foundations and Basic Components (Part-I). Please read that first if you want to learn more about Spark NLP and its underlying concepts.

In machine learning, it is common to run a sequence of algorithms to process and learn from data. …

In this series, we are going to write a separate article for each annotator in the Spark NLP library, and this is the first one.

Photo by Joshua Sortino on Unsplash

Remember that in our first article, we talked about the specific column types that each Annotator accepts or outputs. So what do we do if our DataFrame doesn’t have columns of those types? This is where transformers come in.

In Spark NLP, we have five different transformers that are mainly used for getting data in or for transforming data from one AnnotatorType to another.

That is, the DataFrame you have needs to have a column…
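As a rough illustration of the idea above, here is a minimal plain-Python sketch of what a "get the data in" transformer does conceptually: it takes a raw text column and adds a column holding a typed annotation. This is not Spark NLP's API. The `Annotation` class, the `assemble_documents` function, and the column names are hypothetical stand-ins chosen for this example, and the field layout is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    # Hypothetical, simplified stand-in for a typed annotation record.
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def assemble_documents(
    rows: List[Dict[str, object]], input_col: str, output_col: str
) -> List[Dict[str, object]]:
    """Return new rows with an extra column holding a 'document' annotation."""
    assembled = []
    for row in rows:
        text = str(row[input_col])
        # Wrap the whole raw string as a single annotation spanning the text.
        doc = Annotation("document", 0, len(text) - 1, text)
        assembled.append({**row, output_col: [doc]})
    return assembled


# Rows play the role of DataFrame records in this sketch.
rows = [{"text": "Spark NLP scales well."}]
out = assemble_documents(rows, input_col="text", output_col="document")
```

Downstream stages in such a sketch would read the `document` column rather than the raw `text` column, which mirrors the point that each stage expects its input column to have a specific type.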

This is the second article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of the Spark NLP library from scratch and easily integrate it into their workflows.

This blog post covers installation on macOS and Linux machines, and I have also shared links below for Windows and Docker.

Photo by Kevin Horvat on Unsplash

In our first article, we gave an introduction to Spark NLP and its basic components and concepts. If you haven’t read the first part yet, please read it first.

Spark NLP is an open-source natural language processing library, built on top of Apache Spark…

* This is the first article in a series of blog posts to help Data Scientists and NLP practitioners learn the basics of the Spark NLP library from scratch and easily integrate it into their workflows. During this series, we will do our best to produce high-quality content and clear instructions, with accompanying code in both Python and Scala, covering the most important features of Spark NLP. Through these articles, we aim to make the underlying concepts of the Spark NLP library as clear as possible by addressing the practical issues and pain points with code and instructions. The ultimate goal is…

Veysel Kocaman

Senior Data Scientist and PhD Researcher in ML
