Alex Minnaar
Machine Learning at University College London. Research Engineer at Nitro.



ScalaNER: A Scala Wrapper for the Stanford NER Tool with Some Added Features

The Stanford NER (named entity recognizer) tool is a widely used, general-purpose named entity recognition tool that Stanford has made available as part of its CoreNLP Java library. It performs named entity recognition with a CRF-based sequence model that is known to give near state-of-the-art performance, which makes it a popular choice among open-source NER tools.

Having said that, I have used this tool in the past and I was left wanting more functionality. From its FAQ section, you can see that most of its functionality (i.e. training and testing an NER model) is designed to be used from the command line. But what if you want to use a pre-trained NER model as part of a real-time text processing pipeline? For example, you are processing a string of text and you want to apply your NER model to it and then do something with the tokens that were classified as named entities. There is no clear way to do this with the Stanford NER tool.

I have also found that the Stanford NER tool is lacking in model validation functionality. As with any classification model, I want to be able to run cross-validation on my training data so that I can be confident in how well the model generalizes. Again, there is no clear way of doing this. You can test your model on a test set and obtain the precision, recall and F1 values, but unfortunately these values are only printed to standard output and there is no way to persist them. Consequently, if you wanted to perform 50-fold cross-validation on your dataset, you would have to read each of the 50 sets of performance metrics off the screen and then average them by hand to get your desired result (or export standard output and parse it). Obviously no one wants to do this.

ScalaNER attempts to solve these problems by offering the following additional functionality to the Stanford NER tool.

  • Programmatically apply a pre-trained NER model to a string of text and output the labelled result.
  • Programmatically train an NER model.
  • Easy model validation, specifically cross-validation.

ScalaNER Demo

The following code samples demonstrate this new functionality. First of all, it should be noted that the training data sets must be in the same format that the Stanford NER tool accepts. That is, each line must contain a tab-separated token/label pair. Entity labels can be any string but non-entity labels must be "O". For example, training data for a person name entity might look like

The    O
US    O
president    O
is    O
Barack    NAME
Obama    NAME

where named entities are labelled as "NAME" and non-entity tokens are labelled as "O".
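To make the format concrete, here is a minimal sketch of how such a file could be parsed and sanity-checked before training. It is not part of ScalaNER or the Stanford NER tool; the names TrainingDataFormat, TokenLabel and parseLine are hypothetical and introduced only for illustration.

```scala
object TrainingDataFormat {
  // One token/label pair from a training file (hypothetical helper type).
  final case class TokenLabel(token: String, label: String)

  // Parses a "token<TAB>label" line; returns None for blank or malformed lines.
  def parseLine(line: String): Option[TokenLabel] =
    line.split("\t") match {
      case Array(token, label) if token.trim.nonEmpty && label.trim.nonEmpty =>
        Some(TokenLabel(token.trim, label.trim))
      case _ => None
    }

  // The example sentence from above, written with explicit tab separators.
  val sample: Seq[String] =
    Seq("The\tO", "US\tO", "president\tO", "is\tO", "Barack\tNAME", "Obama\tNAME")

  val parsed: Seq[TokenLabel] = sample.flatMap(parseLine)

  // Tokens labelled as entities, i.e. anything other than "O".
  val entities: Seq[String] = parsed.filter(_.label != "O").map(_.token)
}
```

Running a check like this over a training file is a cheap way to catch lines that use spaces instead of tabs before handing the file to the trainer.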

Train an NER Model

First we will demonstrate how to train an NER model given a training dataset in the format explained above. The code is very simple; in fact, it is only one line. It uses a Scala object called NERModel. To train an NER model you simply call this object's trainClassifier method, which takes two arguments: the location of the training data file (it must be a text file) and the filename and location where the trained NER model will be saved.

NERModel.trainClassifier("my/data/location.txt", "save/my/model/here.ser.gz")

Apply an NER Model

Once you have trained your NER model, you will probably want to apply it to some new text. To do this we use the ApplyModel class, which takes the location of the trained model as its constructor argument. Once this class has been instantiated, we call its runNER method, which takes a string as its input argument. This input string is the text from which you want to extract the named entities. The result is an indexed sequence of LabeledToken objects, each of which contains a token field and a label field. The token fields contain the tokens of the input string and the label fields contain the entity labels that were assigned to those tokens.

val classifier=new ApplyModel("my/pretrained/model.ser.gz")

val results=classifier.runNER("Find named entities in this new sentence.")
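As a sketch of the "do something with the tokens" step, the code below merges adjacent tokens that share an entity label into a single entity string. The LabeledToken case class here is a local stand-in that assumes only the two fields described above (token and label); Entity and extractEntities are hypothetical names and are not part of ScalaNER. Because the label format has no begin/inside markers, two distinct entities of the same type that sit next to each other would be merged into one.

```scala
object EntityExtraction {
  // Local stand-in for ScalaNER's LabeledToken (assumed shape: token and label fields).
  final case class LabeledToken(token: String, label: String)

  // A run of consecutive tokens sharing the same entity label (hypothetical type).
  final case class Entity(text: String, label: String)

  def extractEntities(tokens: Seq[LabeledToken]): Vector[Entity] = {
    // Collect runs of adjacent tokens that share a label.
    val runs = tokens.foldLeft(Vector.empty[(String, Vector[String])]) { (acc, t) =>
      acc.lastOption match {
        case Some((label, words)) if label == t.label =>
          acc.init :+ ((label, words :+ t.token))
        case _ =>
          acc :+ ((t.label, Vector(t.token)))
      }
    }
    // Keep only entity runs (label other than "O") and join their tokens with spaces.
    runs.collect { case (label, words) if label != "O" =>
      Entity(words.mkString(" "), label)
    }
  }
}
```

With the real library, the same idea could be applied to the sequence returned by runNER, provided its elements expose the token and label fields as described.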

Performing Cross-Validation on an NER Model

To perform cross-validation we use the CrossValidation class, which takes the number of folds and the training data location as constructor arguments. Then we call the runCrossValidation method with the location of the directory where the training and validation sets will be written. The result is a vector with one element per fold. Each element is a map whose keys are the unique entity types in that fold and whose values hold the precision, recall and F1-score of the corresponding entity type.

val testInstance = new CrossValidation(5, "location/of/training/data.txt")

val xvalResults = testInstance.runCrossValidation("directory/to/write/xvalidation/data/")
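For readers unfamiliar with k-fold cross-validation, the sketch below shows one common way to produce the training and validation sets: items are assigned to folds round-robin, and each fold takes a turn as the validation set. This is an illustration only and is not necessarily how ScalaNER partitions the data; KFoldSketch and kFoldSplits are hypothetical names. For NER data the items would typically be whole sentences rather than single token lines, so that each token keeps its context.

```scala
object KFoldSketch {
  // Returns k (training, validation) pairs; each item lands in exactly one validation set.
  def kFoldSplits[A](items: Vector[A], k: Int): Vector[(Vector[A], Vector[A])] = {
    require(k >= 2 && k <= items.length, "k must be between 2 and the number of items")
    val indexed = items.zipWithIndex
    (0 until k).toVector.map { fold =>
      // Items whose index is congruent to `fold` modulo k form the validation set.
      val (validation, training) = indexed.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }
}
```

A model is then trained on each training set and scored on the matching validation set, which gives one set of metrics per fold.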

Next let's look at a real-world example.

Example: Identifying Protein Names

Suppose that you wanted to train an NER model to identify protein names in biomedical literature. We will use the BioNLP dataset, which has already been transformed into the correct Stanford NER format and can be found in the ScalaNER GitHub repo.

First let's try training an NER model with this data and running it on a sample string of text to determine if it contains any protein names.

NERModel.trainClassifier("data/bionlp.txt", "/bioNlpModel.ser.gz")

val classifier=new ApplyModel("/bioNlpModel.ser.gz")

val results=classifier.runNER("Leukotriene B4 stimulates c-fos and c-jun gene transcription and AP-1 binding activity in human monocytes.")


This gives the following output:

Vector(LabeledToken(Leukotriene,O), LabeledToken(B4,O), LabeledToken(stimulates,O), LabeledToken(c-fos,protein), LabeledToken(and,O), LabeledToken(c-jun,protein), LabeledToken(gene,O), LabeledToken(transcription,O), LabeledToken(and,O), LabeledToken(AP-1,O), LabeledToken(binding,O), LabeledToken(activity,O), LabeledToken(in,O), LabeledToken(human,O), LabeledToken(monocytes,O), LabeledToken(.,O))

As you can see, the trained model assigns the correct protein label to the tokens "c-fos" and "c-jun", and all other tokens are assigned the O label, indicating that they are not named entities.

Next, let's perform 5-fold cross-validation on the entire dataset to get a good idea of how well the model generalizes. This can be done with the following code, where we specify the folder "data/xval" as the location where the 5 training and validation sets will be written.

val cv = new CrossValidation(5, "data/bionlp.txt")

val testResults = cv.runCrossValidation("data/xval")


This gives the following output:

Vector(Map(protein -> Performance(0.680461329715061,0.9862340216322517,0.8052990766760337), O -> Performance(0.999634483838964,0.9878479836941098,0.9937062846316554)), Map(O -> Performance(0.9991162403826159,0.9858425237240318,0.9924350003872866), protein -> Performance(0.5766871165644172,0.9567430025445293,0.7196172248803828)), Map(O -> Performance(0.9986442092089483,0.9858183409260546,0.9921898273472612), protein -> Performance(0.6125175808720112,0.9436619718309859,0.7428571428571428)), Map(O -> Performance(0.9994266652767643,0.9878419452887538,0.9936005389019872), protein -> Performance(0.6638176638176638,0.9769392033542977,0.7905004240882104)), Map(O -> Performance(0.9988831168831169,0.9877484974572354,0.9932846036624738), protein -> Performance(0.6261755485893417,0.9489311163895487,0.7544853635505192)))

The above output shows the precision, recall and F1-scores for each entity type (in this case protein and O) in each of the 5 folds. Rounded to four decimal places, the F1-scores for identifying protein named entities are 0.8053, 0.7196, 0.7429, 0.7905 and 0.7545, for an average F1-score of 0.7626.
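Since the results come back as ordinary Scala collections, the averaging can be done in code rather than by hand. The sketch below uses a local Performance case class as a stand-in; its field names (precision, recall, f1) and the use of String keys are assumptions based on the printed output above, and AverageF1 and meanF1 are hypothetical names that are not part of ScalaNER.

```scala
object AverageF1 {
  // Local stand-in for ScalaNER's Performance (assumed field order: precision, recall, F1).
  final case class Performance(precision: Double, recall: Double, f1: Double)

  // Mean F1 for one entity type over the folds that contain it; None if no fold has it.
  def meanF1(folds: Seq[Map[String, Performance]], entity: String): Option[Double] = {
    val scores = folds.flatMap(_.get(entity)).map(_.f1)
    if (scores.isEmpty) None else Some(scores.sum / scores.size)
  }

  // The five folds from the output above.
  val folds: Vector[Map[String, Performance]] = Vector(
    Map("protein" -> Performance(0.680461329715061, 0.9862340216322517, 0.8052990766760337),
        "O" -> Performance(0.999634483838964, 0.9878479836941098, 0.9937062846316554)),
    Map("O" -> Performance(0.9991162403826159, 0.9858425237240318, 0.9924350003872866),
        "protein" -> Performance(0.5766871165644172, 0.9567430025445293, 0.7196172248803828)),
    Map("O" -> Performance(0.9986442092089483, 0.9858183409260546, 0.9921898273472612),
        "protein" -> Performance(0.6125175808720112, 0.9436619718309859, 0.7428571428571428)),
    Map("O" -> Performance(0.9994266652767643, 0.9878419452887538, 0.9936005389019872),
        "protein" -> Performance(0.6638176638176638, 0.9769392033542977, 0.7905004240882104)),
    Map("O" -> Performance(0.9988831168831169, 0.9877484974572354, 0.9932846036624738),
        "protein" -> Performance(0.6261755485893417, 0.9489311163895487, 0.7544853635505192))
  )

  // Average protein F1 across the five folds, roughly 0.7626.
  val proteinMeanF1: Option[Double] = meanF1(folds, "protein")
}
```

The same function works for precision or recall by mapping over a different field, which is the kind of post-processing that is awkward when the metrics only exist in standard output.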