This documentation is best viewed on ReadTheDocs.
CoreNLP is an excellent multi-purpose NLP tool written in Java by folks at Stanford. Have a look at at the CoreNLP website. This package is meant to make it a bit easier for python users to enjoy CoreNLP, by providing two basic functionalities:
- Invoke the CoreNLP pipeline to process text from within a Python script
- Load an annotated text file into memory as an AnnotatedText object, so you can traverse dependency parses, inspect POS tags, or find coreferent mentions without munging xml.
To use corenlpy you will need to install CoreNLP itself first. Follow
these steps:
- Download CoreNLP.
- Unzip it, and move (rename) the resulting unzipped directory to
~/corenlp.pip install corenlpy
You're ready to get parsing!
By default, corenlpy looks for a CoreNLP installation at ~/corenlp.
Optionally, you can put CoreNLP somewhere else, and tell corenlpy where
to find it by writing an rc file at ~/.corenlpyrc with the following
line:
{"corenlp_path": "/path/to/the/corenlp/unzipped/dir"}OK, let's annotate some files! To run corenlp on all the files in, say,
my_dir, you would do something like this:
>>> import corenlpy
>>> corenlpy.corenlp('path/to/my_dir')By default, all the files present in my_dir will be annotated and stored
as xml files in the same directory (named by appending '.xml').
You can control how CoreNLP is run using various keyword arugments.
Specify multiple input directories, or provide an iterable of
individual files to annotate, and specify the output directory. You
can take advantage of more cores by running CoreNLP with
multiple threads. You can also choose from the available output formats
(default is 'xml'; others that are available are
'json', 'conll', 'conllu',
'text', and 'serialized', as explained here.).
The CoreNLP tool allows you to specify various annotators that perform different core NLP tasks, like tokenizing, sentence splitting, POS-tagging, lemmatization, named entity recognition, etc. You can specify which annotators you want by providing a list of them. Read about the available annotators here.
Here is an example of all these options in action:
>>> corenlpy.corenlp(
in_dirs=['path/to/dir1', 'path/to/dir2'],
in_files=['path/to/some/file1', 'path/to/some/file2'],
out_dir='path/to/output_direcoty',
annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'dcoref'],
threads=4,
output_format='xml'
)Note that in_dirs can be a single directory or a list thereof, and
in_files can be a single file or a list thereof. When directories
are provided, CoreNLP will be invoked on all files within them.
CoreNLP has many other options that can be specified by a
"properties file"
(see subheading "Configuration"). In corenlpy, those options can be
set by passing in a dictionary, where the keys are property names, and the
values are the property values. E.g.:
>>> c.corenlp(
'path/to/my_file',
properties={'ner.model':'edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz'}
)Note that some of the keyword options we saw above can also be set in a properties file; the value in the properties file takes precedence.
The AnnotatedText class is what originally motivated the creation of
this package. If you need to work with annotation outputs from CoreNLP
in Python, this will save you a lot of time. It's best to illustrate how
it works using an example.
Suppose we have the one-sentence document:
President Obama cannot run for a third term (but I think he wants to).
Let's assume that it has been processed by CoreNLP, creating the output
file obama.txt.xml.
The first thing we do is import the module and get an AnnotatedText
object.
>>> from corenlpy import AnnotatedText as A
>>> xml = open('obama.txt.xml').read()
>>> annotated_text = A(xml)Usually you'll access parts of the document using the sentences list.
>>> len(annotated_text.sentences)
1
>>> sentence = annotated_text.sentences[0]
>>> sentence.keys()
['tokens', 'entities', 'references', 'mentions', 'root', 'id']A Sentence is a special class that, for the most part, feels like a
simple dict.
The tokens property is a list of the sentence's tokens:
>>> obama = sentence['tokens'][1]
>>> obama
' 0: Obama (10,14) NNP PERSON'
>>> term = sentence['tokens'][7]
>>> term
' 7: term (39,42) NN -'Tokens have properties corresponding to CoreNLP's annotations, plus some other stuff:
>>> obama.keys()
['word', 'character_offset_begin', 'character_offset_end', 'pos',
'lemma', 'sentence_id', 'entity_idx', 'speaker', 'mentions', 'parents',
'ner', 'id']"Obama" is the name of a person, so, if CoreNLP is working well, it should
pick that up. Named entity information is found in the ner property:
>>> obama['ner']
'PERSON'
>>> term['ner'] is None
TrueSimilarly we can check the part-of-speech:
>>> obama['pos']
'NNP'
>>> term['pos']
'NN'We can traverse the dependency tree using the parents and children
properties. In our example, "run" is the parent of "Obama"
(because "Obama" is the subject (nsubj) of "run"):
>>> relation, parent = obama['parents'][0]
>>> relation
u'nsubj'
>>> parent
' 3: run (23,25) -'If you're processing dependency trees, you'll often want to start with
the head word (which is like the root of the sentence). Sentences have a
special root property that stores the head word. Usually it's a verb:
>>> sentence['root']
' 3: run (23,25) -'A coreference chain is a series of references to the same entity. In our example, "President Obama" and "he" are each mentions from the same coreference chain. We can access all the mentions of a coreference chain.
First, we can get the mention that "Obama" is part of:
>>> first_mention = obama['mentions'][0]
>>> first_mention['tokens']
[' 0: President (0,8) -', ' 1: Obama (10,14) PERSON']Note that a token can be part of multiple mentions. For example, consider the phrase "Obama's pyjamas". If his pyjamas are mentioned multiple times, then there will be a coreference chain made for it, as well as for Obama himself. And in the phrase "Obama's pyjamas", the token "Obama" is both part of a mention corresponding to the 44th President of the United States, and part of a mention corresponding to some garments for sleeping.
Once we have gotten ahold of a mention, we can access the coreference
chain that it belongs to, which is found in the mention's 'reference'
property. Conversely, if we have accessed a coreference chain, we can
find all of its mentions by looking at its 'mentions' property.
So, starting from the mention containing the token "Obama", we can get to the other mention ("he") like this:
>>> reference = first_mention['reference']
>>> len(reference['mentions'])
2
>>> second_mention = reference['mentions'][1]
>>> second_mention['tokens']
['12: he (57,58) -']Mentions have various properties:
>>> first_mention.keys()
['head', 'end', 'reference', 'tokens', 'start', 'sentence_id']In addition to the coreference chain ('reference'), we get the id of
the sentence in which the mention is found, the list
of token objects in the mention, the slice indices
('start' and 'end') for those tokens as they occur in the
sentence's token list, and the head token of the
mention.
References have various properties too:
>>> reference.keys()
['mentions', 'id', 'representative']In addition to the mentions that are part of the coreference chain, we get an id for the coreference chain (unique on a per-article-basis), and a reference to the "representative" mention. The representative mention is the one that is deemed to have the fullest realization of the object's name. So in our example, the representative reference would be "President Obama", not "he". This is useful for getting the human-readable name to represent the coreference chain.
We can access all of the mentions or all of the coreference chains, for
a given sentence, using its mentions and references properties.
>>> len(sentence['mentions'])
2
>>> len(sentence['references'])
1One thing to note is that mentions and references aren't necessarily anchored to any named entity (though they often are). For example, consider this sentence:
The police are yet to find any suspects. They say they will continue their search.
Here, "The police", "they" (which occurs twice), and "their" are all part of one coreference chain, yet none is a named entity.
To access only mentions that are named entities, use the entities
property of the sentence.
The document as a whole also provides global mentions, references,
and entities properties which can be iterated over directly..
.. py:class:: AnnotatedText(corenlp_xml, **kwargs) Create a new AnnotatedText object. Only the first parameter is normally needed. The remaining parameters enable adding entity linking data from the AIDA software, controlling the kind of dependency parse used, and filtering the kinds of named entities, coreference chains, and mentions that are included (by default all those provided by CoreNLP are are included). :param str corenlp_xml: An xml string output by CoreNLP. :param str aida_json=None: A JSON string output by AIDA. AIDA is a program that disambiguates named entities, linking them to the YAGO knowledge base. If the JSON output of AIDA is provided, then ``entities``, ``mentions`` and ``references`` entries will be augmented with entity linking information. :param str dependencies='collapsed-ccprocessed': Determines which kind of dependencies will be used in constructing dependency trees. Three options are available: ``'collapsed-ccprocessed'`` (the default), ``'collapsed'``, and ``'basic'``. :param bool exclude_ordinal_NERs=False: Whether to recognize ordinal named entities. If ``True``, named entities of the following types will be ignored: ``'TIME'``, ``'DATE'``, ``'NUMBER'``, ``'DURATION'``, ``'PERCENT'``, ``'SET'``, ``'ORDINAL'``, and ``'MONEY'``. :param bool exclude_long_mentions=False: CoreNLP occaisionally includes mentions, as part of coreference chains, that are very long noun phrases. These mentions can be surprising and are often not useful. Setting this option to ``True`` causes any mentions longer that the value specified by ``long_mention_threshold`` to be discarded (default length is 5 tokens). :param int long_mention_threshold=5: Maximum number of tokens allowed in a coreference chain mention, above which the mention will be ignored if ``exclude_long_mentions`` is ``True``. :param bool exclude_non_ner_coreferences=False: In some cases, it is only desirable to consider those coreference chains that have at least one named entity as a mention. Setting this option to ``True`` will exclude references and their mentions if the reference includes no named entities.