61 changes: 40 additions & 21 deletions README.md
# PolNeAR v.1.0.0 - Political News Attribution Relations Corpus

PolNeAR is a corpus of news articles in which _attributions_ have been
annotated. An attribution occurs when an article cites statements, or
describes the internal state (thoughts, intentions, etc.) of some person or
group. A direct verbatim quote is an example of attribution, as is the paraphrasing of a source's intentions or beliefs.

## Benefits

As of 2018, PolNeAR is the largest attribution dataset by total number of
annotated attribution relations. It is also, based on analysis described in
[1], the most _complete_ attribution corpus, in the sense of having high
See the section entitled "Accompanying software" at the end of this README for
details.

## News Publishers

PolNeAR consists of news articles from 7 US national news publishers \*:

- Huffington Post (`huff-post`)
publisher, and candidate of focus in the article, as follows.

1. **Publisher**: 144 articles were sampled from each publisher.

2. **Time**: 84 articles were sampled uniformly from each of 12 separate month-long
periods between 8-Nov-2015 and 8-Nov-2016.

3. **Focal Candidate**: 504 articles each were sampled from articles
mentioning Trump or Clinton, respectively, a weak majority of the time. A
total of 1008 articles.
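
As a quick consistency check on these counts (a sketch, not part of the corpus
tooling), note that each stratification factor accounts for the same 1008 articles:

    >>> 7 * 144, 12 * 84, 2 * 504   # publishers, month-long periods, focal candidates
    (1008, 1008, 1008)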


## Genre

We endeavored to include only the hard news genre, and to exclude soft news,
and other genres such as editorials, real estate, travel, advice, letters,
obituaries, reviews, essays, etc.
Breitbart.


## Train, Dev, Test splits

PolNeAR is split into training, development, and testing subsets. The analyst
should avoid viewing the dev and test subsets, and should only test a model
architecture once on the test set. The train subset includes all articles from
the first 10 month-long periods of coverage. The dev and test subsets include,
respectively, articles drawn from the 11th and 12th months.


## Statistics

<pre>
==========================================================
&#35; Articles, core dataset | 1008 |
the corpus

## Data File Structure

The PolNeAR data resides under the [`PolNeAR/data`](data) directory. There is one subdirectory
for each _compartment_ of the dataset. There are 5 compartments. Three of the
compartments correspond to the core dataset's train/test/dev subsets. The
other two relate to quality control during annotation. The [`PolNeAR/data`](data) directory
also contains a file called metadata.tsv, which provides a listing of all the
news articles along with metadata, including which annotators have annotated
it.
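
As an illustration, here is a minimal sketch of reading `metadata.tsv` with the
Python standard library (run from the PolNeAR root; it assumes the file has a
header row, and the field names used here are documented under "Article
Metadata" below):

    >>> import csv
    >>> from collections import Counter
    >>> with open('data/metadata.tsv') as f:
    ...     rows = list(csv.DictReader(f, delimiter='\t'))
    >>> # Count articles per compartment (train/dev/test plus the two
    >>> # quality-control compartments)
    >>> Counter(row['compartment'] for row in rows)
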
Expand Down Expand Up @@ -166,49 +171,55 @@ original text files, which should be obtained from the Penn Treebank 2 corpus.
## Preprocessing

## Annotation

The annotation of attributions was performed manually by 6 trained annotators,
who each annotated approximately 168 articles in the core dataset, 4 articles
for assessing training, and 54 articles for comparison to PARC3.

To provide core NLP annotations, such as tokenization, sentence splitting,
part-of-speech tagging, constituency and dependency parsing, named entity
recognition, and coreference resolution, we provide annotations produced automatically by the CoreNLP software in parallel to the manual attribution annotations. See _Automated Annotation by CoreNLP_ below.

## Manual Annotation

### Training

All annotators were trained in two 2-hour periods, in which they reviewed
the guidelines (see [`PolNeAR/annotation-guidelines/guidelines.pdf`](annotation-guidelines/guidelines.pdf)). After each major
section in the guidelines, we conducted a group discussion amongst the
annotators to answer any questions and rectify any misconceptions. Annotators
were provided 2 articles for practice annotation.

Annotators were then provided the templates document
([`PolNeAR/annotation-guidelines/templates.pdf`](annotation-guidelines/templates.pdf)), which was designed to provide quick
reference and examples to guide annotation.

After annotating the practice articles, we discussed the annotations as a
group, using the existing language in the guidelines to resolve disagreements
or misconceptions.

Near the end of the second training session, annotators were shown examples in
[`PolNeAR/annotation-guidelines/guidelines-training-interactive.pdf`](annotation-guidelines/guidelines-training-interactive.pdf), and asked to
describe how they would annotate them. The examples were designed to be
difficult, but to have a correct answer according to the guidelines.

### Training Articles

After training was complete, annotators annotated 4 articles, to measure their initial agreement and verify that training had been successful. These articles provide an indication of agreement level for annotators immediately after the training process.

### Ongoing Monitoring of Annotation Quality

Each annotator annotated approximately 18 articles every week. As a quality
control measure, weekly group meetings were held with all annotators in which
we reviewed two articles that had been annotated by all annotators.
During the meeting, the annotations that each annotator made in the two shared
articles were aligned to clearly show the cases where annotators had agreed or
disagreed on how to perform the annotation. The discussions were conducted to
encourage consensus by appealing to the existing guidelines and, especially,
the templates.

## Automated Annotation by CoreNLP

Automated annotations within directories named "corenlp" were produced by
running the CoreNLP software [2], using the following annotators: `tokenize`,
`ssplit`, `pos`, `lemma`, `ner`, `parse`, and `dcoref`; and with the output
format 'xml' chosen. The following was set in the properties file:
ner.model = 'edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz'
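
For reference, here is a minimal sketch of a properties file matching the
configuration above (the file name `corenlp.properties` and the invocation
below are illustrative, not part of the corpus distribution, and assume the
command is run from the directory of a CoreNLP 3.x distribution):

    annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref
    outputFormat = xml
    ner.model = edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz

which can be run along the lines of:

    $ java -cp "*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
          -props corenlp.properties -file article.txt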

## Annotation Quality

The quality of annotations was assessed using various agreement-based metrics.
Please see the associated paper for results [1].

## Article Metadata

The file [`PolNeAR/data/metadata.tsv`](data/metadata.tsv) lists every article in PolNeAR and provides
several metadata fields containing information about the article itself, and how it was annotated.

### Metadata about the articles

The following fields are, hopefully, self-explanatory:
`filename`, `publisher`, `publication_date`, `author`, and `title`.

The fields `trump_count` and `clinton_count` indicate the number of times
publisher has given credit for a story to another news publisher, or to a
wire service, such as AP or Reuters.
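
As a small illustration, the `trump_count` and `clinton_count` fields can be
used to see which candidate dominates an article (this continues the `rows`
list from the metadata sketch above; treating the more frequently mentioned
candidate as the focal one is a simplification of the sampling criterion
described earlier):

    >>> for row in rows[:5]:
    ...     trump, clinton = int(row['trump_count']), int(row['clinton_count'])
    ...     focus = 'trump' if trump > clinton else 'clinton'
    ...     print(row['filename'], focus)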

### Metadata about annotation

The fields `compartment`, `level`, and `annotators` indicate how the article was annotated. First, `compartment` indicates the compartment into which the article falls:
- `annotator-training` indicates that the articles were used during
training of the annotators, to test their interannotator agreement and
PARC3 approach to annotation.


## Accompanying software

If you are a Python user, the easiest way to work with this dataset is to
install the `polnear` module and import it into your programs.

To install the module into your current environment, navigate to the [`PolNeAR/software`](software) subdirectory and execute:

$ python setup.py install
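
Equivalently, a `pip`-based install from the same directory should also work
(a sketch, assuming a standard setuptools layout):

    $ pip install .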

(**NOTE:** If you manage Python virtual environments via [`conda`](https://docs.continuum.io/anaconda/),
be aware that the `PolNeAR` dependencies will implicitly be installed directly by [`pip`](https://pip.pypa.io/en/stable/installing/).
In this case, to avoid dependency conflicts, it is recommended to localize the `pip` installation by creating the conda environment with its own `pip`, e.g. `conda create --name custom_venv_name pip`.)

Following installation, you can import the dataset at runtime via the Python statement:

from polnear import data

Expand Down Expand Up @@ -363,7 +382,7 @@ an article as a `unicode`:
More interestingly, you can get a representation of the article with
annotations:

>>> annotated_article = article.annotated()

Here, `annotated_article` is an `AnnotatedText` object which is modelled
after the `corenlp_xml_reader.AnnotatedText` object. It
documentation](http://corenlp-xml-reader.readthedocs.io/en/latest/). Here, we
document the access of attribution annotations.

First, let's suppose you want to iterate over the sentences of a document,
and then do something each time you encounter an attribution.

>>> for sentence in annotated_article.sentences:
... for attribution_id in sentence['attributions']:
So in all, there are three ways to access attribution information (see the combined sketch after this list):
2. Starting from a sentence, look to the value of `sentence['attributions']`,
3. Starting from a token, look to the value of `token['attributions']`.
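
For illustration, here is a minimal combined sketch of the sentence- and
token-level access paths (it assumes, as in `corenlp_xml_reader`, that each
sentence exposes its tokens under `sentence['tokens']`):

    >>> for sentence in annotated_article.sentences:
    ...     # Sentence-level view: ids of the attributions touching this sentence
    ...     for attribution_id in sentence['attributions']:
    ...         print(attribution_id)
    ...     # Token-level view: each token records the attributions it belongs to
    ...     for token in sentence['tokens']:
    ...         print(token['attributions'])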

Again, for more information on how to navigate the AnnotatedText object and access other annotations (such as coreference resolution, dependency and constituency parses, etc.), refer to the documentation for [`corenlp_xml_reader.AnnotatedText`](http://corenlp-xml-reader.readthedocs.io/en/latest/).


[1] _An attribution relations corpus for political news_,
91 changes: 91 additions & 0 deletions software/.gitignore
## *** Ignores courtesy of (https://github.com/kennethreitz/samplemod/blob/master/.gitignore) ***

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
.venv/
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject
4 changes: 2 additions & 2 deletions software/setup.py

# What does your project relate to?
keywords= (
'NLP natural language processing computational linguistics ',
'PolNeAR Political News Attribution Relations Corpus'
),

packages=['polnear'],
#indlude_package_data=True,
install_requires=[
'parc-reader==0.1.5', 't4k>=0.6.4', 'corenlp-xml-reader>=0.1.3', 'brat-reader>=0.0.0']
)