simulacrum6 edited this page Mar 12, 2016 · 26 revisions

# Complex Word Identification Project

##### Keywords: NLP, Word Difficulty, Complex Words, DKPro

## About

The aim of this project is to write a tool capable of identifying complex words as defined in SemEval2016 Task 11.

Complex Word Identification consists in determining which words in a given sentence can challenge the readers of a certain target audience. The goal of the proposed Complex Word Identification shared task is to provide a framework for the evaluation of methods for this first step in a Lexical Simplification pipeline.

In order to achieve this, potential measures (see the concept section for more details on those) that could influence word difficulty were generated from intuition. These measures were then tested using a tool developed with the aid of different frameworks from the DKPro family.


## General Approach

The following conjectures were used as a foundation for the project:

  • Familiarity: If people are unfamiliar with a word, they might rate it as more complex than a word they know. Familiarity was estimated by a word's frequency of usage. Frequency counts were mainly retrieved from a list of the 5000 most common words in the English language src.

  • Word Length: The longer a word is, the more likely it is to be complex. Word length was measured in characters per word. (Syllables per word were also considered, but due to the inconsistent results of the Syllable Count Annotator, syllables were disregarded later on.)

  • Etymological Background: Words with their roots in other language families were suspected to be higher in complexity (e.g. apotheosis, Greek). This was tested indirectly by using character n-grams. If etymology is a determining factor for word difficulty, n-grams exclusive to other language families should be good predictors of word difficulty.

  • Compound Words: Compound words could potentially be more complex than non-compound words. Unfortunately, this aspect was not tested in this project.

  • Ambiguity: The more meanings a word has, the more likely it is to challenge readers. Unfortunately, this aspect was not tested in this project.

  • Context: The words surrounding a given word can alter its meaning and might contribute to the difficulty of the word. Unfortunately, this aspect was not tested in this project.

  • Affixes: Certain pre- and suffixes occur only in certain fields (e.g. medicine) and might challenge a reader. This aspect was examined by looking at character n-grams for each word.
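The measures that were actually tested (word length, frequency-based familiarity, and character n-grams) can be sketched in a few lines. This is a minimal plain-Python illustration, not the project's actual DKPro code; the `FREQUENT_WORDS` set is a hypothetical stand-in for the 5000-word frequency list.

```python
# Stand-in for the list of the 5000 most common English words.
FREQUENT_WORDS = {"the", "know", "people", "word", "language"}

def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word (affix/etymology signal)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def extract_features(word):
    """Map a word to the simple feature dict used in this sketch."""
    return {
        "length": len(word),                          # characters per word
        "familiar": word.lower() in FREQUENT_WORDS,   # frequency-based familiarity
        "ngrams": sorted(char_ngrams(word)),          # character trigrams
    }

print(extract_features("apotheosis")["length"])  # 10
```

A classifier would then be trained on vectors built from such features, with n-grams like "sis" or "osi" acting as indirect evidence of a word's etymological background.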

These assumptions were generated mostly by using intuition and are not founded in empirical findings or language theory.


## Discussion

### Future Improvements

There are two main areas that could be improved significantly in the future, chief amongst them the project's theoretical foundation and the (self-written) code and methods.

#### Theoretical Foundation

The theoretical foundation for this project was rather weak, as the author has no background in linguistics and only a little literature was reviewed in preparation. Relevant literature regarding word complexity and language acquisition (particularly second language acquisition) would certainly have aided the process. Furthermore, common approaches in computational linguistics, and the pros and cons of different machine learning algorithms, should have been researched more thoroughly.

#### Code

##### Extracted Features

Not all of the original measures were scrutinised during the project. Ambiguity, etymology, compound words and context in particular could be added to the list of features taken into account by the project.

##### Machine Learning Algorithms

All evaluation algorithms used in the project are probabilistic. Alternatively, regression models and Support Vector Machines could be considered to estimate word complexity.
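The regression alternative mentioned above can be sketched as a simple ordinary least squares fit of a complexity score from a single word feature. This is a minimal pure-Python illustration with fabricated toy data, not the project's setup; a real experiment would fit on the SemEval training annotations and use more than one feature.

```python
def fit_ols(xs, ys):
    """Fit y = a*x + b by ordinary least squares on 1-D data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Toy data: word length vs. a fabricated complexity score in [0, 1].
lengths = [3, 5, 8, 10, 12]
scores = [0.1, 0.2, 0.5, 0.7, 0.9]

a, b = fit_ols(lengths, scores)
predict = lambda length: a * length + b
print(round(predict(11), 2))
```

Unlike the probabilistic classifiers used here, such a model outputs a graded complexity score, which could then be thresholded to produce the binary complex/non-complex decision the task requires.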

##### Annotating & Feature Extraction

In its current state, all tokens from the dataset are annotated by all annotators in the preprocessing pipeline. Afterwards, feature extractors select the single annotations pertaining to the corresponding classification unit and use the values of these annotations to create new features. Since the annotated data is directly converted into a feature, time could be saved by embedding the annotation data extraction in the feature extractor class, instead of creating these annotations beforehand. In this way only the relevant tokens are annotated.

_(The decision to use DKPro TC was made halfway into the project, when some custom annotators had already been written, which is why the current (suboptimal) approach was implemented. In future projects this should be avoided.)_
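The two designs described above can be contrasted in a short, hypothetical sketch (plain Python, not the DKPro/DKPro TC API): annotating every token up front versus computing the annotation lazily inside the feature extractor, so that only the token the extractor actually needs is processed.

```python
def annotate(token):
    """Stand-in for an expensive annotator (e.g. a syllable counter)."""
    return {"token": token, "length": len(token)}

# Current approach: every token is annotated, then the extractor picks one out.
def features_eager(tokens, target_index):
    annotations = [annotate(t) for t in tokens]  # annotates ALL tokens
    return {"length": annotations[target_index]["length"]}

# Suggested approach: the extractor annotates only its target token.
def features_lazy(tokens, target_index):
    annotation = annotate(tokens[target_index])  # annotates ONE token
    return {"length": annotation["length"]}

tokens = ["the", "apotheosis", "of", "simplicity"]
assert features_eager(tokens, 1) == features_lazy(tokens, 1)
```

Both variants produce identical features, but the lazy variant does work proportional to the number of classification units rather than the number of tokens in the dataset.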

### Outlook

The results show that this project is nothing more than a first step in the right direction. There is still a long way to go in complex word identification. The main merit of this project for the author is a more advanced understanding of, and a heightened interest in, computational linguistics. This project can be used as a basis for future experiments in the field of Natural Language Processing.

back to navigation


Contact: marius.hamacher@stud.uni-due.de
