GATE 2017 Summer School

GATE (General Architecture for Text Engineering) is a Natural Language Processing (NLP) framework written in Java. Training courses on GATE are held annually at the end of June and are organised by the University of Sheffield, where GATE was originally developed. GATE also provides a defined and repeatable process for creating robust and maintainable text processing workflows. Its graphical user interface, GATE Developer, is used to build pipelines from different Language Resources and Processing Resources, and is a popular choice for building NLP applications. GATE is also available as a library, GATE Embedded, which can be used inside other software applications.

The 2017 GATE training course was a structured, one-week course. The trainers covered most of the topics and tools used in GATE. Each training session paired a theoretical explanation with a practical exercise. All the course materials and supporting documents can be found on the GATE training course webpage.

GATE Developer mainly consists of the following resources:

  1. Applications: An application holds a group of Processing Resources that can be run on a corpus.
  2. Language Resources: These hold the documents or collections of documents (corpora).
  3. Processing Resources: These are the annotation plugins that operate over the documents.
  4. Datastores: Specialized files where documents are stored for future use.

Figure 1: GATE UI showing available resources.
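
To make the relationship between these resources concrete, here is a minimal sketch in plain Python. It is not GATE's API: the `Document` and `Application` classes and the `length_tagger` function are hypothetical names chosen for illustration, and Datastores are left out. It only shows the idea that an application is an ordered group of Processing Resources run over every document in a corpus.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Stands in for a Language Resource: raw text plus annotations added later."""
    text: str
    annotations: List[dict] = field(default_factory=list)


# A Processing Resource is modelled here as any callable that annotates a document in place.
ProcessingResource = Callable[[Document], None]


@dataclass
class Application:
    """Stands in for an application: an ordered group of Processing Resources."""
    resources: List[ProcessingResource]

    def run(self, corpus: List[Document]) -> None:
        # Run every Processing Resource, in order, over every document in the corpus.
        for doc in corpus:
            for resource in self.resources:
                resource(doc)


def length_tagger(doc: Document) -> None:
    """Hypothetical Processing Resource: adds one annotation spanning the whole text."""
    doc.annotations.append({"type": "Length", "start": 0, "end": len(doc.text)})


corpus = [Document("GATE is developed in Sheffield."), Document("Hello world")]
Application([length_tagger]).run(corpus)
```

After the run, each document in `corpus` carries the annotations produced by the pipeline, which mirrors how annotations accumulate on documents as an application runs.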

The following are some of the ready-made applications provided by default:

  1. ANNIE (A Nearly New Information Extraction system): A ready-made application consisting of several Processing Resources that together perform information extraction on unstructured text.
  2. Noun Phrase Chunker: Annotates the noun phrases in each sentence.
  3. TwitIE: Processes microblog text such as tweets.

The following are some of the most commonly used Processing Resources:

  1. Document Reset: Returns the document to its original state by removing all annotation sets except the one containing the document format analysis.
  2. Sentence Splitter: Segments the text into sentences. It uses a gazetteer list together with rule-based methods to find sentence endings.
  3. Tokenizer: Splits the text into very simple tokens such as punctuation, numbers and words.
  4. Gazetteer: Locates entity names in the text based on the given lists.
  5. POS tagger: Produces a part-of-speech tag as an annotation (or as a feature of an existing annotation) on each word or symbol.
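
The Processing Resources above can be approximated with a toy sketch in plain Python. This is an illustration of the ideas only, not how GATE implements them and not GATE's API: the function names, the regular expressions, the abbreviation list and the gazetteer entries are all assumptions made for this example. Each function returns annotations as dictionaries with character offsets, loosely mirroring the way annotations refer back to spans of the document text.

```python
import re


def tokenize(text):
    """Split text into simple word, number and punctuation tokens with offsets."""
    tokens = []
    for m in re.finditer(r"\d+|[^\W\d_]+|[^\w\s]", text):
        s = m.group()
        kind = "number" if s.isdigit() else "word" if s.isalpha() else "punctuation"
        tokens.append({"type": "Token", "kind": kind, "start": m.start(), "end": m.end()})
    return tokens


def split_sentences(text, abbreviations=("Dr.", "Mr.", "Prof.")):
    """Rule-based splitter: end a sentence at . ! ? unless the word is a listed abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        end = m.end()
        last_word = text[start:end].split()[-1]
        if last_word in abbreviations:
            continue  # the full stop belongs to an abbreviation, not a sentence ending
        sentences.append({"type": "Sentence", "start": start, "end": end})
        start = end
        while start < len(text) and text[start].isspace():
            start += 1  # the next sentence begins at the first non-space character
    if start < len(text):
        sentences.append({"type": "Sentence", "start": start, "end": len(text)})
    return sentences


def gazetteer(text, entries):
    """Emit a Lookup annotation for every occurrence of every list entry."""
    lookups = []
    for major_type, names in entries.items():
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                lookups.append({"type": "Lookup", "majorType": major_type,
                                "start": m.start(), "end": m.end()})
    return sorted(lookups, key=lambda a: a["start"])


def reset(annotation_sets, keep=("Original markups",)):
    """Drop every annotation set except those holding the document format analysis."""
    return {name: anns for name, anns in annotation_sets.items() if name in keep}


text = "Dr. Smith visited Sheffield in 2017. He met Prof. Jones."
tokens = tokenize(text)
sentences = split_sentences(text)
lookups = gazetteer(text, {"location": ["Sheffield"], "title": ["Dr", "Prof"]})
annotation_sets = {"Original markups": [], "": tokens + sentences + lookups}
cleaned = reset(annotation_sets)
```

Note how the abbreviation list keeps the splitter from ending a sentence after "Dr." or "Prof.", which is the kind of case where a list and rules work together.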

GATE also supports machine learning. The Batch Learning and Machine Learning PRs are plugins for machine learning tasks. A newer plugin, the Learning Framework, has recently been introduced in GATE; it integrates many libraries, including Mallet’s CRF and some deep learning algorithms.

 

This blog post was written by SSIX partner NUIG.