Learning Framework

The Learning Framework is the most recent machine learning plugin in GATE. It is still a work in progress, but it is stable enough to use. The Learning Framework supports up-to-date machine learning algorithms such as:

  • Mallet classification algorithms
  • Mallet CRF algorithm
  • LibSVM

Additionally, it provides wrappers for other machine learning tools and libraries such as WEKA, Scikit-learn, cost-sensitive classification and Keras.

The Learning Framework supports a range of tasks, including:

  • Classification – assigns a target class to each annotated instance. One example is sentiment analysis, where each annotated sentence is classified as having positive or negative sentiment.
  • Regression – assigns a numerical value. This is similar to classification, except that numerical targets rather than class labels are assigned to the annotated instances.
  • Sequence Tagging – also known as chunking. A model is trained on annotated token sequences and then used to find and annotate the corresponding spans in unseen text.
  • Export of the training data in different formats such as CSV or ARFF, which can then be used in WEKA or other tools.
  • Evaluation tasks to estimate the performance of classification, regression and chunking models.

The Learning Framework is distributed as the “gateplugin-LearningFramework” plugin and contains the following PRs (a minimal loading sketch follows the list):

  • LF_TrainClassification for training a classification model.
  • LF_ApplyClassification to apply a trained classification model.
  • LF_TrainRegression for training a regression model.
  • LF_ApplyRegression to apply a trained regression model.
  • LF_TrainChunking for training a model for sequence tagging/chunking.
  • LF_ApplyChunking to apply a trained model for sequence tagging/chunking.
  • LF_Export to export a training set to an external file.
  • LF_EvaluateClassification to estimate classification quality.
  • LF_EvaluateRegression to estimate regression quality.
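
These PRs are normally added to an application in GATE Developer, but the plugin can also be loaded and used from GATE Embedded. Below is a minimal sketch assuming a recent GATE release with Maven-based plugin loading; the Maven coordinates, the plugin version and the fully qualified PR class name are assumptions that should be checked against the plugin documentation:

    import gate.Factory;
    import gate.FeatureMap;
    import gate.Gate;
    import gate.ProcessingResource;
    import gate.creole.Plugin;

    public class LoadLearningFramework {
        public static void main(String[] args) throws Exception {
            Gate.init();

            // Register the LearningFramework plugin from Maven
            // (group, artifact and version are assumptions - verify against the wiki).
            Gate.getCreoleRegister().registerPlugin(
                    new Plugin.Maven("uk.ac.gate.plugins", "learningframework", "4.1"));

            // Create one of the plugin's PRs, e.g. the classification trainer.
            // The fully qualified class name is an assumption.
            FeatureMap params = Factory.newFeatureMap();
            ProcessingResource trainPr = (ProcessingResource) Factory.createResource(
                    "gate.plugin.learningframework.LF_TrainClassification", params);

            System.out.println("Created PR: " + trainPr.getName());
        }
    }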

To begin, choose one of the tasks mentioned above (classification, chunking, etc.) to carry out with the Learning Framework. The corpus should contain a large volume of training data in order to obtain good results. The training data must carry manually annotated information, known as the ‘gold standard’. These annotated instances are what the machine learning model learns from before it predicts on unseen data.
For classification tasks, the instance annotations cover a span of tokens or words, depending on the problem definition. For example, when classifying sentences by language, each sentence is one instance and its target feature holds the respective language. For chunking, the task is to mark spans of text with the target annotation type, for example annotating place names with the type ‘Location’. Additional features can be annotated on the text to improve performance; for finding person names, for instance, token-level annotations are useful. It is also possible to write JAPE (Java Annotation Patterns Engine) rules to create such annotations, as in the sketch below.
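
As a brief illustration of the JAPE option, the rule below is a minimal sketch (the Lookup annotations and the ‘location’ majorType are assumed to come from a gazetteer earlier in the pipeline) that creates Location annotations which can then serve as training instances or as extra features:

    Phase: LocationMarkup
    Input: Lookup
    Options: control = appelt

    Rule: MarkLocation
    (
      {Lookup.majorType == "location"}
    ):loc
    -->
    :loc.Location = {rule = "MarkLocation"}
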
Next, a feature specification file needs to be prepared, describing which features to use for learning. For example, when finding person names, bigrams are likely to be useful because most names consist of a first name and a last name; when finding occupation names, bigrams and trigrams such as ‘database administrator’ or ‘senior application developer’ can be useful. The appropriate training PR can then be run; it requires runtime parameters such as the path to the feature specification file, the directory in which to save the model files, the machine learning algorithm to use (SVM, CRF, etc.), algorithm parameters, and the instance and class annotations to learn from.
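
The following is a minimal sketch of such a feature specification file, modelled on the format described in the plugin wiki; the exact element names and the Token features used here should be verified against that documentation:

    <ML-CONFIG>
      <!-- the string of the instance token itself -->
      <ATTRIBUTE>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <DATATYPE>nominal</DATATYPE>
      </ATTRIBUTE>
      <!-- strings of the two tokens before and after the instance -->
      <ATTRIBUTELIST>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
        <DATATYPE>nominal</DATATYPE>
        <FROM>-2</FROM>
        <TO>2</TO>
      </ATTRIBUTELIST>
      <!-- token bigrams, useful e.g. for person or occupation names -->
      <NGRAM>
        <NUMBER>2</NUMBER>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
      </NGRAM>
    </ML-CONFIG>
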
After training the model, the corresponding application PR can be run, with the path to the saved model directory given as a runtime parameter. Finally, an evaluation PR can be used to estimate the accuracy of the model, which can then be improved by changing the algorithm parameters or by trying different features. Evaluation can also be carried out externally by exporting the training data in a format compatible with other tools, for example ARFF files that can be loaded into WEKA.
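
For reference, an exported ARFF training set has roughly the shape shown below; this is a generic WEKA ARFF sketch with invented attribute names, not the exact output of LF_Export:

    @relation sentiment-training

    @attribute ngram_good numeric
    @attribute ngram_poor numeric
    @attribute class {positive,negative}

    @data
    1,0,positive
    0,2,negative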

Recommended Reading:
https://github.com/GateNLP/gateplugin-LearningFramework/wiki