SemEval 2017 – Task 5: Competition on Fine-Grained Sentiment Analysis

Following on from our previous SemEval post, 'SSIX is organising a Shared Task in the International Workshop on Semantic Evaluation (SemEval) 2017', a group of researchers involved in the SSIX project organised SemEval 2017 Task 5, which addressed fine-grained sentiment analysis on microblogs (subtask 1) and on financial headlines (subtask 2).
Now that the SemEval 2017 competition has finished, this post provides a more detailed summary of the data, evaluation methods and results.
The task was motivated by the high interest in sentiment analysis related to finance and aimed at catalysing discussions around approaches to the semantic interpretation of financial texts. By fostering research on financial sentiment analysis through a concrete task, we intended to support the development of new state-of-the-art approaches. The goal of our task was to assign a sentiment score between -1 (very negative) and +1 (very positive), with 0 denoting neutral sentiment, to given entities (companies or cashtags).

What was provided?

We provided training, test and evaluation datasets created from Twitter and StockTwits for subtask 1, and headlines from various sources on the Internet for subtask 2. Overall, the training datasets comprised 1,700 tweets for subtask 1 and 1,142 headlines for subtask 2; the two evaluation datasets contained 794 tweets and 491 headlines. The participants' results were evaluated using the cosine similarity between the gold-standard and the predicted sentiment scores.

Additionally, the submitted predictions were weighted by coverage, so that teams could not boost their scores by submitting predictions only for the instances they were most confident about.
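A simple way to realise such a weighting is to scale the cosine score by the fraction of test instances a team actually answered. The sketch below assumes this multiplicative scheme; the function name and numbers are illustrative.

```python
def weighted_score(cosine, n_submitted, n_total):
    """Scale a cosine similarity score by the fraction of instances predicted."""
    return cosine * (n_submitted / n_total)

# A team with a high cosine score that answered only half of the
# 794 test tweets has its final score halved:
print(weighted_score(0.9, 397, 794))  # 0.45
```

Under this scheme, answering every instance with moderate accuracy can outscore answering a cherry-picked subset very accurately.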

Our task received a lot of attention, with 32 teams participating across subtasks 1 and 2. Teams chose various approaches in pursuit of the best results. The majority used hybrid methodologies combining Machine Learning (ML) or Deep Learning (DL) with lexicons or ontologies; pure ML approaches were used in five cases, and pure DL by only three teams.
The cosine similarity scores ranged from 0.003 to 0.778 in subtask 1 and from 0.016 to 0.745 in subtask 2. Looking at the results, it is clear that powerful prediction systems have been built.

The full paper is available from


This blog post was written by SSIX partner NUIG.
For the latest updates, like us on Facebook, follow us on Twitter and join us on LinkedIn.