Case Study – Brexit Visualisation

The SSIX (Social Sentiment Financial Indexes) project is a European Innovation Project sponsored by the European Commission under the Horizon 2020 framework. SSIX aims to provide European SMEs with a collection of easy-to-interpret tools to analyse and understand social media sentiment for any given topic, regardless of locale or language (Davis et al. 2016). The United Kingdom recently held a referendum on European Union membership (commonly referred to as “Brexit”).

Political events such as Brexit receive great attention and spark discussions on social media. Much of the content produced on social media is highly emotive and expresses strong political opinions (Barbera 2015). Therefore, Brexit was selected as the initial real-world test case for validating the SSIX methodology and platform.

In this case study, we focus only on the data visualisation side and how we applied it to Brexit. We first present the tools used and the requirements, then explain how we approached the implementation of the database and the visualisations. Finally, we evaluate the work done and give feedback and recommendations. Two important points have to be considered before starting:

  • Data were collected from Twitter using the official Streaming API. Incoming data were filtered based on a combination of different rules applied to the Twitter metadata (e.g. user language and number of followers).
  • A Deep Learning classifier was trained on a corpus of 2,000 tweets manually annotated for sentiment. Various setups were experimented with, and the best accuracy (69%) was achieved using two sentiment classes (“stay” and “leave”) and a Long Short-Term Memory (LSTM) neural network followed by a dense layer.

Methodology

Tools

ElasticSearch: For storing the data (version 2.2.3)

Elasticsearch is an open source distributed search and analytics engine based on Apache Lucene. Organizations around the world, from start-ups to governments to large corporations, use Elasticsearch to gain real-time insights from large volumes of data, and it is one of the most popular projects on GitHub. Accessible through an HTTP web interface, Elasticsearch enables users to store, search and analyse data in near real time. More explanations are given in the report.

Kibana: For visualising the data (version 4.5.1)

Kibana is an open source data visualization plugin for Elasticsearch. It provides visualization capabilities on top of the content indexed on an Elasticsearch cluster. Users can create bar, line and scatter plots, or pie charts and maps on top of large volumes of data.

 

Requirements

The SSIX Dashboard has to be a web-based application that functions as the main user interface and administrative control panel for the SSIX System. It is intended to be a high-level user interface for system configuration, data analysis and visualisation. The dashboard has to cover the configuration of system parameters, basic data analysis, data visualisations, system monitoring and export of data. The visualisation has to satisfy the following points:

  • A Graphical User Interface (GUI), including widgets and dashboards, that incorporates correct colours and layout for effective use and clear communication;
  • The desired information is correctly presented;
  • Correct information units are used and are visible;
  • The information update intervals are clearly conveyed and are followed in display;
  • The needed filters are available and work accordingly;
  • The GUI correctly records inputs (mouse clicks and keyboard inputs).

For the Brexit visualisation, the main requirement was to show volume and sentiment as efficiently as possible, presenting the analytics visually and providing insights into political opinion mining.

 

Building the database

Designing the data structure was an important part of the work, because visualisation techniques are highly dependent on how the data are structured and defined. Many tests had to be done in order to find the most effective data structure.

Sentiment Database

In Elasticsearch, an index is like a “database” in a relational database. It has a mapping which defines multiple types.

At the beginning, the idea was to generate a single index containing the timestamp, the sentiment and the strength of the sentiment.

In that case the sentiment can be UK “leave” or “stay”, and the strength can be a value between 1 and 5. People can have the same sentiment, for example UK “leave”, but with a different strength: if one person wants the UK to leave at 100% and another wants the UK to leave at 40%, we attribute a strength of 5 to the first and 2 to the second.

This is an example of the first document structure idea:

{"timestamp":"2016-05-04T18:44:27+0000","user_id":"732727645","text":"No.question? https://t.co/4mxNbl3X5e","sentiment":"stay","strength":2}

This structure limited what we could do to visualise sentiment. The first idea was a pie chart combining each sentiment with its respective strength.

Figure 1: Pie chart/strength

 

The pie chart shows the percentage of “leave”, “stay” and “undecided” sentiment. Each sentiment slice is broken down by its strength. As you can see, combining sentiment with strength produces too many angles, too many parts, too many colours and too much information for the brain to understand the opinion easily. Moreover, the volatility of the sentiment cannot be shown with this visualisation (in finance, volatility is the degree of variation of a trading price series over time, as measured by the standard deviation of returns; Wikipedia).

This limitation led to the idea of assigning a single value to each combination of sentiment and strength. We assigned the value 1 to the “stay” sentiment, 0 to “undecided” and -1 to “leave”. In addition, we divided each strength value by 5 to obtain a final value between -1 and 1. For the “stay” sentiment, according to its strength (1, 2, 3, 4 or 5), the value can be 0.2, 0.4, 0.6, 0.8 or 1. The same applies to the “leave” sentiment, where the value can be -0.2, -0.4, -0.6, -0.8 or -1. “Undecided” always takes the value 0. Once these values (from -1 to 1) were generated, new visualisations became possible.
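The mapping from sentiment and strength to a single score can be sketched as follows. This is a minimal illustration only: the function and constant names are hypothetical and not taken from the SSIX code base.

```python
# Hypothetical helper illustrating the scoring rule described in the text.
# Names are illustrative assumptions, not the SSIX implementation.
POLARITY_SIGN = {"stay": 1, "undecided": 0, "leave": -1}


def sentiment_score(sentiment: str, strength: int) -> float:
    """Map a sentiment label and a strength (1 to 5) to a score in [-1, 1]."""
    if sentiment not in POLARITY_SIGN:
        raise ValueError(f"unknown sentiment: {sentiment!r}")
    if not 1 <= strength <= 5:
        raise ValueError("strength must be between 1 and 5")
    # "undecided" has sign 0, so it maps to 0 whatever its strength.
    return POLARITY_SIGN[sentiment] * strength / 5


if __name__ == "__main__":
    for label, strength in [("stay", 2), ("leave", 5), ("undecided", 3)]:
        print(label, strength, sentiment_score(label, strength))
```

Folding both attributes into one signed number is what makes line charts and averages over time possible, at the cost of no longer distinguishing a weak “stay” from a mix of strong “stay” and “leave” once values are averaged.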

In parallel, we created an index in Elasticsearch called “sentiment” containing the sentiment score and the volume. Unfortunately, Kibana was not able to classify a value as positive or negative, so we created a second index, “polarity”, which classifies sentiment into two categories: “leave” or “stay”. The “undecided” sentiment was removed because the accuracy using two sentiment classes (“stay” and “leave”) was better (69%). We could have added the sentiment as a string attribute to the first index, but that would have been messy and we may not need it in the future. The second index was created solely to work around a limitation of Kibana.

Generate fake data for testing

We generated sample data to load into a test Elasticsearch instance and then visualise with Kibana. Visualising fake data was useful because it showed what we could and could not visualise with a specific data set. Testing with fake data helped in catching errors and in learning more about Elasticsearch and Kibana, and exposed problems whose resolution led to a better data structure design.
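As an illustration of this step, the sketch below generates random data points with a timestamp, a sentiment score and a volume. It is not the generator used in the project: the function name, the value ranges (for example a volume between 1 and 60) and the use of a float score are assumptions made for the example.

```python
import json
import random
from datetime import datetime, timedelta, timezone


def generate_fake_documents(n, start, seed=0):
    """Return n fake data points, one per minute, starting at `start`.

    Value ranges are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)  # seeded so that test runs are repeatable
    docs = []
    for i in range(n):
        ts = start + timedelta(minutes=i)
        docs.append({
            "timestamp": int(ts.timestamp()),               # UNIX timestamp (seconds)
            "sentiment": round(rng.uniform(-1.0, 1.0), 3),  # score in [-1, 1]
            "volume": rng.randint(1, 60),                   # tweets in that minute
        })
    return docs


if __name__ == "__main__":
    start = datetime(2016, 5, 4, 18, 44, tzinfo=timezone.utc)
    for doc in generate_fake_documents(3, start):
        print(json.dumps(doc))
```

Each printed line is one JSON document, which can then be loaded into a test index by whatever indexing method is preferred.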

 

Final indexes for Brexit

Two indexes were created in Elasticsearch to store the sentiment data points for the various temporal stratifications. Indexes were created according to the time division, currently proposed to be 1 minute, 5 minutes, 10 minutes, hourly and daily. A data point has the following mapping schema in the Elasticsearch indexes:

Sentiment index:

"properties": {
 "timestamp" : { "type": "long" },     // UNIX timestamp
 "sentiment" : { "type": "integer" },  // Fixed-point value between -1.000 and 1.000 representing sentiment
 "volume" : { "type": "integer" }      // Integer indicating the volume of tweets (per-minute average)
}

 

Polarity index:

"properties": {
 "timestamp" : { "type": "long" },    // UNIX timestamp
 "sentiment" : { "type": "string" }   // Sentiment can be leave or stay
}
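For illustration, the two property sets can be wrapped into index-creation request bodies as in the sketch below. The type name `datapoint` and the helper name are hypothetical; the field names and types are copied from the two mappings in this section, and the `mappings` / type / `properties` nesting follows the Elasticsearch 2.x index-creation format.

```python
import json

# Field definitions copied from the two mappings in this section.
SENTIMENT_PROPERTIES = {
    "timestamp": {"type": "long"},     # UNIX timestamp
    "sentiment": {"type": "integer"},  # fixed-point sentiment score
    "volume": {"type": "integer"},     # volume of tweets
}
POLARITY_PROPERTIES = {
    "timestamp": {"type": "long"},     # UNIX timestamp
    "sentiment": {"type": "string"},   # "leave" or "stay"
}


def create_index_body(type_name, properties):
    """Build the JSON body of an index-creation request (Elasticsearch 2.x layout)."""
    return {"mappings": {type_name: {"properties": properties}}}


if __name__ == "__main__":
    # "datapoint" is a hypothetical type name used only for this example.
    for index, props in [("sentiment", SENTIMENT_PROPERTIES),
                         ("polarity", POLARITY_PROPERTIES)]:
        print(index, json.dumps(create_index_body("datapoint", props)))
```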

 

For each document in the sentiment index with a volume of n, n documents are created in the polarity index. For example, if we have the document {"timestamp":"2016-05-04T18:44:27+0000","sentiment":0.334,"volume":3} in the sentiment index, we will have three documents in the polarity index, because its volume value is 3 (if the volume value were 60, we would have 60 documents in the polarity index):

{"timestamp":"2016-05-04T18:44:27+0000","sentiment":"leave"}
{"timestamp":"2016-05-04T18:44:27+0000","sentiment":"stay"}
{"timestamp":"2016-05-04T18:44:27+0000","sentiment":"stay"}
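The relation between the two indexes can be sketched as follows: one polarity document is produced per tweet, so their number equals the volume of the corresponding sentiment document. The sketch assumes that the polarity label is derived from the sign of each tweet's score (positive for “stay”, otherwise “leave”); that rule, the function name and the example scores are assumptions, not a description of the SSIX pipeline.

```python
def polarity_documents(timestamp, tweet_scores):
    """Return one polarity document per tweet score for a given data point.

    Assumption: a positive score means "stay", any other score means "leave"
    (the "undecided" class is not used in the final two-class setup).
    """
    return [
        {"timestamp": timestamp, "sentiment": "stay" if score > 0 else "leave"}
        for score in tweet_scores
    ]


if __name__ == "__main__":
    # Three hypothetical tweet scores falling into the same data point.
    docs = polarity_documents("2016-05-04T18:44:27+0000", [-0.2, 0.6, 0.6])
    for doc in docs:
        print(doc)
    print("volume:", len(docs))
```

Counting the polarity documents per label then gives Kibana the “leave” and “stay” categories it could not derive from the numeric score alone.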

 

Visualisations/Dashboard

Discovering

The first step was to discover what the visualisation tool can do and what its limits are, so we tried different kinds of visualisation with the data we had generated. Some of the first visualisations are shown in the dashboard in figure 2. We kept exploring each of these techniques (pie chart, bar chart, area chart and line chart) until we found an appropriate and efficient way to visualise the data.

 Figure 2: Early dashboard

 

Select an appropriate visualisation

The first thing to consider is what we are trying to achieve. The visualisation has to be built around the answer to this question; otherwise the analysis can quickly get out of control. According to the requirements, volume and sentiment are the main factors to visualise for sentiment analysis.

Volume charts provide a better understanding of the change in social interest through a visual description. Volume is generally represented by a bar chart, which can have 3 main modes:

  • Stacked bar chart: A stacked bar chart is similar to a grouped bar chart, except that the bars representing the sub-classes are placed on top of each other to make a single bar (figure 3). The length of the bar shows the total volume in a period of time. Different colours are used to represent the relative contribution of each class.
  • Grouped bar chart: Grouped bar charts are a technique for visualising information about different sub-classes of sentiment. In figure 4, a grouped bar chart is used to describe the different sub-groups. A separate bar represents each of the sub-classes (leave and stay), and the bars are usually presented in different colours to distinguish between them. Most of the time, a legend is provided to indicate which sub-class each colour represents.

  • Percentage bar chart: Percentage bar charts are a kind of stacked bar chart.

  • Line chart: Volume can also be represented by a line chart, as you can see in figure 5. A line chart is helpful for showing specific values of data over a time period and for showing trends in data clearly, enabling the viewer to make predictions about data not yet recorded.

  • Area chart: An area chart is generally used to show the percentage contribution of each sub-class to the volume, with every time period drawn at the same total size. Most of the time, this kind of information is presented as a series of pie charts.

         

Figure 3: Stacked bar chart                              

Figure 4: Grouped bar chart

Figure 5: Line chart volume

Figure 6: Area chart volume

 

The visualisation techniques above are all useful for showing volume; however, their use depends on the nature of the data to be presented and on the specific goal of the visualisation. In this context, the area chart (figure 6) and the line chart are almost equivalent; the choice between them in a dashboard depends on how well they perceptually suit the other visualisation techniques.

 

The sentiment score is based on tweet message structure and polarity and is generated for each targeted entity (e.g. a stock). It represents people's behaviour and opinion. In our case, each sentiment score is the average, over one minute, of the individual tweets' sentiment scores. The sentiment score is generally represented by a line chart (called a trend chart in sentiment visualisation), which is usually considered a central component, based on the selected data sources and time period. Figure 7 is an example of a trend chart we used for the Brexit visualisation.
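The one-minute averaging can be sketched as follows. This is a minimal illustration under stated assumptions: tweets are given as pairs of a UNIX timestamp in seconds and a score in [-1, 1], and the function name and the rounding to three decimals are hypothetical choices rather than the SSIX implementation.

```python
from collections import defaultdict


def aggregate_per_minute(tweets):
    """Average tweet scores per one-minute bucket.

    `tweets` is an iterable of (unix_timestamp_seconds, score) pairs.
    Returns one data point per minute, sorted by time, with the average
    score and the number of tweets (volume) in that minute.
    """
    buckets = defaultdict(list)
    for ts, score in tweets:
        minute_start = ts - ts % 60  # floor the timestamp to the minute
        buckets[minute_start].append(score)
    return [
        {
            "timestamp": minute,
            "sentiment": round(sum(scores) / len(scores), 3),
            "volume": len(scores),
        }
        for minute, scores in sorted(buckets.items())
    ]


if __name__ == "__main__":
    sample = [(120, 1.0), (130, -1.0), (185, 0.4)]
    for point in aggregate_per_minute(sample):
        print(point)
```

The same bucketing with a width of 300 or 600 seconds would give the 5-minute and 10-minute divisions mentioned earlier.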

Figure 7: Trend chart

Final Dashboard:

Generally, people want to display many different data in a dashboard, and it very quickly becomes hard to read and full of meaningless information; as a consequence the user is overloaded and the dashboard is no longer useful. In our case we propose to display three visualisations to meet the requirements: a line chart for the sentiment score, a bar chart for the volume and an area chart for the volume as a percentage. We chose the stacked bar chart over the other kinds for perceptual reasons, given the other visualisations in the dashboard. The interface presents three views (figure 8), so we coordinated them in a synchronised way: all the views share the same time axis, and when you zoom into a specific time interval in one view, the same zoom is applied to the other views. This technique allows the user to easily grasp the link between the different views and locate the common items. So for each data point, we can know the sentiment score, the number of tweets (volume) and the percentage interest.

 

Figure 8: Final Dashboard

Another limitation of Kibana is that it does not let you set up access control: by default, a Kibana dashboard is public. If you want to set up permission levels for users, you need to purchase a licence for Shield, which is a component of Elasticsearch. To get around this problem, the idea was to create a web site hosting the different visualisations, letting users interact with them by choosing the time interval they want (figure 9).

Figure 9: Brexit Dashboard

 

Reflection

This Brexit case study was selected as the initial real-world test case for validating the SSIX visualisation. The work done looks very amateurish but is good enough for Brexit: the dashboard satisfied all the requirements, showing the volume and the opinion trend efficiently. On the one hand, we will need dynamic charting capabilities for the next, financial use case, and consequently we will need to move to a more powerful tool. Keeping this visualisation generic will make it a proof of concept for all case studies during the project lifetime. On the other hand, we should see how Kibana can be reused or improved, and find out what the commercial licence offers and how much it costs per year. Another point concerns colour: red and blue are the colours most often used to visualise data relating to the stock market. In political contexts, however, the choice of colours, such as whether or not to use red, might give the impression that we are taking sides.

 

Summary

This experience gave us an overview of Kibana as a data visualisation tool working with Elasticsearch, and of some of the visualisation techniques which can be applied in sentiment analysis, especially in political opinion mining. This work also served as a base for the next case study of the SSIX project, which will focus on visualising financial data.