Multilingual, domain-specific data analytics

When a small or medium-sized enterprise has developed a great language technology application for English, it is often a major investment to enable comparable functionality in other languages and launch it in countries where English is not the first language. A report from the European Commission (http://ec.europa.eu/public_opinion/archives/ebs/ebs_386_en.pdf) found that only 38% of Europeans speak English as a foreign language, and about 66% of those claim to speak it well or very well. Furthermore, how many people in, for example, Germany, Italy or Finland have the language setting on their smartphones or laptops set to English? In our experience, not many. To make an application accessible and relevant to users worldwide, supporting the local languages is a necessity.

Localizing a language technology application is not only about localizing the user interface and manual. If the application relies on background information, e.g. from Wikipedia, maps, social media or news sites, then the backend of the application also needs localization to be relevant in another country. For example, you may want the app to query Handelsblatt (Germany) or Kauppalehti (Finland) instead of, or in addition to, the Financial Times. If the application relies on language processing, you may need to invest in substantial implementation work, because your current statistical classifier or language rules and templates are based on English and will not work well when the user says or types “Var finns närmaste mataffär?” (Swedish) instead of “Where is the nearest supermarket?”.
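As a rough illustration of backend localization, the choice of news source can be driven by the user's country. The sketch below is only an assumption about how such a lookup might be organized; the function name `news_sources`, the country-code keys and the fallback behaviour are hypothetical, and only the publication names come from the example above.

```python
# Hypothetical mapping from country code to local financial news sources.
# Only the publication names are taken from the text; the structure is an
# illustrative assumption, not part of any real application.
DEFAULT_SOURCES = ["Financial Times"]
LOCAL_SOURCES = {
    "DE": ["Handelsblatt"],
    "FI": ["Kauppalehti"],
}


def news_sources(country_code: str, include_default: bool = True) -> list[str]:
    """Return the news sources to query for a given country.

    Local sources come first. The default source is appended when requested,
    and it is always used when no local source is known for the country.
    """
    local = LOCAL_SOURCES.get(country_code.upper(), [])
    if include_default or not local:
        return local + DEFAULT_SOURCES
    return local
```

With this layout, supporting a new country means adding one entry to the mapping rather than changing the query logic.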

In the SSIX project we aim to enable multilingual support from the beginning, to make the SSIX platform available and relevant in as many countries as possible. The major areas we will address to enable multilingual functionality within the SSIX project are:

  • Language resource development;
  • Text data collection and annotation;
  • Natural language processing component development;
  • User interface localization;
  • User testing.

One of the main objectives of the SSIX platform is to assign sentiment to financial texts. The texts are extracted from several sources, such as social networks, news sites and blogs, and in several languages (for more information see http://ssix-project.eu/dealing-with-big-data-and-its-challenges/). A number of sentiment classifiers are available in different languages for movie or restaurant reviews, but in the finance domain the often-used positive vs negative sentiment may not be appropriate. Consider the following two tweets:

  • Today will be big buy back Friday.
  • I’m selling everything at bell.

These tweets don’t make any claim about the writer’s sentiment. Still, they express two very different attitudes with a clear meaning to a trader. In this context the bullish vs bearish dichotomy is often used:

  • Today will be big buy back Friday – Bullish.
  • I’m selling everything at bell – Bearish.

A sentiment analyser developed for, say, movie or restaurant reviews will therefore not perform satisfactorily in the finance domain. The SSIX consortium will develop a dedicated classifier, trained on the financial domain. The domain-specific language resources will be created for multiple languages in the most efficient way by using a combination of human annotators, machine learning and machine translation.
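To make the bullish vs bearish distinction concrete, here is a deliberately naive sketch that labels a text by counting trading cue words. It is not the SSIX classifier, which is to be trained on annotated financial data; the cue lists and the function name `label_stance` are hypothetical and chosen only to cover the two example tweets.

```python
import re

# Hypothetical cue words for illustration only. A real classifier would be
# trained on annotated financial text instead of hand-written keyword lists.
BULLISH_CUES = {"buy", "buying", "long"}
BEARISH_CUES = {"sell", "selling", "short"}


def label_stance(text: str) -> str:
    """Label a text as 'bullish', 'bearish' or 'neutral' by counting cues."""
    # Lowercase and keep only runs of ASCII letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    bullish = sum(token in BULLISH_CUES for token in tokens)
    bearish = sum(token in BEARISH_CUES for token in tokens)
    if bullish > bearish:
        return "bullish"
    if bearish > bullish:
        return "bearish"
    return "neutral"
```

Neither tweet contains a word such as "good" or "bad" that a review-trained sentiment model would pick up on, yet domain cues like "buy" and "selling" are enough to separate them. A keyword list like this one is English-only and would have to be rebuilt for every language, which is one reason to create annotated resources per language.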

Another challenge we face in the SSIX project is that most available NLP components in languages other than English, e.g. tokenizers, lemmatizers and part-of-speech taggers, are developed for news text and do not work well when applied to the writing used in social media, with its informal spelling and special language such as emoticons or hashtags. Here we will primarily investigate “short-cut” solutions, e.g. text normalization rules, to allow re-using available NLP components to as large an extent as possible.
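A handful of text normalization rules might look like the sketch below. The specific rules, placeholder tokens (`URL`, `USER`, `EMO_POS`, `EMO_NEG`) and the function name `normalize` are assumptions made for illustration, not the rules used in SSIX. The idea is only to turn social media text into something closer to what a tokenizer or tagger built for news text expects.

```python
import re

# Hypothetical normalization rules, applied in order.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,6})\b")
HASHTAG_RE = re.compile(r"#(\w+)")
# A letter repeated three or more times ("sooooo"); digits are left alone
# so that amounts such as 1000 are not altered.
ELONGATION_RE = re.compile(r"([^\W\d_])\1{2,}")
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":(": "EMO_NEG", ":-(": "EMO_NEG"}


def normalize(text: str) -> str:
    """Rewrite social media text into a more news-like form."""
    text = URL_RE.sub("URL", text)        # replace links with a placeholder
    text = MENTION_RE.sub("USER", text)   # replace @mentions with a placeholder
    text = CASHTAG_RE.sub(r"\1", text)    # $AAPL -> AAPL
    text = HASHTAG_RE.sub(r"\1", text)    # #stocks -> stocks
    for emoticon, tag in EMOTICONS.items():
        text = text.replace(emoticon, f" {tag} ")
    text = ELONGATION_RE.sub(r"\1\1", text)  # sooooo -> soo
    return " ".join(text.split())         # collapse repeated whitespace
```

Because the rules only rewrite the input string, the downstream NLP components can stay unchanged, which is the point of the short-cut approach. Rules like the emoticon table are largely language-independent, while spelling rules would need to be reviewed for each language.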

SSIX will also localize the user interface into several languages and perform user testing to ensure that the SSIX services meet or exceed the end users’ expectations.

Companies around the world contact Lionbridge for help in making their language technology products and services available on a global scale. In the SSIX project, Lionbridge collaborates closely with the software engineers to develop language-specific resources that best support development and testing.

The SSIX project still has a challenging road ahead and Lionbridge feels privileged to be part of it.

 

This blog post was written by Lionbridge, one of the partners of SSIX.

For more information on SSIX, visit our website SSIX-project.eu.

For the latest updates, like us on Facebook, follow us on Twitter and join us on LinkedIn.