Think you know English? Think again. See if you have what it takes to teach a computer how to understand humans.
Anyone who has tried to learn English as a second language is only too familiar with its many, many challenges. In addition to idioms, sarcasm, and words that take on a wide array of meanings when combined with various prepositions (think: make up, make out, make it, and of course, makeup), there’s also pop culture, trends, products, and more to keep straight. Luckily for us, we’re human, and even those well established in their native languages will be able to speak and decipher English with enough practice and exposure. But what about machines? How do we even begin to program them so that they can read and understand sentiment? Answer: very carefully. The process requires machine learning data scientists to use Natural Language Processing (NLP) techniques, a form of advanced analytics. They use these techniques to build models that can decipher sentiment and separate the meaningful information from the noise.
NLP is what some of the scientists at Opera Solutions work on every day. The uses are plentiful. When you have technology that can understand humans, you can learn what your customers are thinking, weed out terrorists, or predict violent outbreaks. Here’s just a glimpse of what goes into programming computers to think more like us (with apologies to Microsoft)...
Don’t catch my drift? Google it (and other terms that make sentiment analysis difficult).
Sentiment analysis applies NLP techniques to unstructured text to extract subjective information on a topic from the perspective of the message author. Sentiment analysis is commonly used to determine how consumers feel about products based on product reviews. However, sentiment analysis is broader and may be used to identify threats or risks.
As an example, consider the following sentence:
I hate Windows.
Most people would agree that this sentence expresses a negative sentiment by the author toward the Microsoft operating system, Windows. This is a simple three-word sentence and it’s easy to read over it quickly without thinking about the complexities of how to understand what it means.
First, the word “I” tells us the sentence is written in the first person, so it expresses the perspective of the author. The sentiment is carried in the word “hate,” and the target of the sentiment is the object of the sentence, “Windows.” “Windows” is expected to be a proper noun because it is capitalized, and it is commonly understood to mean the Microsoft Windows operating system.
But why do we expect this to be the operating system as opposed to a company name or a personal name? Why couldn’t this be referring to some guy named Joe Windows? Of course that is possible, but without additional information, people typically assume that a term like this refers to something common between the author and the reader. As the author is unknown, most people would interpret “Windows” as something so famous that it doesn’t need to be identified further.
A sentence like this is fairly easy to get a computer to understand. It is easy to identify the subject, object, and sentiment. We can estimate the sentiment by simply looking for sentiment-carrying terms in the message. In this case, the term “hate” tells us that the message has a generally negative sentiment.
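The term-lookup idea can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the lexicon, its scores, and the function names are all invented for this example, and a real system would rely on a much larger curated word list.

```python
import re

# Hypothetical toy lexicon mapping sentiment-carrying terms to polarity scores.
# The entries and values are assumptions made up for this illustration.
SENTIMENT_LEXICON = {
    "hate": -1.0,
    "hates": -1.0,
    "love": 1.0,
    "loves": 1.0,
    "like": 0.5,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def message_polarity(text):
    """Add up the scores of every sentiment-carrying term in the message."""
    return sum(SENTIMENT_LEXICON.get(token, 0.0) for token in tokenize(text))


print(message_polarity("I hate Windows."))
```

Under these assumptions the example sentence scores -1.0, a generally negative message. Note that the score says nothing about who holds the sentiment or what it is aimed at.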
If the general polarity of the message is all that is required, then simply looking for these sentiment-carrying terms may be sufficient. However, in many cases, we want to differentiate the sentiment for different targets. In this case, we need to associate “hate” with “Windows” to understand that the sentiment expressed is about Microsoft Windows and not about another product. This requires identifying the object of the sentence and may require sentence-parsing techniques for more complicated sentences.
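For a three-word sentence, associating the sentiment with its target can be approximated by treating the words as subject, verb, and object. The sketch below does exactly that and nothing more. The verb list and function names are hypothetical, and anything longer than three words is rejected, which is where real sentence-parsing techniques would have to take over.

```python
import re

# Hypothetical list of sentiment-carrying verbs with made-up polarity scores.
SENTIMENT_VERBS = {"hate": -1.0, "hates": -1.0, "love": 1.0, "loves": 1.0}


def parse_svo(sentence):
    """Naively read a three-word sentence as (subject, verb, object).

    Returns None for any other length; this is not a real parser.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    if len(words) != 3:
        return None
    return tuple(words)


def targeted_sentiment(sentence):
    """Return who holds the sentiment, what it targets, and its polarity."""
    parsed = parse_svo(sentence)
    if parsed is None:
        return None
    subject, verb, obj = parsed
    score = SENTIMENT_VERBS.get(verb.lower())
    if score is None:
        return None
    return {"holder": subject, "target": obj, "score": score}


print(targeted_sentiment("I hate Windows."))
```

Here “I hate Windows.” yields the holder “I”, the target “Windows”, and a score of -1.0, while a sentence such as “I really hate Windows.” returns None because the three-word assumption no longer holds.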
However, consider this sentence:
Windows hates me.
This sentence simply swaps the subject and object. But this sentence is more likely an expression of frustration as opposed to an expression of sentiment. We don’t expect it to be sentiment because “Windows” isn’t a person and wouldn’t express a feeling like “hate.”
Programming a computer to recognize these subtle differences is challenging. A program that simply looks at sentiment-carrying terms in the message wouldn’t be able to differentiate between the first and second sentences.
A more sophisticated program that is able to associate the sentiment to the object might still fail to properly interpret the message. For this, we need the common understanding that “Windows” is not a person and doesn’t express sentiment. This broader understanding may be captured in a database, but it would not only need to capture the concept of Windows but other general common-knowledge concepts as well. This would require a large dataset and would need to be regularly updated to keep it current as common knowledge evolves.
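One way to picture the common-knowledge database is as a lookup that records whether an entity is capable of holding an opinion. The sketch below is only an illustration: the table contents, labels, and function name are assumptions, and a usable knowledge base would be vastly larger and would need the regular updates described above.

```python
import re

# Hypothetical sentiment verbs and a hypothetical common-knowledge table.
# The table records whether an entity can hold an opinion at all.
SENTIMENT_VERBS = {"hate", "hates", "love", "loves"}
CAN_HOLD_OPINIONS = {"i": True, "we": True, "joe": True, "windows": False}


def interpret(sentence):
    """Label a three-word sentence as 'sentiment', 'frustration', or 'unknown'."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if len(words) != 3 or words[1].lower() not in SENTIMENT_VERBS:
        return "unknown"
    holder_is_sentient = CAN_HOLD_OPINIONS.get(words[0].lower())
    if holder_is_sentient is None:
        # The subject is missing from the knowledge base, so we cannot decide.
        return "unknown"
    return "sentiment" if holder_is_sentient else "frustration"


print(interpret("I hate Windows."))
print(interpret("Windows hates me."))
```

With this table the first sentence is labeled as sentiment and the second as frustration. A subject the table has never seen comes back as unknown, which shows how much the approach depends on the coverage of the knowledge base.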
Even these simple three-word sentences can be difficult for a computer to interpret. Consider the two-word sentence:
That’s hot
Twenty years ago, this would have been universally understood as a warning that a nearby object should be avoided. But now, it can also be understood to express a favorable opinion about something, and to further complicate things, in that sense it may be equivalent to “That’s cool.”
How do we understand when “That’s hot” is a warning or a favorable opinion? In spoken language, these expressions are differentiated based on the tone, cadence, and body language of the speaker. When expressed quickly, in an alarming tone, while pulling a hand away from a stove, we’d take this as a warning. When expressed slowly, in a level tone, while pointing to clothes on a shopping trip, we’d take this as an expression of sentiment.
If all we have is the two words here, there is no way to differentiate these. If there is additional text, then the general context may be used:
I really like that. That’s hot.
That’s hot. Don’t touch.
With additional context for the phrase, we can distinguish between the two interpretations. However, even to do this, we would need to understand beforehand that the phrase “That’s hot” can mean different things in different contexts. Even then, we’d need to know the ways the phrase may be used and specifically examine the surrounding context to tell them apart. Considering that this is one phrase among many, and this is only one language among many, getting computers to decipher sentiment reliably is a very challenging task.
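A rough sketch of that context check follows, assuming we already know that “That’s hot” is ambiguous and have a list of cue words for each reading. The cue lists and the function name are invented for illustration, and counting cue words is only a stand-in for real context analysis.

```python
import re

# Hypothetical cue words for each reading of the phrase "That's hot".
OPINION_CUES = {"like", "love", "want", "cute", "gorgeous"}
WARNING_CUES = {"touch", "careful", "burn", "ouch", "stove"}


def context_tokens(passage):
    """Lowercase the passage, normalize curly apostrophes, and collect its words."""
    normalized = passage.lower().replace("\u2019", "'")
    return set(re.findall(r"[a-z']+", normalized))


def interpret_thats_hot(passage):
    """Guess which reading applies by counting cue words in the surrounding text."""
    tokens = context_tokens(passage)
    opinion_hits = len(tokens & OPINION_CUES)
    warning_hits = len(tokens & WARNING_CUES)
    if opinion_hits > warning_hits:
        return "favorable opinion"
    if warning_hits > opinion_hits:
        return "warning"
    return "ambiguous"


print(interpret_thats_hot("I really like that. That's hot."))
print(interpret_thats_hot("That's hot. Don't touch."))
print(interpret_thats_hot("That's hot."))
```

The first passage comes back as a favorable opinion, the second as a warning, and the bare two-word sentence as ambiguous, which mirrors the point that the phrase alone cannot be resolved.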
Intrigued? There’s a lot more to learn. Opera Solutions’ SignalSensor Analyst delivers immediate and significant gains in open source intelligence capability by employing advanced machine learning science to enhance human ability to quickly extract and analyze the most relevant pieces of information from a vast and constantly expanding data universe. Our white paper, “Tactical Intelligence Derived from Social Media Sources,” explains how we used machine learning algorithms, language transliteration, dynamic context analysis, and other sophisticated methods to gather intelligence from social media that provided early indications of pending violence in Libya.
Brian Kolo is a principal scientist at Opera Solutions Government Services. He co-leads the Unstructured Data Analytics Group. Brian holds a B.S. in Engineering Physics from the University of Illinois at Urbana-Champaign, a J.D. from George Mason University, and a Ph.D. from the University of Virginia.