Making Computers Talk

Natural Language Processing lies at the intersection of linguistics and computer science, and promises to change the way we interact with computers. Forever.

What is Natural Language Processing?

Natural language, or the language spoken by humans the world over, is a fascinating construct. Not only does each language have its own set of rules concerning grammar and semantics, but over time these languages spawn dialects that modify these rules in subtle ways. Computers, on the other hand, are programmed using a set of instructions that are collectively called a programming language. These languages have their own ‘grammar’ (called syntax) and their own ‘vocabulary’ (called keywords), and bear little or no resemblance to natural language. While these programming languages operate on a very strict set of rules governed by logic, spoken language is often far more flexible, and any attempt to quantify its rules would fall under the domain of ‘fuzzy logic’.

Natural Language Processing (NLP) is an attempt to use the concepts of linguistics and machine learning to allow computers to understand and communicate using natural languages.

The earliest attempts at getting computers to understand human languages were in the 1950s, when machines were programmed to translate texts between different languages using a hard-coded set of rules. However, such attempts failed quickly owing to the imprecise and ever-changing nature of language. Recent developments in the field of machine learning using artificial neural networks have used statistical models that ‘train’ the software to understand the subtleties of language. Such systems can interpret natural languages more accurately, owing to their probabilistic approach, where they are trained using a set of known rules or words, and use this data to interpret new, raw data.

This form of NLP that uses machine learning models has proven far more successful than earlier approaches, which modelled themselves after computer languages rather than human languages.

The problem with this approach was that programming languages have fewer rules, enforced far more strictly than those of natural languages, which follow complex and ever-changing rules of grammar.

Language-specific semantics is another hurdle, as the literal meaning of words may not always reflect what the speaker intends.

Machine learning approaches the problem from a human perspective. Humans tend to learn a new language faster using examples as opposed to strict rules, and NLP algorithms try to use this very fact to improve the accuracy of their systems.

Algorithms used in Natural Language Processing:

The Markov Model and Parts of Speech

Most algorithms that classify text use complex learning algorithms involving statistical models to extract meaning from text. An older, less accurate but considerably simpler technique is the Markov model, which is used in classifiers such as Parts Of Speech (POS) tagging or sentiment analysis. POS tagging is a system which tries to assign a tag to each word in a sentence, identifying it as an adjective, verb, noun, article, singular, plural and so on. Sentiment analysis is a technique that looks for certain words or ‘identifiers’ that determine the overall ‘mood’ or sentiment of the sentence. Such analysis is useful for analysing social media to gauge group sentiment.
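
To make this concrete, here is a minimal sketch of keyword-based sentiment analysis in Python. The word lists and test sentences are invented for illustration; real systems use much larger, statistically weighted lexicons.

```python
# A minimal sketch of keyword-based sentiment analysis.
# The word lists below are illustrative, not taken from any real lexicon.

POSITIVE = {"great", "love", "brilliant", "enjoyable"}
NEGATIVE = {"boring", "hate", "awful", "disappointing"}

def sentence_sentiment(sentence: str) -> str:
    words = sentence.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentence_sentiment("I love this book, the plot is brilliant"))       # positive
print(sentence_sentiment("The last season was boring and disappointing"))  # negative
```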

One company putting sentiment analysis to good use on social media data is Paraktweet, which has launched two products. The first is called Bookvibe, which gives book recommendations based on positive tweets about a tome, and the second is TrendFinder, which helps entertainment companies gauge the audience's response by following the conversations around a movie or TV series.

Parsing systems have also been developed that analyse online reviews on sites like Amazon to gauge the overall response to a product, helping prospective buyers make an informed choice.

Markov models are a class of probabilistic learning algorithms, where the sentence is broken down into its individual constituent words and represented in the form of a tree or a chain. The system contains a corpus of pre-classified words which is used to perform a preliminary POS classification. However, one word may belong to more than one part of speech, and the classification of such a word can never be completely unambiguous. For example, the word ‘train’ can be used as a noun (a local train) or a verb (to train as a professional athlete). Which of the two is correct in a particular context is determined using probabilities assigned to the words based on historical data. The words that immediately follow or precede an ambiguous word may also be analysed to assign such probabilities.
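
The following toy sketch shows this kind of disambiguation in Python. The context probabilities for the word ‘train’ are made up for illustration; a real tagger would estimate them from a large tagged corpus.

```python
# A toy sketch of disambiguating an ambiguous word ("train") using the word
# that precedes it. The probabilities are invented for illustration.

# P(tag for "train" | previous word)
CONTEXT_PROBS = {
    "the":   {"noun": 0.90, "verb": 0.10},  # "the train" is almost always a noun
    "local": {"noun": 0.95, "verb": 0.05},  # "a local train"
    "to":    {"noun": 0.20, "verb": 0.80},  # "to train as an athlete"
}

def tag_train(previous_word: str) -> str:
    probs = CONTEXT_PROBS.get(previous_word.lower(), {"noun": 0.5, "verb": 0.5})
    return max(probs, key=probs.get)

print(tag_train("local"))  # noun
print(tag_train("to"))     # verb
```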

By repeating this process, the system ‘learns’ over time, which is what the term machine learning refers to.

One excellent application of a Markov chain is in text prediction, specifically in predictive keyboards on smartphones such as the Android 4.1+ keyboard or Swiftkey.

These employ a similar Markov model, where the next word is predicted probabilistically, taking into account previously typed pairs of words as well as a corpus of word-pairs found in the training data to accurately predict the next word that the user might type. Recent updates to Swiftkey have even allowed for the prediction of whole sentences by extrapolating the same algorithm to groups of words. Swiftkey’s predictions tend to be eerily accurate, and the app ‘gets to know you’ with time.
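
The core idea can be sketched in a few lines of Python: count which words follow which in a training text (a stand-in for the user's typing history plus a stock corpus), and suggest the most frequent followers.

```python
# A minimal sketch of bigram-based next-word prediction, the idea behind
# predictive keyboards. The training text is invented for illustration.

from collections import Counter, defaultdict

training_text = (
    "i am going to the office today "
    "i am going to be late "
    "see you at the office"
)

# Count how often each word follows each other word.
bigram_counts = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    bigram_counts[current_word][next_word] += 1

def predict_next(word: str, k: int = 3):
    """Return the k most likely words to follow `word`."""
    return [w for w, _ in bigram_counts[word].most_common(k)]

print(predict_next("going"))  # ['to']
print(predict_next("the"))    # ['office']
```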

Swiftkey learns your typing habits over time, which can get slightly... embarrassing

A modification of a Markov chain is called tree-traversal, where individual tokens (letters in a word, words in a sentence or sentences in a block of text) act as nodes of a tree, and each parent node has a child node for each allowed permutation. As an example, a spell-checker would work by traversing a tree with 26 parent nodes (one for each letter of the alphabet) and up to 26 child nodes, which in turn will have at most 26 children of their own. Each level of the tree represents the next letter in a word (so a four-letter word would be four levels deep), and the algorithm traverses the nodes as the letters are entered in the text document. The number of levels in the tree would be the length of the longest word in the language. Every time the user enters a combination of letters that is not a valid path in the tree, the program will flag the word as being incorrectly spelled.
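
Here is a minimal sketch of that idea in Python, using nested dictionaries as the tree (a structure usually called a trie); the four-word dictionary is purely illustrative.

```python
# A small sketch of the tree-traversal spell check described above: dictionary
# words are inserted into a tree one letter per level, and a word is flagged
# as misspelled if walking its letters falls off the tree.

END = "_end"  # marker for nodes where a complete word terminates

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True
    return root

def is_valid(word, trie):
    node = trie
    for letter in word:
        if letter not in node:   # no valid path for this letter sequence
            return False
        node = node[letter]
    return END in node           # the path must also end on a complete word

dictionary = build_trie(["train", "trains", "trait", "tree"])
print(is_valid("train", dictionary))  # True
print(is_valid("trian", dictionary))  # False -> flagged as misspelled
```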

Suggesting an alternate spelling is far more difficult though, as there is no way of knowing which letter of the word has been incorrectly entered, and it requires traversing each letter in the word-tree, trying to find a word whose spelling is similar to the mistyped word.
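
A naive way to sketch the suggestion step is to compare the mistyped word against every dictionary word and return the closest matches, here using Python's standard difflib module; a real spell-checker would prune this search by walking the word-tree instead.

```python
# A naive sketch of spelling suggestion: rank dictionary words by their
# string similarity to the mistyped word. The word list is illustrative.

import difflib

dictionary_words = ["train", "trains", "trait", "tree", "brain"]

def suggest(word, words=dictionary_words, n=3):
    return difflib.get_close_matches(word, words, n=n)

print(suggest("trian"))  # the closest-looking dictionary words, e.g. 'train'
```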

LexRank and Automatic Summarization

Thanks to the ubiquity of the Web, people are finding themselves at the receiving end of a firehose of new information every day, from email and social media updates to news stories and feeds from blogs and websites. It is practically impossible to consume and digest all of this information in a sane manner, which has given rise to the need for automatic summarization of texts.

Summly provided summarized versions of news stories in a gorgeous interface. No wonder Yahoo bought it!

Popular iPhone app Summly provided summaries of top news stories in an elegant, well-designed interface until it was acquired by Yahoo, making the 17-year-old prodigy behind the app richer by $30 million. Summify, a similar service that aggregated content from all your social feeds and delivered it via email or the web, was acquired by Twitter last year.

Not to be left behind, Google acquired Wavii, a startup that offered a personalized, summarized news feed to users by combining stories from social networks and other web content.

One of the most popular algorithms for performing automatic summarization is the LexRank algorithm, which gets its name from the PageRank algorithm that powers Google's search engine. It works by using a graph-like structure to represent the sentences and their relationships with each other. Each node of the graph represents a sentence, and the edges connecting these nodes indicate the similarity between the sentences. The shorter this edge (or the closer two nodes are), the more similar in meaning the sentences are. The similarity of the sentences is calculated by measuring the frequency of the words occurring in the sentences.

This weighting leads to low-frequency words being assigned a higher priority in the summary. The importance of a sentence is determined by the relative importance of the adjacent sentences, similar to how PageRank calculates the relative importance of a web page. A similarity matrix is constructed from this LexRank data, and all sentences with a score below a threshold value are rejected.
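
The simplified sketch below captures the flavour of the algorithm: it scores a handful of invented sentences using word-overlap (Jaccard) similarity and a PageRank-style power iteration, whereas real LexRank uses an IDF-weighted cosine similarity measure.

```python
# A simplified LexRank-style summarizer: sentences become graph nodes,
# word-overlap similarity becomes edge weights, and a PageRank-like
# iteration scores each sentence. All sentences are invented examples.

import numpy as np

sentences = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock markets fell sharply today",
    "markets fell as stocks dropped sharply",
]

def similarity(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / (len(wa | wb) or 1)  # Jaccard word overlap

n = len(sentences)
sim = np.array([[similarity(a, b) for b in sentences] for a in sentences])
np.fill_diagonal(sim, 0)

# Normalise rows so the matrix behaves like transition probabilities.
row_sums = sim.sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1
transition = sim / row_sums

# Power iteration, as in PageRank, to find each sentence's importance.
scores = np.full(n, 1.0 / n)
for _ in range(50):
    scores = 0.15 / n + 0.85 * (transition.T @ scores)

# Keep the highest-scoring sentences as the summary.
top = sorted(np.argsort(scores)[::-1][:2])
print([sentences[i] for i in top])
```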

Speech recognition – Apple’s Siri and Google’s voice search

While the process of recognising spoken words and transcribing them doesn't exactly fall under NLP, training computer systems to respond to the transcribed text is a very exciting application. The process of extracting meaningful and unambiguous text from an audio clip is done using Hidden Markov Models, an extension of the Markov models described before.

The audio clip is sent through a Fourier transform, which splits the stream up into different frequencies (similar to an equalizer), and a correlation algorithm forms an association between the shape of the curve and the (known) word being spoken. A corpus of such training data is formed, which allows the system to assign probabilities to each word spoken by the user. Based on these probabilities, a Markov chain is constructed which gives a reasonable transcription of the spoken words into text.
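
As a rough illustration of that first step, the sketch below runs a Fourier transform over a synthetic 'audio' frame (two pure tones standing in for recorded speech) and prints its strongest frequency components; a real recogniser would do this over short, overlapping frames of the actual recording.

```python
# A toy sketch of the Fourier-transform front end: split a short frame of
# audio into its frequency components. The "audio" here is two synthetic
# tones rather than real speech.

import numpy as np

sample_rate = 16000                       # samples per second
t = np.arange(0, 0.05, 1 / sample_rate)   # a 50 ms frame

frame = np.sin(2 * np.pi * 300 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

spectrum = np.abs(np.fft.rfft(frame))
freqs = np.fft.rfftfreq(len(frame), d=1 / sample_rate)

# The strongest frequency bins form the 'shape of the curve' that a
# correlation step would compare against known words.
for i in spectrum.argsort()[::-1][:3]:
    print(f"{freqs[i]:7.1f} Hz  strength {spectrum[i]:.1f}")
```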

Once the text has been extracted, virtual assistants like Siri, Google Voice Search and Siri-like alternatives on Android such as Iris, Vlingo and Robin parse the text and break it into parts of speech. The second part of the task involves extracting useful information from the text, such as the nature of the query and the best way to respond to it.

These systems use a pre-indexed database for replying to queries, which can include both offline data as well as online search engines, to deliver the most relevant response.

Another age-old problem that is seeing a new lease of life is machine translation. Google Translate is a great example of applying modern machine learning techniques to automatic translation, as earlier attempts based on hard-coded rules failed when translating anything beyond simple phrases. This is still an open problem, however, as translating large chunks of text (such as scientific papers, journals or novels) is far from perfect.

Google debuted its updated Voice Search in Android 4.1 Jelly Bean

Not to be dissuaded, however, researchers at Carnegie Mellon University have come up with a unique crowdsourcing model for translating documents on the Internet using an app called Duolingo.

The website (along with mobile apps for iOS and Android) gamifies the process of learning a new language, where users are rewarded with points for completing new lessons. The service, which is completely free of cost, tests the newly acquired language skills of its users by asking them to translate documents from/to the language they are learning.

A statistical model is employed that correlates the translated documents submitted by thousands of its users to weed out any mistranslations.
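
In its simplest form, such a correlation step amounts to a majority vote over the submitted translations; the submissions below are invented for illustration.

```python
# A sketch of crowd-sourced correlation: keep the translation that most
# learners agree on and discard the outliers.

from collections import Counter

submissions = [
    "where is the railway station",
    "where is the railway station",
    "where is train station",
    "where is the railway station",
    "railway station where",
]

best, votes = Counter(submissions).most_common(1)[0]
print(f"accepted translation ({votes}/{len(submissions)} votes): {best}")
```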

Facebook’s newest feature, Graph Search, employs Markov models to break down the search string into a hierarchy of components that are then used as commands to query the database. For example, "Friends who live in Mumbai" will look for the intersection of all users who satisfy both conditions simultaneously. Permutations of this string such as "Friends in Mumbai" or "My Mumbai friends" also point to the same dataset, as the parsing engine is intelligent enough to distinguish between the words in the string that are essential (friends, Mumbai) and those that are non-essential (living, my). This allows for highly specific searches using data your friends have entered.
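
A toy sketch of this kind of parsing: strip the non-essential words from the query, map the essential ones to sets of users, and intersect those sets. The friend data and the stopword list below are invented for illustration.

```python
# A toy Graph-Search-style query parser. The data is invented; a real system
# would query indexed social-graph data instead of in-memory sets.

my_friends = {"Asha", "Rahul", "Priya", "Karan"}   # who the searcher's friends are
lives_in = {                                       # who lives where
    "Mumbai": {"Asha", "Rahul", "Meera"},
    "Delhi": {"Karan"},
}

NON_ESSENTIAL = {"my", "who", "live", "living", "in"}

def search(query):
    # Keep only the essential words, then intersect the user sets they map to.
    essential = [w for w in query.split() if w.lower() not in NON_ESSENTIAL]
    result = None
    for term in essential:
        matches = my_friends if term.lower() == "friends" else lives_in.get(term, set())
        result = matches if result is None else result & matches
    return result or set()

# All three phrasings reduce to the same essential terms and the same result.
print(search("Friends who live in Mumbai"))  # {'Asha', 'Rahul'}
print(search("Friends in Mumbai"))
print(search("My Mumbai friends"))
```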

As statistical models for processing natural language data get more effective, the divide between programming languages and natural languages seems to be closing. Data scientists are finding new ways to parse human languages into tokens of data, leading to new high-level programming languages that lie very close to spoken languages, making programming accessible to a wider audience. Intelligent voice-activated assistants like Siri are indicative of the massive progress that NLP has made over the years, but localising these systems into multiple languages, each with its own idiosyncrasies, is another challenge that engineers are working hard to tackle.

Apple’s Siri employs highly sophisticated classifiers to parse speech input

Facebook’s Graph Search uses NLP to find the most relevant search results
