Making Computers Talk
What is Natural Language Processing?
Natural language, or the language spoken by humans the world over, is a fascinating construct. Not only does each language have its own set of rules concerning grammar and semantics, but over time these languages spawn dialects that modify these rules in subtle ways. Computers, on the other hand, are programmed using a set of instructions collectively called a programming language. These languages have their own ‘grammar’ (called syntax) and their own ‘vocabulary’ (called keywords) and bear little or no resemblance to natural language. While programming languages operate on a very strict set of rules governed by logic, spoken language is far more flexible, and any attempt to quantify its rules would fall under the domain of ‘fuzzy logic’.
Natural Language Processing (NLP) is an attempt to use the concepts of linguistics and machine learning to allow computers to understand and communicate using natural languages.
The earliest attempts at getting computers to understand human languages came in the 1950s, when machines were programmed to translate texts between different languages using a hard-coded set of rules. However, such attempts failed quickly owing to the imprecise and ever-changing nature of language. Recent developments in the fields of machine learning using artificial neural networks have used statistical models that ‘train’ the software to understand the subtleties of language. Such systems can interpret natural languages more accurately, owing to their probabilistic approach: they are trained using a set of known rules or words, and use this data to interpret new, raw data.
This form of NLP that uses machine learning models has proven far more successful than earlier approaches which modelled themselves after computer languages rather than human languages.
The problem with that approach was that programming languages have fewer rules, enforced far more strictly, than natural languages, which follow complex and ever-changing rules of grammar.
Language-specific semantics is another hurdle, as the literal meaning of a word may not always match its intended sense.
Machine learning approaches the problem from a human perspective. Humans tend to learn a new language faster from examples than from strict rules, and NLP algorithms use this very fact to improve their accuracy.
Algorithms used in Natural Language Processing
The Markov Model and Parts of Speech
Most algorithms that classify text use complex learning algorithms involving statistical models to extract meaning from text. An older, less accurate but considerably simpler technique is the Markov model, which is used in classifiers such as Parts of Speech (POS) tagging and sentiment analysis. POS tagging is a system that tries to assign a tag to each word in a sentence, identifying it as an adjective, verb, noun, article, singular, plural and so on. Sentiment analysis is a technique that looks for certain words or ‘identifiers’ that determine the overall ‘mood’ or sentiment of a sentence. Such analysis is useful for analysing social media to gauge group sentiment.
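The identifier-based approach can be sketched in a few lines. The word lists and weights below are invented for illustration; real sentiment analysers learn weighted lexicons from labelled data rather than using hand-picked sets.

```python
# A minimal lexicon-based sentiment scorer (illustrative only; the
# word lists are invented, not taken from any real tool).
POSITIVE = {"great", "love", "excellent", "amazing", "good"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "poor"}

def sentiment(sentence: str) -> str:
    words = sentence.lower().split()
    # Count positive 'identifiers' minus negative ones.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great book"))   # positive
print(sentiment("What a terrible movie"))    # negative
```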
One company putting sentiment analysis to good use on social media data is Parakweet, which has launched two products. The first, called Bookvibe, gives book recommendations based on positive tweets about a tome, and the second, TrendFinder, helps entertainment companies gauge the audience’s response to a movie or TV series by following the conversations around it.
Parsing systems have also been developed that analyse online reviews on sites like Amazon to gauge the overall response to a product, helping prospective buyers make an informed choice.
Markov models are a class of probabilistic learning algorithms in which a sentence is broken down into its individual constituent words and represented in the form of a tree or a chain. The system contains a corpus of pre-classified words, which is used to perform a preliminary POS classification. However, one word may belong to more than one part of speech, and the classification of such a word can never be completely unambiguous. For example, the word ‘train’ can be used as a noun (a local train) or a verb (to train as a professional athlete). Which of the two is correct in a particular context is determined using probabilities assigned to the words based on historical data. The words that immediately follow or precede an ambiguous word may also be analysed to assign such probabilities.
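The ‘train’ example above can be sketched as a lookup of tag probabilities conditioned on the preceding word. The counts below are invented for this sketch; a real tagger would estimate them from a large annotated corpus.

```python
# Toy Markov-style POS disambiguation. Counts are hypothetical
# 'historical data': how often 'train' was tagged NOUN vs VERB
# after a given preceding word.
context_counts = {
    ("the", "train"): {"NOUN": 95, "VERB": 5},   # "the train" → almost always a noun
    ("to", "train"): {"NOUN": 10, "VERB": 90},   # "to train" → almost always a verb
}

def tag_word(prev_word: str, word: str) -> str:
    counts = context_counts.get((prev_word, word))
    if counts is None:
        return "UNKNOWN"
    # Pick the tag with the highest historical probability.
    return max(counts, key=counts.get)

print(tag_word("the", "train"))  # NOUN  (a local train)
print(tag_word("to", "train"))   # VERB  (to train as an athlete)
```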
By repeating this process, the system ‘learns’ over time; this gradual improvement from data is what the term machine learning refers to.
One excellent application of a Markov chain is in text prediction, specifically in predictive keyboards on smartphones such as the Android 4.1+ keyboard or Swiftkey.
These employ a similar Markov model, where the next word is predicted probabilistically, taking into account previously typed pairs of words as well as a corpus of word pairs found in the training data. Recent updates to Swiftkey have even allowed for the prediction of whole sentences by extrapolating the same algorithm to groups of words. Swiftkey’s predictions tend to be eerily accurate, and the app ‘gets to know you’ with time.
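A bare-bones version of this word-pair prediction fits in a dozen lines. The training sentence is an invented sample; a keyboard's corpus would be your actual typing history plus a large general-purpose text collection.

```python
from collections import defaultdict, Counter

# Build a word-pair (bigram) model from a tiny invented training text.
text = "i am going home i am going out i am happy"
words = text.split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    bigrams[prev][nxt] += 1   # count how often `nxt` follows `prev`

def predict_next(word: str) -> str:
    """Return the word most frequently seen after `word` in training data."""
    if word not in bigrams:
        return ""
    return bigrams[word].most_common(1)[0][0]

print(predict_next("am"))  # "going" — seen twice, vs "happy" once
```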
Swiftkey learns your typing habits over time, which can get slightly... embarrassing
A modification of a Markov chain is tree traversal, where individual tokens (letters in a word, words in a sentence or sentences in a block of text) act as nodes of a tree, and each parent node has a child node for each allowed permutation. As an example, a spell checker would work by traversing a tree with 26 parent nodes (one for each letter of the alphabet) and up to 26 child nodes, which in turn would have at most 26 children of their own. Each level of the tree represents the next letter in a word (so a four-letter word would span four levels), and the algorithm traverses the nodes as the letters are entered in the text document. The number of levels in the tree would be the length of the longest word in the language. Every time the user enters a combination of letters that is not a valid path in the tree, the program flags the word as incorrectly spelled.
Suggesting an alternate spelling is far more difficult though, as there is no way of knowing which letter of the word has been incorrectly entered; it requires traversing the word tree, trying to find a word with a spelling similar to the mistyped one.
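Both halves of this idea — flagging invalid paths and hunting for near matches — can be sketched with a small letter tree (a trie). The four-word dictionary is invented, and the suggestion routine naively walks the whole tree comparing same-length words letter by letter; a real spell checker would prune branches aggressively and allow insertions and deletions, not just substitutions.

```python
# A minimal trie-based spell checker over a tiny invented word list.
class TrieNode:
    def __init__(self):
        self.children = {}    # one child node per allowed next letter
        self.is_word = False  # True if a valid word ends at this node

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False      # no valid path in the tree: flag as misspelled
        node = node.children[ch]
    return node.is_word

def suggest(root, word, node=None, prefix=""):
    """Collect same-length dictionary words differing by at most one letter."""
    node = node or root
    results = []
    if node.is_word and len(prefix) == len(word):
        diffs = sum(a != b for a, b in zip(prefix, word))
        if diffs <= 1:
            results.append(prefix)
    for ch, child in node.children.items():
        results += suggest(root, word, child, prefix + ch)
    return results

root = TrieNode()
for w in ["cat", "car", "card", "care"]:
    insert(root, w)

print(contains(root, "car"))   # True  — valid path ending at a word
print(contains(root, "cap"))   # False — path breaks at 'p'
print(suggest(root, "cap"))    # ['cat', 'car'] — one letter off each
```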
LexRank and Automatic Summarization
Thanks to the ubiquity of the Web, people find themselves at the receiving end of a firehose of new information every day, from email and social media updates to news stories and feeds from blogs and websites. It is practically impossible to consume and digest all of this information in a sane manner, which has given rise to the need for automatic summarization of texts.
Summly provided summarized versions of news stories in a gorgeous interface. No wonder Yahoo bought it!
Not to be left behind, Google acquired Wavii, a startup that offered a personalized, summarized news feed to users by combining stories from social networks and other web content.
One of the most popular algorithms for performing automatic summarization is the LexRank algorithm, which gets its name from the Google PageRank algorithm that powers the company’s search engine. It works by using a graph-like structure to represent the sentences and their relationships with each other. Each node of the graph represents a sentence, and the edges connecting these nodes indicate the similarity between sentences. The shorter this edge (or the closer two nodes are), the more similar in meaning the sentences are. The similarity of the sentences is calculated by measuring the frequency of the words occurring in them.
This leads to low-frequency words being assigned a higher priority in the summary. The importance of a sentence is determined by the relative importance of the adjacent sentences, similar to how PageRank calculates the relative importance of a web page. A similarity matrix is then constructed using this LexRank data, and all sentences with a score below a threshold value are rejected.
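A heavily simplified sketch of the idea follows. Real LexRank uses tf-idf cosine similarity and a PageRank-style power iteration over the similarity matrix; here, as a stand-in, each sentence's centrality is approximated by the sum of its word-overlap similarities to every other sentence, and the example sentences are invented.

```python
# A much-simplified sketch of the LexRank idea: score each sentence
# by how similar it is to the rest, and keep the top scorers.

def similarity(s1, s2):
    """Word-overlap (Jaccard) similarity between two sentences, 0..1."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / len(w1 | w2)

def summarize(sentences, keep=1):
    scores = []
    for i, s in enumerate(sentences):
        # Centrality: total similarity to all other sentences.
        score = sum(similarity(s, t) for j, t in enumerate(sentences) if i != j)
        scores.append((score, s))
    scores.sort(reverse=True)
    return [s for _, s in scores[:keep]]

sentences = [
    "The cat sat on the mat",
    "A cat was sitting on the mat",
    "Stock prices fell sharply today",
]
# The off-topic stock sentence scores 0 and is dropped from the summary.
print(summarize(sentences, keep=1))
```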
Speech recognition – Apple’s Siri and Google’s voice search
While the process of recognising spoken words and transcribing them doesn’t exactly fall under NLP, training computer systems to respond to the transcribed text is a very exciting application. The process of extracting meaningful and unambiguous text from an audio clip is done using Hidden Markov Models, a variant of the Markov models described before.
The audio clip is sent through a Fourier transform, which splits the stream into its different frequencies (similar to an equalizer), and a correlation algorithm forms an association between the shape of the curve and the (known) word being spoken. A corpus of such training data is formed, which allows the system to assign probabilities to each word spoken by the user. Based on these probabilities, a Markov chain is constructed which gives a reasonable transcription of the spoken words into text.
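The frequency-splitting step can be illustrated with a naive discrete Fourier transform over a synthetic tone. This is a sketch only: real recognisers use an optimised FFT over short overlapping frames of actual microphone audio, not a hand-rolled DFT over eight samples.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: one output bin per frequency."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A pure tone: 8 samples of a sine wave completing exactly 2 cycles.
samples = [math.sin(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = [abs(x) for x in dft(samples)]

# The energy concentrates in bin 2 (and its mirror image, bin 6),
# revealing the tone's frequency — this per-frequency breakdown is
# what the correlation step matches against known words.
print(round(spectrum[2], 1))  # 4.0 (= N/2 for a unit-amplitude sine)
```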
Once the text has been extracted, virtual assistants like Siri, Google Voice Search and Siri-like alternatives on Android such as Iris, Vlingo and Robin parse the text and break it into parts of speech. The second part of the task involves extracting useful information from the text, such as the nature of the query and the best way to respond to it.
These systems use a pre-indexed database for replying to queries, which can include both offline data as well as online search engines to deliver the most relevant response.
Another age-old problem that is seeing a new lease of life is machine translation. Google Translate is a great example of applying modern machine learning techniques to automatic translation, as earlier attempts based on hard-coded rules failed beyond simple phrases. This is still an open problem, however, as the translation of large chunks of text (such as scientific papers, journals or novels) is far from perfect.
Google debuted its updated Voice Search in Android 4.1 Jelly Bean
The website (along with mobile apps for iOS and Android) gamifies the process of learning a new language, rewarding users with points for completing new lessons. The service, which is completely free of cost, tests the newly acquired language skills of its users by asking them to translate documents from and to the language they are learning.
A statistical model is employed that correlates the translated documents submitted by thousands of its users to weed out any mistranslations.
Facebook’s newest feature, Graph Search, employs Markov models to break down the search string into a hierarchy of components that are then used as commands to query the database. For example, “Friends who live in Mumbai” will look for the intersection of all users who satisfy both conditions simultaneously. Permutations of this string such as “Friends in Mumbai” or “My Mumbai friends” also point to the same dataset, as the parsing engine is intelligent enough to distinguish between the words in the string that are essential (friends, Mumbai) and those that are non-essential (live, my). This allows for highly specific searches using data your friends have entered.
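The essential-versus-non-essential distinction can be illustrated by stripping filler words so that different phrasings collapse to the same query. The stopword list and queries below are invented for this sketch; Facebook's actual parser builds a full structured query, not just a word set.

```python
# Toy sketch: reduce different phrasings of the same Graph Search-style
# query to one canonical set of essential terms.
STOPWORDS = {"who", "live", "living", "in", "my", "the", "a"}

def essential_terms(query: str) -> frozenset:
    """Keep only the words that carry the meaning of the query."""
    return frozenset(w for w in query.lower().split() if w not in STOPWORDS)

q1 = essential_terms("Friends who live in Mumbai")
q2 = essential_terms("Friends in Mumbai")
q3 = essential_terms("My Mumbai friends")

# All three phrasings map to the same dataset: {'friends', 'mumbai'}.
print(q1 == q2 == q3)  # True
```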
As statistical models for processing natural language data get more effective, the divide between programming languages and natural languages seems to be closing. Data scientists are finding new ways to parse human languages into tokens of data, leading to new high-level programming languages that lie very close to spoken languages, making programming accessible to a wider audience. Intelligent voice-activated assistants like Siri are indicative of the massive progress that NLP has made over the years, but localising these systems into multiple languages, each with its own idiosyncrasies, is another challenge that engineers are working hard to tackle.
Apple’s Siri employs highly sophisticated classifiers to parse speech input
Facebook’s Graph Search uses NLP to find the most relevant search results