Dimensionality Reduction with Entity Extraction

by Nimrod Priell


One overlooked application of entity extraction (a fairly generalizable technique, now widely available in API form from services such as OpenCalais, Parse.ly and AlchemyAPI) is reducing the size of the feature space in NLP problems. Because of the curse of dimensionality (elucidated in the seminal machine learning text The Elements of Statistical Learning), classifiers generally perform worse as the number of features increases, and this decline often dominates any gains from applying more advanced algorithms such as neural networks, SVMs and so on. While the “data trumps algorithm” adage common in machine learning circles usually instructs us to bring in more training documents, in NLP more data means a larger feature space as well. That growth is roughly O(√N), where N is the corpus size (a consequence of Heaps’ Law), and is bounded above by the size of the vocabulary of the language. However, the problems of increased dimensionality can be (very roughly) described as requiring exponentially more data as dimensions are added, so the net effect strongly favors restricting the dimensionality. An added complication is that using n-grams as features, as is common in NLP, accentuates the growth of the feature space with every new word.
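The sublinear vocabulary growth Heaps’ Law describes is easy to see in a toy simulation. The sketch below (the 50,000-word vocabulary and Zipf-like frequency assumption are purely illustrative, not measurements of any real corpus) draws a synthetic corpus and counts distinct tokens as the corpus grows:

```python
import random

random.seed(0)
# A hypothetical 50,000-word vocabulary with Zipf-like frequencies:
# word k is drawn with probability proportional to 1/k.
vocab = range(1, 50_001)
weights = [1.0 / k for k in vocab]
corpus = random.choices(vocab, weights=weights, k=200_000)

# Distinct tokens grow sublinearly with corpus size, as Heaps' Law predicts:
# doubling the corpus adds far fewer than double the distinct types.
for n in (1_000, 10_000, 100_000, 200_000):
    print(n, len(set(corpus[:n])))
```

Each new document therefore keeps adding features, just at a decreasing rate, which is exactly why “more data” alone does not escape the dimensionality problem.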

A standard solution has always been pruning – removing the most frequent and least frequent tokens – but the cost incurred is that some meaning is lost. With entity extraction, however, we have a solution where, in the right context, no meaning is lost while a significant reduction in the size of the space is attained. Consider the classic “movie review sentiment” problem: many of the tokens will be names of places, cast members or other films, and while these potentially carry predictive weight (a film compared to “Citizen Kane”, for example, might usually suggest a positive review), the underlying hypothesis is that the sentiment the classifier extracts relates to the language used rather than to the subjects of comparison. In other words, what we would like the classifier to do is attach strong weights to the words “notably”, “influential” and perhaps “megamelodrama” in the sentence:

“But Mr. Cameron, who directed the megamelodrama “Titanic” and, more notably, several of the most influential science-fiction films of the past few decades (“The Terminator”, “Aliens” and “The Abyss”)…”

(excerpted from the New York Times’ review of “Avatar”), rather than drag in whatever score has been attached to other movie reviews citing The Terminator and Titanic or comparing to James Cameron. Instead, consider classifying

“But DIRECTOR, who directed the megamelodrama FILM and, more notably, several of the most influential science-fiction films of the past few decades (FILM, FILM and FILM)”

where we have exchanged five rare tokens for two common ones, preserving the meaning of the sentence at least insofar as the limited inferential power of a token-based classifier is concerned.
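The substitution itself can be sketched in a few lines. Here the entity spans and their type labels are hard-coded for illustration; in practice they would come from an entity extraction service such as those named above:

```python
# Map each extracted entity's surface form to its type token.
# (Hard-coded here for illustration; a real pipeline would get these
# spans and types from an entity-extraction API.)
entities = {
    "Mr. Cameron": "DIRECTOR",
    "Titanic": "FILM",
    "The Terminator": "FILM",
    "Aliens": "FILM",
    "The Abyss": "FILM",
}

def normalize(text, entities):
    # Replace longer surface forms first, so "The Terminator" is matched
    # as a whole rather than leaving a stray "The".
    for surface in sorted(entities, key=len, reverse=True):
        text = text.replace(surface, entities[surface])
    return text

review = ('But Mr. Cameron, who directed the megamelodrama "Titanic" and, '
          'more notably, several of the most influential science-fiction films '
          'of the past few decades ("The Terminator", "Aliens" and "The Abyss")')
print(normalize(review, entities))
```

Five distinct rare tokens collapse into the two type tokens DIRECTOR and FILM before the text ever reaches the tokenizer.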

One might consider removing these terms altogether – however, that risks obscuring the meaning of sentences: most likely “notorious DIRECTOR” carries a different weight in a movie review than the mere mention of a character described as notorious in the outline of the film’s plot.

Thus entity extraction acts like a feature hashing technique: the feature space is reduced while terms with a similar effect on the meaning of the sentence are bunched together in one bin. The feature space is usually pervaded by infrequent occurrences of multitudes of names and terms, which makes it impossible to correctly infer a score for n-grams such as “glorious Natalie” (compared to “glorious Angelina”, “glorious Keira” and so forth). With entity extraction it is both reduced in size and enriched with more accurate probability estimates, at the cost of pre-processing the texts through an entity extraction algorithm.
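The binning effect on probability estimates is easy to demonstrate with bigram counts (the ACTOR type token below is an illustrative assumption, mirroring the DIRECTOR/FILM tokens above):

```python
from collections import Counter

# Three reviews that each praise a different actress.
raw = ["glorious Natalie", "glorious Angelina", "glorious Keira"]
# After hypothetical entity normalization, all three map to one bigram.
normalized = ["glorious ACTOR", "glorious ACTOR", "glorious ACTOR"]

def bigram_counts(docs):
    # Each two-word document contributes one bigram.
    return Counter(tuple(d.split()) for d in docs)

print(bigram_counts(raw))         # three distinct bigrams, each seen once
print(bigram_counts(normalized))  # one bigram seen three times
```

Instead of three features, each with a single noisy observation, the classifier sees one feature with three observations – a smaller space and a better-estimated weight.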

Actual gains in accuracy vary so widely with the type, quantity and quality of data, and with the classification algorithm and parameters used, that I hesitate to quote any figure here. Suffice it to say I have gotten significant improvements out of this before, in settings where pruning beyond a point was losing meaning instead of helping proper estimation. As always in machine learning, this is a shot you take and test in a controlled fashion on held-out data to inform yourself of its effectiveness. I would love to hear about your benchmarks using this technique in the comments!