The EMC Israel COE has launched a Big Data competition on Kaggle.com, open to the Israeli data science community, the winner of which will receive a cash prize of $10,000. The competition, which will run until August 2012, is geared towards individuals, groups (of up to five people) and startup companies, and is aimed at increasing awareness of Big Data and the data science profession in particular, while contributing to the creation of new algorithms. EMC Israel invites all those with a background or experience in machine learning, mathematics, statistics, computing, economics and physics – and any other interested parties – to try their luck at solving the challenge that awaits them at the site. Those who enter the competition will receive a real dataset, drawn from open-source code and containing thousands of files; the challenge is based on the automatic identification of content, and the prize will go to the party that comes up with the best algorithm.
Imagine you’re a lumberjack, and you want to ship a whole forest to some faraway place. You’d want to make sure you pack it as tight as you can. No, wait. Imagine you’re a software engineer and you want to transfer a huge forest (just a bunch of trees – rooted, undirected, acyclic graphs, that is) across the Internet. How would you do it efficiently?
Let’s make it clear that we’re looking for a way to transfer the tree structure. That is separate from transferring the contents of the nodes. If the trees you’re representing are, for example, social network influences (e.g. tweets emitted by a certain person and retweeted by his followers, and theirs, and so on), you can keep the names of the persons involved, the tweet and the time it was retweeted at in a separate message. Then you map each node to a number from 1 to the size of your tree (say, N), and the question is how to transfer these numbers in a way that the other side can understand that node number 3 is a child of node number 11, and 14 is a child of 3, for example.
So let’s start with one tree. Some trivial ideas come to mind, like writing down a node, then the number of children it has, then the identifiers of the child nodes, their numbers of children, and so forth. You must also decide whether to rebuild the tree depth-first or breadth-first. That’s a fine representation, but it turns out to be a little wasteful. You’re transferring every node (say, out of N nodes), and then you’re also writing down the number of children that each node has. So that’s 2N pieces of information, each of which can take N different values: the number of children is comparable to the number of nodes in the worst case (a tree with one level, all nodes being leaves whose parent is the root), so you can’t go much lower than that. You can come up with various tricks to optimize this trivial representation, but we should ask another question instead.
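A quick sketch of this trivial scheme (the function and the dictionary encoding of the tree are my own illustration, not a standard format), serializing depth-first:

```python
def encode_trivial(tree, root):
    """Depth-first serialization: write each node's identifier,
    then its child count, then recurse into its children."""
    children = tree.get(root, [])
    out = [root, len(children)]
    for c in children:
        out.extend(encode_trivial(tree, c))
    return out

# The rooted tree 1 -> {2, 3}, 3 -> {4}:
encode_trivial({1: [2, 3], 3: [4]}, 1)
# -> [1, 2, 2, 0, 3, 1, 4, 0] -- 2N numbers for N = 4 nodes
```

Note the output is exactly the 2N pieces of information discussed above: one identifier and one child count per node.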
How much can we compress a general tree structure? The question is really, “how many trees are there?”. If we know how many trees there are, and we find a representation whose size is comparable to the number of trees, then we know we have the minimal representation: We can’t go smaller than that, because if there are fewer words (meaning possible instances of our representation) than there are trees, there will be no way of telling apart two trees that are represented by the same word. If, however, we end up sending more data than there are trees, then we’re not optimal. What we’re looking for, then, is a bijection between trees and possible words in our representation.
If you’re familiar with Graph Theory, you know that Cayley’s formula for trees says there are N^(N-2) possible trees with N nodes. If we can have a language with N^(N-2) words, each representing a different tree, we’ve gotten as lean as we can get. The 2N pieces of information from before have N possible values for each piece (since both the node identifiers themselves and the number of children can be any number from 1 to N*). They represent a space of N^(2N) options, then – much, much larger than the space of trees. Can we find a code that represents the tree using N-2 words with N options each? Think about it – that means using even less than the identifiers of all nodes** – since there are N nodes, but we transfer just N-2 of them, in some order, and get our tree back.
This bijection exists (in fact, several useful variations do), and it’s called a Prüfer code. For a Prüfer code, we assume we both know what N is (or I can just send it as my first word). I then give you N-2 numbers from 1 to N, and they tell you how to build the tree. How do I do it? I take a tree – say, (a) from the figure below.
Now I take the largest leaf, write down its parent, and remove it. The largest leaf is 4 (5 is not a leaf; 2 and 3 are leaves smaller than 4). I write down its parent’s identifier, “1”, and remove 4 and the edge connecting it to 1 from the tree. The next largest leaf is 3 (only 2 and 3 are leaves now). I write down its parent, “5”, and remove 3 and the edge connecting it to 5. Now only 2 is a leaf, whose parent is 1. Writing “1” down again, we have “1,5,1” so far. The tree is down to just 5 and 1. That’s it. The tree is a connected graph, so there must be an edge between them, and we’re done: Our code for this tree is “1,5,1”.

The number of rooted trees, those with a distinguishable root (like I drew them above), is N^(N-1) – since for every tree I can arbitrarily decide which of the nodes is the root and that will make it into a rooted tree: So there are N options for each of the N^(N-2) trees, for a grand total of N^(N-1). To transfer a rooted tree, I could then either specifically state that 5 is my root (say, by specifying the root as the first word, something like “5:1,5,1”), or alternatively I could continue one more step with the Prüfer sequence: Since I have only 2 nodes remaining, taking the parent of the largest leaf is always equivalent to specifying the root. So that’s how you get the Prüfer code for a tree: The codes for trees (b) and (c) turn out to be “4,4,1,2” and “2,2,2,2” respectively – work them out to make sure you follow the idea.
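The encoding side, as a sketch in Python (using the largest-leaf convention from this walkthrough; the classical presentation removes the smallest leaf instead, but any fixed convention works):

```python
from collections import defaultdict

def prufer_encode(edges, n):
    """Prüfer-encode a labeled tree on nodes 1..n: repeatedly
    record the parent of the largest remaining leaf, then remove
    that leaf, until only two nodes are left."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        leaf = max(v for v in adj if len(adj[v]) == 1)
        parent = next(iter(adj[leaf]))
        seq.append(parent)
        adj[parent].discard(leaf)
        del adj[leaf]
    return seq

# Tree (a): 4 and 2 hang off 1, 3 hangs off 5, and 5 connects to 1.
prufer_encode([(1, 4), (1, 2), (5, 3), (5, 1)], 5)  # -> [1, 5, 1]
```

For the rooted variant, you would run the loop one extra step, as described above.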
Note that these trees are not ordered: That is, we don’t have a discernible “right” and “left” child, so not all algorithms can benefit from this representation.
The other side, receiving this transmission, now has to recover the tree from the sequence of numbers. I’ll let you work this one out – I actually think it’s a hard but decent interview question to ask someone after having explained the encoding side (maybe after you’ve checked they can FizzBuzz). It can be done in O(n log n) time, and more efficient variations of the code allow decoding in O(n sqrt(log n)) time.
What’s huge about it is that it’s a bijection. That means it’s not only useful for encoding efficiently, but that (for example) it’s a great way to generate trees uniformly at random. Want a random 1,000,000 node tree to run your algorithm on? Great! Generate 999,998 numbers from 1 to 1,000,000 at random, and run them through Prüfer sequence decoding to get the tree.
Also note that in many cases, it’s still okay if your graph is cyclic: You can create a tree view of that graph by picking a node and doing BFS or DFS from that node.
What’s also surprisingly nice about Prüfer sequences is that they lend themselves very naturally to rooted forests as well. A trivial extension is specifying that you have R trees and N total nodes in the network. It can be proved that there are R·N^(N-R-1) possible forests of R rooted trees on N nodes (in fact, the proof involves just demonstrating the bijection with Prüfer codes), and indeed, following the exact same algorithm – keeping in mind not to delete the root nodes from the graph, and adding a symbol for “end of transmission” (or sending the number of words in the message at its beginning) – we can transfer the whole forest. In fact, it becomes a very efficient way to transfer sparse forests: A forest of 1,000,000 trees that have no children is transferred as just “N=1000000, End” – because all of the nodes are root nodes. If each tree has only 1 child, you end up transferring only the root nodes, 500,000 in number (assuming N=1,000,000).
For trees with a large branching factor, the sequence is sometimes also compressible with a general compression algorithm, since you are likely to repeat every parent node multiple times.
(*) Technically, the number of children can only go up to N-1, but that hardly saves us anything.
(**) I’m cheating again. Transferring the identifiers of all N nodes in some order is a permutation, so it spans a space of N! possibilities, whereas a space of N^(N-2) sequences allows us to repeat the same term twice; and the latter is generally larger, because ln(N!) = N·ln(N) − N + O(ln(N)) whereas ln(N^(N-2)) = N·ln(N) − 2·ln(N).
One overlooked application of entity extraction (a fairly generalizable technique, now widely available in API form through OpenCalais, Parse.ly or AlchemyAPI) is reducing the size of the feature space in NLP problems. Because of the curse of dimensionality (elucidated in the paramount text on machine learning, The Elements of Statistical Learning), classifiers generally perform worse as the number of features increases, and often this decline in performance dominates any gains from applying more advanced algorithms such as neural networks, SVMs and so on. While the “data trumps algorithm” adage common in machine learning circles usually instructs us to bring in more training documents, with NLP more data corresponds to an increase in the dimensions of the feature space as well. That increase is normally roughly O(√N), where N is the corpus size (this is known as Heaps’ Law), and is bounded above by the size of the vocabulary of the language. However, the problems of increased dimensionality can be (very roughly) characterized as requiring exponential growth in data, so the net effect usually favors restricting the dimensionality. An added complication is that using n-grams as features, as is common in NLP, accentuates the growth of the feature space with every new word.
The traditional solution has been pruning – removal of the most frequent and least frequent tokens – but the cost is that some meaning is lost. Using entity extraction, however, we have a solution where, in the right context, no meaning is lost while a significant reduction in the size of the space is attained. Consider the classic “movie review sentiment” problem: Many of the tokens will be names of places, cast members or other films, and while these do potentially carry predictive weight (for example, a film compared to “Citizen Kane” might usually suggest a positive review), the underlying hypothesis is that the sentiment extracted by the classifier is related to the language used rather than the subjects of comparison. In other words, what we would like the classifier to do is attach strong weights to the words “notably”, “influential” and perhaps “megamelodrama” in the sentence:
“But Mr. Cameron, who directed the megamelodrama “Titanic” and, more notably, several of the most influential science-fiction films of the past few decades (“The Terminator”, “Aliens” and “The Abyss”)…”
(excerpted from the New York Times’ review of “Avatar”), rather than drag in whatever score has been attached to other movie reviews citing The Terminator and Titanic or comparing to James Cameron. Instead, consider classifying
“But DIRECTOR, who directed the megamelodrama FILM and, more notably, several of the most influential science-fiction films of the past few decades (FILM, FILM and FILM)”
where we have eliminated 5 rare tokens for 2 common ones, preserving the meaning of the sentence at least insofar as the limited inferential power of a token-based classifier is concerned.
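The substitution step itself might look like this sketch (the span format and the mask_entities helper are mine for illustration; in practice the typed spans come back from whichever extraction API you call):

```python
def mask_entities(text, spans):
    """Replace each extracted entity span with its type token.
    `spans` holds (start, end, type) triples, assumed to come
    from your entity-extraction API of choice."""
    for start, end, etype in sorted(spans, reverse=True):
        # Replace back-to-front so earlier offsets stay valid.
        text = text[:start] + etype + text[end:]
    return text

review = 'Mr. Cameron, who directed the megamelodrama "Titanic"'
spans = [(4, 11, "DIRECTOR"), (45, 52, "FILM")]
mask_entities(review, spans)
# -> 'Mr. DIRECTOR, who directed the megamelodrama "FILM"'
```

The masked text then goes through tokenization and feature extraction exactly as before.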
One might remove these terms altogether – but then the meaning of sentences can be obscured, and most likely “notorious DIRECTOR” carries a different weight in a movie review than just the mention of a character known to be notorious in the outline of the film’s plot.
Thus entity extraction acts like a feature hashing technique where the feature space is reduced while terms with a similar effect on the meaning of the sentence are bunched together in a bin. The feature space – usually pervaded with infrequent occurrences of multitudes of names and terms, with the consequence that scores cannot be correctly inferred for n-grams such as “glorious Natalie” (compared to “glorious Angelina”, “glorious Keira” and so forth) – is both reduced in size and enriched with more accurate probability estimates, at the cost of pre-processing the texts through an entity extraction algorithm.
Actual gains in accuracy are so widely varied by the type, quantity and quality of data, and the classification algorithm and parameters used, that I hesitate to provide any measure here. Suffice it to say I have gotten significant increases out of this before, where pruning beyond a point was losing meaning instead of helping proper estimation. As always in machine learning, this is a shot you take and test in a controlled fashion on unobserved data to inform yourself of its effectiveness. I would love to hear about your benchmarks using this technique in the comments!
If you do big data, you want to know about sparse matrices. In many contexts the data you have is represented as matrices or tables, and in many cases these have a lot of entries with a value of 0 (or no value, which is imputed as zero – not always a benign practice). In machine learning, you’d run into these whenever you have features that are categorical (also known as factors) – i.e. a feature that can take a value from a predefined set of values (for example, if you’re classifying cars, the type of the car – ‘Sports’, ‘Compact’ or ‘SUV’ – might be a factor). These are usually encoded in the feature matrices used by your learning algorithms using a column per category, with a value of 1 if that instance (row) belongs to the category or 0 if it doesn’t. Thus, if each row belongs to only one category, it will contain at least as many zeroes as the number of categories minus 1. Another common case is natural language processing, where you tokenize huge documents and then count the occurrences (or use TF/IDF scoring) of the tokens in each document. You then end up with a dictionary of a few thousand or tens of thousands of terms (more if you use n-grams), but each document only has a small percentage of these actually appearing. Another interesting case is representing graphs as adjacency matrices, where for most networks the connectivity is pretty low, resulting in a very sparse matrix.
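The car-type factor above, encoded as one column per category (a minimal sketch; the column order is arbitrary):

```python
def one_hot(values, categories):
    """One column per category: 1 where the row belongs to it, else 0."""
    return [[1 if v == c else 0 for c in categories] for v in values]

cars = ["Sports", "SUV", "Sports", "Compact"]
one_hot(cars, ["Sports", "Compact", "SUV"])
# -> [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
# each row carries (number of categories - 1) zeroes
```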
The reason you should care about this is that keeping these matrices in their full splendor requires huge amounts of memory: If you hold data for 100,000 documents with 10,000 features each (hardly an extreme case), each feature encoded as an 8-byte double, you have an 8 GB problem. If each document on average has only 5% of the features, there’s really just 400 MB of information there (although you’d need a constant multiple of this to actually hold the sparse representation – it is still a manageable amount). Even with various workarounds, usually you just can’t afford to start operating on these matrices in memory. The issue here is one of feasibility and not so much runtime, although you do get major performance gains out of using sparse matrices in many cases.
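The arithmetic behind those numbers:

```python
rows, cols = 100_000, 10_000
dense_bytes = rows * cols * 8           # one 8-byte double per entry
nonzeros = int(rows * cols * 0.05)      # 5% of entries actually carry data
sparse_bytes = nonzeros * 8
print(dense_bytes // 10**9, "GB dense,",
      sparse_bytes // 10**6, "MB of actual information")
# -> 8 GB dense, 400 MB of actual information
```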
Performance in operations on sparse matrices is attained by using production-level, field-tested C/Java libraries, and I do not recommend trying to implement these on your own – instead use MATLAB, Matrix-toolkit-java, SciPy, etc. But in case your code is written on some other platform (maybe Ruby) and you need to somehow get these matrices to your scientific code, you’ll want to read the features from somewhere (like your DB, or by generating them from documents you tokenize), create a matrix, and perhaps do rudimentary operations on it (selecting only a sub-matrix out of the whole for splitting into cross-validation sets or doing sub-bagging, perhaps, or feature scaling). You want to be able to hold these matrices in memory and work with them.
The go-to book for sparse matrices is Direct Methods for Sparse Linear Systems. I highly recommend it – it’s an indispensable resource for all kinds of algorithms operating on these matrices, and contains the C code that is the essence of MATLAB’s sparse matrix mojo. Here I just briefly discuss the considerations and methods, but the book gives the details for moving between representations and operating on them.
So the main thing we’d like to deal with is how to represent these matrices, in memory or in a file. The natural idea that comes to mind is what is called a Triplet representation. In this, each non-zero entry is written down as a row index, column index and value. The resulting size of the representation is 3Z where Z is the number of non-zero entries in the matrix, a common factor for evaluating the scale and asymptotics of operations on sparse matrices. This format is trivial to implement.
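A minimal sketch of the triplet form (pure Python for illustration; a production implementation would come from SciPy, CSparse or the like):

```python
def to_triplets(dense):
    """One (row, column, value) triple per non-zero entry: 3Z numbers."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]

M = [[0.0, 0.0, 3.0],
     [4.0, 0.0, 0.0],
     [0.0, 0.0, 5.0]]
to_triplets(M)  # -> [(0, 2, 3.0), (1, 0, 4.0), (2, 2, 5.0)]
```

Here Z = 3, so the 3×3 matrix is held as 9 numbers – no saving yet at this toy size, but the representation grows with Z rather than with N·M.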
However, it turns out it’s not the best format for many of the operations you commonly want to do. In particular you either end up having to use hashes to find your way in the matrix, or you have to scan the triplet list in order to operate on a specific row or column.
What does work well for almost all operations are the CSC and CSR representations. These two are mirror images – one is geared towards row operations and the other towards operating on columns. They both involve 3 arrays: two of size Z, and one of size N+1, where N is one of the dimensions of your matrix (the number of rows for CSR, the number of columns for CSC). So they turn out to actually be a little smaller than the triplet representation in most cases, and the third array acts as an index that makes operations like picking out certain rows or certain columns easy to implement efficiently. Consult better sources for understanding these representations in depth; I’m just pointing them out.
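A sketch of the CSR layout (CSC is the same idea transposed). The row-pointer array is what makes row operations cheap: row i’s entries sit at positions row_ptr[i] through row_ptr[i+1]:

```python
def to_csr(dense):
    """CSR: values and column indices (size Z each), plus row
    pointers (size number-of-rows + 1)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

M = [[0.0, 0.0, 3.0],
     [4.0, 0.0, 0.0],
     [0.0, 0.0, 5.0]]
to_csr(M)  # -> ([3.0, 4.0, 5.0], [2, 0, 2], [0, 1, 2, 3])
```

For example, row 1 of M is values[row_ptr[1]:row_ptr[2]] at columns col_idx[row_ptr[1]:row_ptr[2]] – no scanning required.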
Pat yourself on the back if you figured out you can get below 2Z + N (or 2Z + M) by looking at the matrix as a 1-dimensional array with N*M entries and holding just 2 arrays – one with the non-zero values and one with the indices in the 1-dimensional array corresponding to these entries – but don’t bother implementing that. Saving the extra N or M of space is negligible compared to Z (in the 8 GB example above, it would amount to saving 40 KB or 400 KB), and you’d either have to access everything through a hash (which would end up taking all the space you saved in the extra structure required) or scan the whole array to get a single entry.
There are a fair number of standards around sparse matrix formats, such as the MatrixMarket format or Harwell-Boeing. But the underlying implementation is always one of the three – triplets, CSR or CSC – except when the matrix is known to be in a very specific form such as a diagonal or band matrix, which is usually not relevant for general problems or machine learning problems in particular. As a case in point, standard machine learning libraries like LibSVM or LibLinear also take their input in a kind of CSR format. You might also find that your problem requires special cases not handled by these formats, such as saving additional arrays identifying the rows (row names) or the features (column names), and these need to be handled appropriately as well. Hence, unless you integrate specifically with software that uses one of these formats, you will probably be better off ignoring them and using whatever’s appropriate for you. You do want, however, a general library for reading the formats you decided on, working with them in memory and changing the representation between the formats. You can always convert from a basic triplet representation to any of the formats using SMC. I am putting together a little library with Ruby code for working with these representations up on GitHub. Feel free to contribute – Ruby seems to be lacking support for these kinds of things.
There are more technicalities involved once you deal with really huge matrices. For example, you might not be able to guarantee that you can allocate consecutive arrays of size Z in memory or that indexes into them fit in an integer. The Ruby code ignores these cases but if you work at that scale these are things to consider – you will need to work around these issues, for instance by building each of the arrays in the representation from smaller chunks.
The main issue with sparse matrices is keeping them sparse and never representing them fully in memory. For example, mean-centering a sparse matrix is a bad idea, as you will lose the sparsity altogether. Changing a single zero entry of the matrix to non-zero (sometimes referred to as changing the sparsity pattern of the matrix) is hard in CSR or CSC, and you are better off changing the representation to triplets, adding a bunch of triplet entries, and changing the matrix back to its compressed form. The same goes for initially building the matrix: Build it from your data source as a triplet matrix, and then compress it to gain the size and efficiency of operations. I recommend you read through my code, adapted from CSparse, to see how that conversion is done.
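That triplets-then-compress conversion is essentially a counting pass followed by a placement pass. A pure-Python sketch, with triplets as (row, column, value) tuples (see CSparse or the book for the production version):

```python
def triplets_to_csr(triplets, n_rows):
    """Compress a triplet list to CSR: count entries per row,
    turn the counts into row pointers, then place each value."""
    row_ptr = [0] * (n_rows + 1)
    for i, _, _ in triplets:
        row_ptr[i + 1] += 1
    for i in range(n_rows):            # running sum -> row pointers
        row_ptr[i + 1] += row_ptr[i]
    values = [0.0] * len(triplets)
    col_idx = [0] * len(triplets)
    next_slot = list(row_ptr[:-1])     # next free position per row
    for i, j, v in triplets:
        k = next_slot[i]
        values[k], col_idx[k] = v, j
        next_slot[i] += 1
    return values, col_idx, row_ptr

triplets = [(0, 2, 3.0), (1, 0, 4.0), (2, 2, 5.0)]
triplets_to_csr(triplets, 3)
# -> ([3.0, 4.0, 5.0], [2, 0, 2], [0, 1, 2, 3])
```

Both passes are linear in Z, so the compression itself never dominates.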
How do I know if my matrix is sparse?
There are many measures for the sparsity of a matrix, but probably the most useful one is the simplest: the ratio of the number of zero entries to the total number of entries in the matrix, N*M. This gives us a measure in [0, 1], where 1 is the zero matrix and 0 is a full matrix. Given the representations above, you can see that in terms of space it becomes worthwhile to hold the matrix as CSR or CSC once the sparsity is a little over 0.5 (over 0.5 by about 1/M or 1/N, to be precise), and in triplet format once it’s over 0.66. But I would still treat such matrices as non-sparse – in fact, hold them as non-sparse as long as you can. When they start taking up 2 GB or more of memory, you’ll know you have no choice but to use sparse representations (or move to bigger servers).
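As a sketch of the measure:

```python
def sparsity(dense):
    """Fraction of entries that are zero: 1.0 for the zero matrix."""
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for v in row if v == 0)
    return zeros / total

sparsity([[0.0, 0.0, 3.0],
          [4.0, 0.0, 0.0],
          [0.0, 0.0, 5.0]])
# -> 0.666..., right around the triplet break-even point
```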