Why your experiment’s impact is probably greater than you think

When we experiment, we try and address hypotheses we believe are blockers (or enablers) for conversion success. Eg if your signup form for a service converts 65%, one can imagine many blockers for the 35% of users that don’t convert:

  • They don’t understand the offer;
  • They don’t trust you;
  • It’s too long to fill out;
  • It requests details they don’t feel comfortable providing;
  • They get stuck unable to answer one of the questions (there’s an answer you hadn’t considered);
  • It’s too slow;
  • It’s buggy;
  • They don’t feel it’s valuable for them to complete this stage

When we experiment, we try and address one such blocker and remove it. Let’s imagine we believe (or used qualitative research to find out) that some people find the form too long. You experiment with removing a few fields – but find it only grows conversions rate by 0.5%. Does that mean the solution of that problem was only an issue for 0.5% of your users? No.

Multiple blockers will cause first few fixes to register as less impact, with increasing impact as we work more on the same blocker, until it plateaus. What happens is that most people are not affected by just one blocker to completing a process/funnel. They might both feel it’s too long and not trust the process, or not trust the process and not understand what we’re asking. As a result, when you start fixing the ‘trust’ issues, if you didn’t touch anything else yet, you can only affect the people who only have the trust issue. Even if you “fixed” trust completely, there are still blockers for the rest of the population. As you remove the next blocker, say “understanding”, you now reap the rewards of both fixing the trust issue beforehand, and of the understanding blocker. The diagram illustrates the dynamic at play here, with hypotheses are called “Trust”, “Time” (I don’t have time to fill this out), “Price” and “Understand” (I don’t understand why I need it).

You can see many dynamics are possible, including something you think you’ve solved (stopped being a bottleneck) becoming once more a bottleneck after other things are solved. Users/clients also change preference, or the audience of users changes and we are exposed to people with a different mix of preferences, resulting in shifting the “blockers”. All of this is to say that it’s important to revisit hypotheses you think you’ve solved before, and it’s important to acknowledge that you won’t see all the fruits of every effort immediately, but sometimes a successful ‘clearing’ of the path to user success is actually reliant on a lot of previous blockers that were removed in a way that was impossible to measure. Essentially – every experiment’s impact is tempered by all the other issues so that you see only a part of its effect.

The false dichotomy of Quant-Qual and the real tradeoff for data-driven product development

Some people intuitively assume (perhaps inspired by the ‘rational-humanist’ dichotomy) that there’s a spectrum with taking data-driven decisions on one hand, and ‘qualitative’/intuitive/user-feedback-driven insights on the other. That’s incorrect. Data-driven product development doesn’t trade off qualitative insights: The best teams I’ve seen (e.g, at Facebook) are integrating both. Qualitative, case-by-case feedback from user interviews provides the observations around which we build hypotheses. We then test these with the data. Using only data will leave us blind to most phenomena, and using just qualitative insights will drive us randomly without a sense of magnitude, hence priority, of the opportunity. Many of my best optimizations in growth came from observing users use the product – then measuring the scale of a problem they encounter. First round just posted a great discussion of the value of qualitative research.

There is one key trade-off to being data driven, however. That trade-off is speed. It both directly costs resources to invest in measuring the state and the effect of almost everything you do, and it slows you down indirectly because data-driven decision making takes time: You need to gather evidence from queries (DB, surveys, experiments, etc.), and then process it with decision-makers – that’s a lot of work. This is why we also want to be wise and measured about what to measure.

“Not everything you can count counts, and not everything that counts is countable”.

Some concrete examples:

A) ‘Fail fast’ in experiments: stop experiments once they show either:

i. There’s a better outcome – even before we have narrow confidence intervals around how much better it is [1].

ii. The confidence intervals are narrow enough to show us the magnitude of the effect is not worthwhile pursuing. If an experiment happens to show we can move time spent between minus one second to adding 3 seconds per user per week, who cares?

B) Don’t try and measure the most metrics, or sometimes the most correct version of a metric, or sometimes even the outcome of every decision. Arbitrary precision requires arbitrary time: Avoid analysis paralysis and focus on the few decisions that matter. Jeff Bezos says it well here: There are only a few decisions that matter for which you want to invest your time in making correctly.

[1] This is the opposite of the practice researchers do for a scientific journal publication (except under the minimization of harm doctrine in some human/animal-subject studies)

Organizing research into seven output types

This post is about a range of options for “deliverables” – types of output – from applied researchers in the industry. I found it helpful to be very explicit with people you hire and when prospecting projects about what the output of your research would be, what the different types trade off, and which of these you expect from researchers on your team (or the business units you work with expect from you).

A key point for me is that in the context of a business need for impact, these outputs are found on different spots along two axes: Friction to impact, and Leverage.

Screenshot 2017-11-25 21.50.22

By ‘friction to impact‘, I mean how much work is needed to actually have business impact from providing this deliverable. A report needs to be read, understood and a decision or action taken by its audience to be impactful; In the drawer it helps no one. A dashboard needs to be regularly used, while a fully automated, integrated system (e.g, some recommendation algorithm) already delivers its impact once integrated into the product.

Leverage, on the other hand, describes some sort of scope that the research solves. A very generic method can be applied to hundreds of cases; An SDK, used across many products (at least with the same programming language) – but a single analysis is limited to one decision at a set point in time (though that decision may turn out to be quite large!)

While not comprehensive, these deliverable types cover much of the output from a wide variety of substantive research topics and disciplines:

A) Descriptive analysis

This type is most similar to a published paper: It summarizes an exploration into a topic. It rests on driving a decision as its mode of delivering impact. Because often other people will make the actual decision and the changes it supports, the most important thing about a descriptive analysis is communicating it well. The expectation, should you end your project with a descriptive analysis, is that you are able to convince other people to take action with it. It doesn’t matter how sophisticated the method (in fact, a great deal of making the decision easier to take is that the method is not sophisticated where unnecessary, or decision-makers will justly suspect a type of method-dependence). It doesn’t matter how long and thorough it is (again, the summarized delivery must be incredibly short). A lot of what matters is the delivery.

Examples are an analysis highlighting a new way to measure the business unit that is more aligned with the actual goal it’s trying to achieve, or highlighting an opportunity that was missed in the data, or describing a new process, or deciding between multiple strategic options, etc.

B) A Method

This is also often encapsulated in a paper, but this output aims to generalize an approach to many possibly datasets or topics. A new method might have enormous potential impact. Imagine you found a more efficient sorting algorithm – but until patented or utilized on a substantive topic of interest, the method remains un-impactful for the business, however impactful it might be on the progress of science. The responsibility is with the researcher to ensure that many people repeatedly adopt the new method. This may be easier to bring about than the change that a descriptive analysis is trying to make, if the method is a widely applicable, practical one. This requires a lot of insight into what others are missing and an interest in persuading others to use the new method in their research.

C) A Dataset

A new dataset may be the result of a research, either through collection of new data or some preliminary analysis and transformation. It needs to be used in another analysis, dashboard or other output to become impactful.

D) A Dashboard

The automation of a descriptive analysis is a dashboard. It is only impactful if people use the dashboard, and so the dashboard needs to be useful: It needs to be easy to understand and use, quick to load, comprehensive in the insight it provides, and be an important part of the process that people undertake on an ongoing basis. The dashboard can be a new or more refined measurement, a monitor helping a new process happen efficiently, providing ongoing insight, etc. You trade the ‘one-off’ nature of a descriptive analysis for continuous ongoing impact, at the cost of effort: Creating the analysis in more robust code that is able to apply it automatically from new data as a report (possibly an interactive one).

E) A Framework

An API, code library/package or generalized executable which isn’t integrated in a product. Like a Method (B) embedded in code, this is again a step closer to leveraging the impact of your research but doesn’t, unless used within other systems, actually generate impact. However, the ease of integration and the general applicability of an API solving a set of problems create enormous potential for impact with highly reduced friction compared to only a method: With a framework, you don’t need to read a complex paper, and then implement it: You just call a function in an SDK or a web API to get the application of the research within your product.

F) A Proof-of-concept

The proof of concept is a step on the way to a standalone system (G), that is nevertheless not providing the impact by itself, but meant to demonstrate that it is possible to build a standalone automated system with the desired impact. In some cases, the scale of impact as a POC is tiny, and a scalable automated system is needed to grow it. In either case, a POC is similar to building a method in the sense that if no standalone automated system is actually built around the POC’s code, no business impact is achieved. It’s hard work to convince other researchers to use your method in their research, and it’s hard to convince business stakeholders to take the action you propose in your descriptive analysis, or use a dashboard you built. But my experience is it’s probably even harder to cobble an engineering team to build your POC into a product/system, without very strong buy-in and explicit understanding in advance that the scope of your work is going to be a POC. There are many labs, teams and companies organizing around researchers building POCs and engineers building them into systems [1]. In my experience, this was the root of several organizational problems and I strongly and expressly avoid POCs as an accepted form of output for my own research teams: All other forms are legit, except I don’t find POCs are ever useful except as a milestone the researcher then goes on to turn into one of the other forms — usually a framework or a standalone/integrated automated system.

G) A Standalone, automated system

This research output is a piece of code, either standalone or (preferably) fully integrated with a key system/product that utilizes some research product to deliver impact. This is often the most impactful form of research, because it both increases the frequency and duration of the impact, and reduces the barriers to adoption to virtually zero. In the classic examples – e.g a classifier making decisions by itself, or a system automatically exploring a space of experiments, or an algorithm deciding the optimal ad bid – it essentially automates and streamlines the decision-making that humans might do with dashboards or by reviewing analyses. It does away with a lot of the dependencies to achieving impact relied upon in all of the previous forms of research: The need to communicate, convince people, or rely on their ongoing use of your research, and instead takes on the responsibility to deliver the impact itself.

Screenshot 2017-11-25 21.50.22

In organizing these outputs, I think two axes are interesting: One is the potential scope of impact, or “reach” of the output. The other is the barriers/friction to creating the impact, and how dependent it is on others or the researcher’s future action. These are somewhat but not entirely inverse. The actual impact also, of course, widely varies with the substantive topic at hand, and of course depends on the research being correct, relevant, successful, etc. So for instance, it is perfectly sensible to choose to work on, and drive more impact from a “low-reach”, one-off descriptive analysis of how to use your company’s spare cash than a complicated method that makes some specific computations more efficient by 0.1% in rare cases, even though in my theory I place ‘methods’ as some of the highest-potential-scope output. But when several of these are applicable, and in the same topic, this may explain why I prefer a system with its “0 friction to impact” and decent reach, over a Proof-of-Concept with only marginally more potential applications and much more friction.

A great applied (industry) research organization constantly drives to further the fulcrum its projects create – driving towards bigger scope, generalized outputs, while simultaneously successfully impacting the business by driving the friction down or taking accountability of and ensuring the impact is realized, often with followup work by the same researcher.

Another aspect that I’ve ignored so far, is the amount of effort required. I avoid ordering these outputs by effort, however – I don’t see how to argue about it coherently nor is it the same for different people with their different preferences and skills. The only general case I see is that a POC must take less effort than a full system, by definition. For the rest, the effort required is on different, incomparable scales.

The key point I try to make is a conscious choice of output needs to be made – and very clearly communicated – as a researcher or manager, and it needs to be explicitly understood by your colleagues and business partners. Significant difficulties that researchers, their managers and their clients experience, in situations when they are all new to the relationship, arise when the researcher expected to build a POC when the business partner wanted a standalone automated system, or the researcher wanted to work on methods published as papers when their managers wanted the team to build frameworks. So I hope this post gave you a language to explore these.

[1] In some cases in the industry, there’s a separation and violation of the original meanings of roles: Engineers were sophisticated implementors figuring out solutions to practical problems – like landing vehicles on the moon or devising steam engines. They were doing applied research, and they definitely didn’t either abstain from the learning and experimentation of research nor the drive to build a fully working solution. I find in some cases in tech we’ve created two classes of people that want to do either one (exploration) or the other (building proven things to spec) but not both, and the result is a problem of division of labor (that I might discuss in another post).

The Measured Life of 1944

My grandfather, Ernest Friedlander, was the quintessential Yekke [1] engineer at a time when their culture played a major part in Israel. He was trained in Germany, like so many of the engineers and architects of that time. In fact, the story goes that when a talk was given at the assembly of engineers in Tel Aviv the audience would not even notice when the speaker would mention “… That method, devised in Eintausendneunhundertzwanzigsieben, is used in…” – interposing the German numbers in the Hebrew text.

My grandfather passed away long before I was born, but only recently, when shuffling through his old papers did my father encounter this – and it immediately brought back vivid memories of his childhood.


The graph, done by hand, is a bit hard to make out with the faded labels and shorthand German. Nevertheless, a single clue almost entirely solves the puzzle: At the very top left corner, encapsulated by a box, it reads ‘8.7.44, 3600g’. That date happens to be my father’s date of birth. And 3.6Kg (8 lbs) was his weight at birth. The amazing part is when you then realize what Ernest had done. He weighed my father daily since he was brought home, a week after he was born. That is the red line. He had also weighed, three times a day, the feeding of my father – both by Brust (breastfed) and then with a supplement, marked red. This constitutes the upper half of each of the pages in the chart. The black line, then, is the net weight, without the food. Finally, he had calculated and drew a regression (trend) line to track the increase in weight. The four pages here are just a small sample my father had framed, recording the first 11 Woche (weeks). There are quite a few more, going up until he stopped breastfeeding. A remarkably tedious and meticulous job, carried on with no break for long months – this is exactly the defining characteristic of the Yekke. But also, it is a reminder how the advanced data culture that we live in today is in fact nothing new. Life was carefully measured even before this guy came around.

Lessons about the history of visualization and the way an engineer expresses his care and devotion aside, for me this discovery was touching because it hints that my fascination with statistics and data – quite an anomaly in my family – might in fact be some sort of trait carried onward by my grandfather’s genes on to me, a part of him that I carry on in my every day life.

[1] a culture of Jewish German nationals who emigrated to Israel mostly in the 30’s, escaping the rapidly Nazifying Germany. Having been mostly secular and deeply entwined in the German middle-class of the time, they are stereotyped as being pedantic, punctual and so pragmatic and rational as to seem unaffectionate.

The EMC Center of Excellence in Israel Launches the First Local Big Data Science Competition of its Kind

EMC Data Science Competition Poster

The EMC Israel COE has launched a Big Data competition open to the Israeli data science community on Kaggle.com, the winner of which will receive a cash prize of $10,000. The competition, which will run until August 2012, is geared towards individuals, groups (of up to five people) and startup companies, and is aimed at increasing awareness of Big Data and the data science profession in particular, while contributing to the creation of new algorithms. EMC Israel invites all those with a background/experience in machine learning, mathematics, statistics, computing, economics and physics – and any other interested parties, to try their luck in solving the challenge that awaits them at the site. Those who enter the competition will receive a real database from an open source code containing thousands of files; the challenge is based on the automatic identification of content, and the prize will go to the party that comes up with the ideal algorithm.

On Compressing Trees with Prüfer Codes

Imagine you’re a lumberjack, and you want to ship a whole forest to some far away place. You’d want to make sure you pack it as tight as you can. No, wait. Imagine you’re a software engineer and you want to transfer a huge forest (just a bunch of trees, or rooted undirected acyclic graphs, that is) across the Internet. How would you do it efficiently?

Let’s make it clear that we’re looking for a way to transfer the tree structure. That is separate from transferring the contents of the nodes. If the trees you’re representing are, for example, social network influences (e.g tweets emitted from a certain person and retweeted by his followers, and theirs, and so on), you can keep the names of the persons involved, the tweet and the time it was retweeted at in a separate message. Then you map each node to a number from 1 to the size of your tree (say, N), and our question is how to transfer these numbers in a way that the other side can understand that node number 3 is a child of node number 11, and 14 is a child of 3, for example.

So let’s start with one tree. Some trivial ideas come to mind, like writing down a node, then the number of children it has, then the identifiers for the child nodes, their number of children, and so forth. You must make a decision to rebuild the tree either depth-first or breadth-first. That’s a fine representation. But it turns out to be a little wasteful. You’re transferring every node (say, out of N nodes), and then you’re also writing down the number of children that each node has. So that’s 2N pieces of information, each of which can take N different values: The number of nodes is comparable to the number of children in the worst case (a tree with 1 level, all nodes being leaves whose parent is the root), so you can’t go much lower than that. You can come up with various tricks to optimize this trivial representation, but we should ask another question instead.

Counting trees

How much can we compress a general tree structure? The question is really, “how many trees are there?”. If we know how many trees there are, and we find a representation whose size is comparable to the number of trees, then we know we have the minimal representation: We can’t go smaller than that, because if there are less words (meaning possible instances of our representation) than there are trees, there will be no way of telling apart two trees that are represented with the same word. If however we end up sending more data than there are trees, then we’re not optimal. What we’re looking for, then, is a bijection between trees and possible words in our representation.

If you’re familiar with Graph Theory, you know that Cayley’s formula for trees says there are NN-2 possible trees with N nodes. If we can have a language with NN-2 words, each representing a different tree, we’ve gotten as lean as we can get. The 2N pieces of information from before have N possible values for each piece (since both the node identifiers themselves and the number of children can be any number from 1 to N*). They represent a space of N2N options, then – much, much larger than the space of trees. Can we find a code that represents the tree using N-2 words with N options each? Think about it – that means using even less than the identifiers of all nodes** – since there are N nodes, but we transfer just N-2 of them, in some order, and get our tree back.

This bijection exists (in fact, several useful variations do), and it’s called Prüfer codes. In a Prüfer code, we assume we both know what N is (or I can just send it as my first word). I then give you N-2 numbers from 1 to N and they tell you how to build the tree. How do I do it? I take a tree, say, (a) from the figure below.

Trees on 5 nodes

Now I take the largest leaf, write down its parent, and remove it. The largest leaf is 4 (5 is not a leaf. 2 and 3 are leaves smaller than 4). I write down its parent’s identifier, “1”, and remove 4 and the edge connecting it to 1 from the tree. The next largest leaf is 3. (Only 2 and 3 are leaves now). I write down its parent, “5”, and remove 3 and the node connecting it to 5. Now only 2 is a leaf, whose parent is 1. Writing “1” down again, we have “1,5,1” so far. The tree is down to just 5 and 1. That’s it. The tree is a connected graph, so the must be an edge between them, and we’re done: Our code for this tree is “1,5,1”. The number of rooted trees, those with a distinguishable root (like I drew them above) is NN-1 – since for every tree I can arbitrarily decide which of the nodes is the root and it’ll make it into a rooted tree: So there are N options for each of the NN-2 trees, for a grand total of NN-1. To transfer a rooted tree, I could’ve then either specifically stated that 5 is my root (say, by specifying the root as the first word, something like “5:1,5,1”) or alternatively I could’ve continued one more step with the Prüfer sequence: Since I have only 2 nodes remaining, taking the parent of the largest leaf is always equivalent to specifying the root. So that’s how you get the Prüfer code for a tree: The code for trees (b) and (c) turns out to be “4,4,1,2” and “2,2,2,2” respectively – Work it to see you follow the idea.

Note that these trees are not ordered: That is, we don’t have a discernible “right” and “left” child, so not all algorithms can benefit from this representation.

The other side, receiving this transmission, now has to recover the tree from the sequence of numbers. I’ll let you work this one out – I actually think it’s a hard but decent interview question to ask someone after having explained the encoding side (maybe after you checked he can FizzBuzz). It can be done in O(n log(n)) time, and more efficient variations of the code allow decoding in O(n sqrt(log n)) time [1].


What’s huge about it is that it’s a bijection. That means it’s not only useful for encoding efficiently, but that (for example) it’s a great way to generate trees uniformly at random. Want a random 1,000,000 node tree to run your algorithm on? Great! Generate 999,998 numbers from 1 to 1,000,000 at random, and run them through Prüfer sequence decoding to get the tree.

Also note that in many cases, it’s still okay if your graph is cyclic: You can create a tree view of that graph by picking a node and doing BFS or DFS from that node.

What’s also surprisingly nice about Prüfer sequences is that they lend themselves very naturally to rooted forests, as well. A trivial extension is specifying you have R trees and there are N total nodes in the network. It can be proved that there are RNN-R-1 possible trees in the forest (in fact, the proof involves just demonstrating the bijection with Prüfer codes), and indeed following the exact same algorithm, keeping in mind to not delete the root nodes from the graph, and adding a symbol for “end of transmission” (or sending the number of words in the message at its beginning), we can transfer the whole forest. In fact, it becomes a very efficient way to transfer sparse forests: A forest with 1,000,000 trees that have no children is just transferred as “N=1000000, End” – Because all of the nodes are root nodes. If each tree has only 1 node, you end up transferring only the root nodes, 500,000 in number (assuming N=1,000,000).

For trees with a large branching factor, the sequence is sometimes also compressible with a general compression algorithm, since you are likely to repeat every parent node multiple times.

(*) Technically the number of children can only go up to N-1. But that hardly saves us anything
(**) I’m cheating again. Because transferring the names of all nodes is a permutation on N, it means transferring N! of data, whereas transferring NN-2 of data allows us to repeat the same term twice; And it is generally larger because ln(n!) = n*ln(n) – n + o(ln(n)) whereas ln(n(n-2)) = n*ln(n) – 2*ln(n).

Dimensionality Reduction with Entity Extraction


One overlooked application of entity extraction (a fairly generalizable technique, it is now widely available in API form like OpenCalais, Parse.ly or AlchemyAPI) is to reduce the size of the feature space in NLP problems. Because of the curse of dimensionality (elucidated in the paramount text on machine learning, The Elements of Statistical Learning) classifiers generally perform worse as the number of features increases, and often this decline in performance dominates any gains due to applying more advanced algorithms such as neural networks, SVMs and so on. While the “data trumps algorithm” adage common in machine learning circles usually instructs us to bring in more training documents, with NLP more data corresponds to an increase in the dimensions of the feature space as well. That increase is normally roughly o( N ) where N is the corpus size (known as Heaps’ Law Heaps’ Law), and is bounded above by the size of the vocabulary of the language. However, the problems of increased dimensionality can be (very roughly) explained as requiring an exponential growth in data, so that the net effect is being far outweighed in favor of restricting the dimensionality. An added complication is that using n-grams as features, as is common in NLP, means the growth of the feature space with every new word is accentuated.

A solution has always been pruning – removal of the most frequent and least frequent tokens – but the costs incurred are that some meaning is lost. Using entity extraction, however, we have a solution where in the right context, no meaning is lost while a significant reduction in the size of the space is attained. Consider the classic “movie review sentiment” problem: Many of the tokens will be names of places, cast or names of other films, and while these do potentially carry predictive weight (for example, a film compared to “Citizen Kane” might usually suggest a positive review), the underlying hypothesis is that the sentiment extracted by the classifier is related to the language used rather than the subjects of comparison. In other words, what we would like the classifier to do is attach strong weights to the words “notably”, “influential” and perhaps “megamelodrama” in the sentence:

“But Mr. Cameron, who directed the megamelodrama “Titanic” and, more notably, several of the most influential science-fiction films of the past few decades (“The Terminator”, “Aliens” and “The Abyss”)…”

(excerpted from the New York Times’ review of “Avatar”), rather than drag in whatever score has been attached to other movie reviews citing The Terminator and Titanic or comparing to James Cameron. Instead, consider classifying

“But DIRECTOR, who directed the megamelodrama FILM and, more notably, several of the most influential science-fiction films of the past few decades (FILM, FILM and FILM)”

where we have eliminated 5 rare tokens for 2 common ones, preserving the meaning of the sentence at least insofar as the limited inferential power of a token-based classifier is concerned.

One may consider removing these terms altogether – however, consider that the meaning of sentences might be obscured, and most likely “notorious DIRECTOR” carries a different weight in a movie review than just the mention of a character known to be notorious in the outline of the film’s plot.

Thus entity extraction acts like a feature hashing technique where the feature space is reduced while terms with similar effect on the meaning of the sentence are bunched up together in a bin. The feature space, which is usually pervaded with unfrequent occurrences of multitudes of names and terms, the consequence of which is the inability to correctly infer score for n-grams such as “glorious Natalie” (compared to “glorious Angelina”, “glorious Keira” and so forth) is both reduced in size and enriched in more accurate probability estimates, at the cost of pre-processing the texts through an entity extraction algorithm.

Actual gains in accuracy are so widely varied by the type, quantity and quality of data, and the classification algorithm and parameters used, that I hesitate to provide any measure here. Suffice it to say I have gotten significant increases out of this before, where pruning beyond a point was losing meaning instead of helping proper estimation. As always in machine learning, this is a shot you take and test in a controlled fashion on unobserved data to inform yourself of its effectiveness. I would love to hear about your benchmarks using this technique in the comments!