Data science is communication

I am a data scientist, but I am not sure my training has prepared me for what this job is about.

It's clear that being effective at my job requires technical skill. I need to be able to clean data and use sensible statistical models to make decisions based on that data. I might be working (as I have) on a model to predict fraud, where I am provided with training and test data, and where I'll be rewarded based on my ability to predict fraud in the test sample. This is a technical task, the type of task I've spent years learning about. And if I'm not done learning, it's easy to read more, just by going on Hacker News.

But, in my experience, much of data science is different. Let me step back: In many companies, data science is a service provided to other areas of the company. It's a service that doesn't come cheap. Given this, my goal as a data scientist is to make sure my work is used. To make that happen, it's almost necessary that my colleagues understand the value of data science. This challenge is not about technology, it's about communication.

It's easy to forget about this when there's a huge number of technical problems to take on. I like the technical part of my job. I like writing code, but even more I like refactoring: I love it when I can transform convoluted code into a few clean and simple lines. I like using statistical models that fit the problem. But, ultimately, all of that matters only if the recipient of my analysis likes using it.

Some rules of thumb

How to make this more likely? There's no silver bullet, but in my experience the following works well:

Simple almost always wins

Let's say I work on a model to predict customer churn. Clearly, it matters that the model is good at detecting churn. At the outset, this seems like a great technical challenge, for which I'm well suited: I've learned to use logistic models, random forests, and neural networks. I've done well in Kaggle competitions.

However, I've come to realize that the predictive power of the model is rarely what decides the fate of the project. In practice, the best choice is usually the simplest model. The more complicated the model, the harder it is to understand what's going on, and that makes it harder for decision-makers to see how the model provides value. If, however, I estimate the churn model using a linear regression and a small set of features, I get a model that's easy to explain and reason about.
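
To make this concrete, here's a minimal sketch of what such a deliberately simple churn model could look like in Python; the data file and the three feature names are hypothetical, and the point is only that the output is a small table of coefficients you can walk a stakeholder through:

    # A deliberately simple churn model: plain linear regression on a 0/1
    # churn flag with a handful of features. File and column names are made up.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("customers.csv")  # one row per customer, 'churned' is 0 or 1

    model = smf.ols("churned ~ tenure_months + support_tickets + monthly_spend",
                    data=df).fit()

    print(model.summary())  # a one-page table of coefficients to discuss with stakeholders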

There are exceptions to the rule that simple models are best; think, for instance, of the recommendation and search-ranking systems run by the largest technology companies.

These all have in common that they are applied at a huge scale, a scale at which small improvements in the algorithm's performance have enormous value. Further, they are the result of years spent iteratively improving algorithms that started out much simpler than they are today.

Talk your way out of an imperfect specification

A technician's dream is to have a well-specified problem to work on. For instance, an agile coding process requires that all tasks are split into small pieces and specified well enough that a coder can complete them according to the specification.

While this is probably a good process for building software (I'm not enough of a software engineer to have a strong view), my feeling is that it won't work as well for data science. Most data science tasks are incompletely specified. They come in as a request like "build a good model to predict churn", and I'm not sure there's a good way to make this request fully specified.

Instead, my goal is to let the stakeholders know how I'm approaching the request and what the costs and benefits of my time are. Here's one way to express this: "Within a week, we can have a first model that'll help us understand what's driving churn. But if you want to generate automatic reports from this model, it'll take at least another week". And if that estimate changes drastically, I need to update whoever is involved in the project.

My experience is that it's hard to over-communicate these tradeoffs (at the same time, it's easy to over-communicate how I'm doing the analysis). For me, a rule of thumb is that I shouldn't go longer than a week without discussing my work with the business stakeholder. The biggest reason to talk (or email) is that it's costly and frustrating to work on things that aren't wanted. But there's a nice added benefit: by talking often, it's easier to build excitement for the work.

Speaking the language

Each industry has its own language, just as statistics and computer science do. That's a recipe for confusion. Fortunately, there is one language every business has in common, and that is money. This means that we can do better than evaluating a model purely on statistical metrics: I'm not done until I can estimate how much money the model will generate. Estimating this is hard, and often requires wild assumptions. But I'd rather make these assumptions explicit, because if I don't, others will have to make them implicitly, and then it's much harder to decide whether the model is good or not.
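
To sketch what that translation can look like, here's a back-of-envelope estimate for a churn model; every number in it is a made-up assumption that would have to be replaced with the business's own figures, which is exactly the conversation I want to have:

    # Back-of-envelope value of a churn model. Every input is an explicit,
    # made-up assumption; the point is that they are written down.
    customers_flagged = 2_000     # customers the model flags per quarter
    precision = 0.30              # share of flagged customers who would really churn
    offer_success_rate = 0.25     # share of true churners a retention offer keeps
    value_per_retained = 1_000    # margin kept per retained customer
    cost_per_offer = 50           # incentive sent to every flagged customer

    customers_saved = customers_flagged * precision * offer_success_rate
    value = customers_saved * value_per_retained - customers_flagged * cost_per_offer
    print(f"Estimated quarterly value: {value:,.0f}")  # here: 50,000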

There is more to finding a common language: it also means learning about the product, because that makes it easier to tell everyone else in the company about the work you are doing. Even if the models don't require product knowledge (sometimes they do, sometimes they don't), they will often be used by co-workers or clients who care nothing about the statistics and everything about the product.

To conclude

All this matters for being effective. But it's about more than that: it makes it easier for me to be part of a team that provides value, and to be acknowledged for that too. At work, it's hard to find something better.


Thanks to Peter Cohen for feedback and suggestions.