Structuring data analysis code

Structuring data analysis code: Use cases

Many data scientists, when working on an analysis, write code that is essentially one long script. It might look something like this:

# Hundreds of lines to read data 

# Hundreds of lines to clean data

# Hundreds of lines to calculate results

# Hundreds of lines to plot results

To many experienced coders who aren't data scientists, there's a huge problem here: this code is a prime candidate for refactoring. In other words, it should be rewritten, without changing its functionality, before doing anything more with it. Many coders have learnt, through painful lessons, that refactoring is a big part of writing software effectively.

But the work of a data scientist is a specific type of coding. For instance, this analysis might never become part of a bigger piece of code. All that matters are the resulting graphs. So it's not immediately clear that this code needs to be refactored. At the least, we'd need good use cases before refactoring.

Here, I'm providing a few of these use cases. Let me also be clear that code in the above style will, in some cases, actually be completely fine. Instead of trying to convince you that refactoring is always needed (it is not), I want to show useful coding patterns that become much easier with refactored code.

And please: if you know other patterns, or if you want to tell me why you agree or disagree, do email me. If you like the refactored code and want to learn more about how to get there, let me know as well; that could well become its own post.

A general structure for data analysis code

I've found, over years of using different approaches, that the following type of structured code is applicable to most of my analyses:

data_raw = get_data()
data_prepared = prepare_data(data_raw)
results = analyze_data(data_prepared)
plot_results(results)

Note: For the rest of this article, I'm using Python syntax, but all these points hold equally well for R or Julia (or whatever other language you are using for data science).
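
Concretely, each of these functions just wraps one section of the original script. A minimal sketch, with the comments standing in for the original code and the bodies left as placeholders:

def get_data():
    # Hundreds of lines to read data, ending in: return data_raw
    ...

def prepare_data(data_raw):
    # Hundreds of lines to clean data, ending in: return data_prepared
    ...

def analyze_data(data_prepared):
    # Hundreds of lines to calculate results, ending in: return results
    ...

def plot_results(results):
    # Hundreds of lines to plot results
    ...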

So far, the functionality is identical to what we had before. So why did we do it? Because it makes the code easier to change later. This is the general idea behind refactoring, and I've never seen it more concisely expressed than in Kent Beck's tweet:

for each desired change, make the change easy (warning: this may be hard), then make the easy change

So let's do some things that are now easy! In the rest of this post, I'll show patterns that will come in handy and that are easy once the code is refactored.

Sampling

We add a function that samples the raw data at a certain fraction. With this function written, let's make the analysis roughly 10 times as fast by taking a 10% sample:

do_sample = True
data = sample_data(data_raw, frac=0.1) if do_sample else data_raw
data_prepared = prepare_data(data)

Instead of just sampling the data, we also added a flag for sampling that we can turn on or off.

Of course, it would've been possible to sample without refactoring, but that would have been more error-prone. Sampling changes your data, and if that data is accessible as a global variable, you need to be careful not to accidentally keep using the old, unsampled version. In the refactored code, such mistakes are no longer possible.
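
For reference, if the raw data lives in a pandas DataFrame, sample_data can be a one-liner around the built-in sampling method (a sketch; the exact implementation depends on how your data is stored):

def sample_data(data_raw, frac):
    # Keep a random fraction of the rows; a fixed seed makes the sample reproducible
    return data_raw.sample(frac=frac, random_state=0)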

Running different versions of the analysis

Let's say your boss asks you to do the analysis by US region. Ok, how about:

results_all = {region: analyze_data(data_prepared.query("region == @region"))
               for region in regions}

Without refactoring, you have two quite ugly choices. Either you do this:

for region in regions:
    # Hundreds of lines to read data 

    # Hundreds of lines to clean data

    # Hundreds of lines to calculate results

    # Hundreds of lines to plot results

Or the following:

for region in regions:
    # Hundreds of lines to read data 

for region in regions:
    # Hundreds of lines to clean data

for region in regions:
    # Hundreds of lines to calculate results

for region in regions:
    # Hundreds of lines to plot results

These two choices are worse than ugly: they are likely to lead to errors. Each time you run some code, you must remember which region is currently selected. Of course, you will get this right almost all the time, but since all your analysis happens inside this nested construct, you will run it many times, often enough to get it wrong occasionally.
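
And because plotting is wrapped in a function too, producing one set of figures per region is just another short loop. Here I assume plot_results accepts an optional title argument, which is not part of the original signature:

for region, region_results in results_all.items():
    # One set of figures per region; the title argument is a hypothetical extension
    plot_results(region_results, title=region)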

Checking the results

It's useful to write automatic tests that check whether the data behaves the way you think it should behave. For instance, we might want to assert that, when we are analyzing state-by-state data, different parts of the data have the same number of states:

assert results['num_states'] == results['state_results'].shape[0]

If your code is all one long script, you need to make sure that every time you run the analysis, you also run the checks. And that is easy to forget! In the refactored version, the check can be built into the analysis function:

def check_results(results):
    assert results['num_states'] == results['state_results'].shape[0]

def analyze_data(data_prepared):
    # The dictionary storing the results
    results = {}

    # Calculating everything in the results
    results['effects'] = analysis.treatment_effects(data_prepared)
    results['num_states'] = analysis.number_states(data_prepared)
    results['state_results'] = analysis.state_trends(data_prepared)

    # Now that the results are fully assembled, check whether they pass the sanity checks
    check_results(results)

    return results

This makes it impossible to forget the check: the function will crash if anything is wrong with your analysis.
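
The same pattern works one step earlier in the pipeline: prepare_data can check the cleaned data before returning it. A sketch, assuming the prepared data is a pandas DataFrame with region and state columns (clean_data is a placeholder for the actual cleaning code):

def check_data_prepared(data_prepared):
    # Every row should carry a region and a state
    assert data_prepared['region'].notna().all()
    assert data_prepared['state'].notna().all()

def prepare_data(data_raw):
    data_prepared = clean_data(data_raw)  # placeholder for the cleaning steps
    check_data_prepared(data_prepared)
    return data_prepared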

A conclusion, and a look back in time

Structuring your code into functions will make you a happier data scientist. But let me be clear about what I am not saying: I am not saying that your code should be wrapped up into nice little functions from the moment you start your analysis. Instead, it can make sense to start with a script in which you have access to all your data and can run many checks on it before deciding how to actually do the analysis.

That's the starting point. Once you understand better how your analysis works, you can start wrapping individual steps into their own functions, one by one, until the code resembles the structure above. This is tedious the first few times you do it (after that, you get much better at it), but it's usually worth the cost: at the very least, it lets you save lots of analysis time through sampling, and it makes it more likely that your analysis is not plain wrong.
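
For instance, a first incremental step might be to wrap only the data reading, while the rest of the script stays as it is:

def get_data():
    # The hundreds of lines to read data move in here
    ...

data_raw = get_data()

# Hundreds of lines to clean data

# Hundreds of lines to calculate results

# Hundreds of lines to plot results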

Much of what I learned about refactoring comes from the painful experience of working with unstructured code (written by others, but also written by me). I'm trained as an economist, and thus have spent a fair part of my life coding in Stata - it's simply the predominant statistical tool in economics.

While Stata is a wonderful tool for a few specific tasks, it is deficient as a tool for writing structured data analysis code, because the language does not support functions in the way that high-level programming languages do. Without functions, it's not possible to encapsulate a data pipeline into clean components, and that makes any analysis harder to test. Much of the research in the social sciences is done with tools like Stata. This should worry you.

If you want to learn more about refactoring data code, here are some good resources: