Refactoring outside-in

Robert Fink
Published in Helsing Blog
8 min read · Oct 17, 2023


Refactoring is “a controlled technique for improving the design of an existing code base” [Fowler]. Since much has been written about code smells and refactoring techniques in the micro (eg, books and entire websites), I want to zoom out and discuss how and in what order one should apply those techniques. Specifically, I will argue that refactoring is best done outside-in, that is, starting with the external API boundary and then working our way inwards towards classes, methods, algorithms, types, tests, or variable names.

The code examples in this blog post use Rust, but the outside-in technique of course also applies to other programming languages. As usual, a strong type system helps with refactoring — hence our choice of Rust.

Outside-in. Source: Wikimedia [CC BY 2.0].

Running example

Inspired by Stefan’s Refactoring Rust workshop at EuroRust 2023, let’s assume we have in our code base a very convoluted parse_magic_key function that iterates over files in a directory, finds all JSON files, parses the files for particular keys, and returns all corresponding values. Maybe this function was at some point written for a prototype or during some hackweek, and maybe it has since grown to hundreds of lines; here’s an abridged version of what this wild function might look like:

use std::fs;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Parses and returns the `magic_key` fields of all JSON files in the given directory.
pub fn parse_magic_key(path: &str) -> Vec<String> {
    let mut results = Vec::new();
    for path in fs::read_dir(path).unwrap() {
        let path = path.unwrap().path();
        if path.extension().unwrap() == "json" {
            // Here, you'd find hundreds of lines of convoluted/ugly/brittle parsing logic,
            // simplified here into just a few lines.
            let file = BufReader::new(File::open(path).unwrap());
            for line in file.lines() {
                let line = line.unwrap();
                let parts: Vec<&str> = line.split(":").collect();
                if parts.len() == 2 && parts.get(0).unwrap().contains("magic_key") {
                    results.push((*parts.get(1).unwrap()).trim().to_string());
                }
            }
        }
    }

    results
}

#[cfg(test)]
mod tests {
    use std::fs;
    use tempdir::TempDir;
    use crate::parser::parse_magic_key;

    #[test]
    fn can_parse_dir() {
        let tmp_dir = TempDir::new("").unwrap();
        let dir = tmp_dir.as_ref();
        fs::write(dir.join("positive.json"), r#"
        {
            "magic_key": "1"
        }"#).unwrap();
        fs::write(dir.join("negative.json"), r#"
        {
            "foo": "2"
        }"#).unwrap();
        fs::write(dir.join("negative.not-json"), r#"
        {
            "magic_key": "3"
        }"#).unwrap();

        assert_eq!(parse_magic_key(dir.to_str().unwrap()), vec![r#""1""#]);
    }
}

There are of course more things wrong with this code than right. For example:

  • Naïve/buggy/brittle parsing logic
  • No reasonable error handling
  • Hard to understand what the expected behaviour is
  • Entangled directory traversal and parsing logic
  • Hard-coded and hard-to-extend JSON format assumption
  • Hard to extend
  • Inflexible and unclear API (eg, what is path: &str exactly?)
  • O(|result|) space complexity

Say we want to refactor this code. The natural first question is: where do we start? Do I start with the small and obvious things, for example finding better variable and method names and splitting the method into multiple smaller methods? Do I start by fixing the parsing logic, maybe by using serde_json instead of the broken hand-jammed parser? Or do I start with the API, for example the types or the method signature?

If we squint a little, we can identify two directions for our refactoring effort: inside-out and outside-in.

Inside-out or outside-in?

Inside-out refactoring starts with the implementation details: we would start by fixing the parsing logic (and adding tests) and finding better variable names. Next, maybe we’ll add support for formats other than JSON. And finally, when we’re happy with internals, we fix the API.

Hang on, why do we need to fix the API? And how do we know how to fix it? Well, any reasonable refactoring effort, I argue, must start with a hypothesis and design goals: we are refactoring the code because ____. NB, “because the code base is ugly” is almost never a useful goal. More likely, we are refactoring the code base because its design or implementation details are no longer compatible with our requirements and use-cases. For example:

  • We need a more flexible API in order to reuse the module in a different context or application.
    In our code snippet: because we pass in the directory Path, it is not possible to parse only some of the JSON files; it's all or nothing.
  • The design imposes performance restrictions that are no longer feasible.
    In our code snippet: since the entire result (Vec<String>) must fit into main memory, we cannot parse large volumes of data.

Often, such changes in requirements translate directly to changes in the API of the module or function. This gives rise to the idea of outside-in refactoring: we start by changing the API and then work inwards until we are happy with the implementation. This idea has previously been described — albeit with more enterprise-y vocabulary and scope — as the strangler fig pattern.

Outside-in refactoring

Step 0: Understand the old API and implementation

Unless you can afford the liberty of greenfield development, you will have to support all the intended and often even the unintended behaviour of the old API and implementation. Then — keeping Chesterton’s fence in mind — enumerate the aspects of the old API which you can truly do without. By identifying constraints or call patterns that are no longer used and needed, we can frequently produce significantly simpler systems.

Step 1: Design the new API

Once we’ve analysed the old API and understand what we need to keep, what we can throw out, and what we need to add or modify, it’s time to write down the new API. In our example, the new API might look like this:

pub fn parse(files: impl IntoIterator<Item = PathBuf>) -> impl Iterator<Item = String> {
    todo!()
}

By changing the input parameter from a directory path to an iterator over input files, the caller can now control which files should be parsed. Second, the function returns an iterator over the results; this makes it easier for the caller to consume results in a streaming fashion with bounded memory. Of course this is not the right API design for all use-cases, but let’s assume for the purpose of this exercise that this is what we need. (I’ve glossed over error handling to keep the example manageable.)
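
To make the new contract concrete, here is a minimal caller-side sketch; the print_magic_keys helper and its file list are hypothetical, not part of the refactoring itself. The caller decides exactly which files are parsed, and results are consumed lazily, one at a time:

use std::path::PathBuf;

// Hypothetical caller: selects the files itself and handles each result as it
// is produced, without buffering the full result set.
fn print_magic_keys(files: Vec<PathBuf>) {
    for value in parse(files) {
        println!("{value}");
    }
}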

Step 2: Implement the new API using the old code

The next step is to implement the new API using the old code with as few changes as possible. This is important; the goal of this step is not to write a perfect new implementation, the goal is to reuse the existing implementation and adapt it to the new API. In our example, this adaptation could look as follows:

use std::fs;
use std::fs::File;
use std::io::{BufRead, BufReader, Read};
use std::path::PathBuf;

/// Parses and returns the `magic_key` fields of the given JSON files.
/// TODO: error handling
pub fn parse(files: impl IntoIterator<Item = PathBuf>) -> impl Iterator<Item = String> {
    // TODO: Parse incrementally rather than collecting all results into a Vec
    let mut results = vec![];
    for path in files {
        if let Ok(file) = File::open(&path) {
            let extension = path
                .extension()
                .and_then(|p| p.to_str())
                .expect("Failed to determine file extension");
            let file_results = match extension {
                "json" => parse_json(file),
                // TODO: support YAML
                // TODO: support XML
                _ => {
                    println!("Unknown extension: {extension}");
                    vec![]
                }
            };

            results.extend(file_results);
        }
    }

    results.into_iter()
}

fn parse_json(read: impl Read) -> Vec<String> {
    let reader = BufReader::new(read);
    let mut results = Vec::new();
    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split(":").collect();
        // TODO: use streaming JSON parser
        // TODO: make parsing logic less brittle
        if parts.len() == 2 && parts.get(0).unwrap().contains("magic_key") {
            results.push((*parts.get(1).unwrap()).trim().to_string());
        }
    }

    results
}

/// Parses and returns the `magic_key` fields of all JSON files in the given directory.
pub fn parse_magic_key(path: &str) -> Vec<String> {
    let mut results = Vec::new();
    for path in fs::read_dir(path).unwrap() {
        let path = path.unwrap().path();
        if path.extension().unwrap() == "json" {
            let file = File::open(path).unwrap();
            results.append(&mut parse_json(file));
        }
    }

    results
}

We have done the following:

  • Extract the parsing code into a new, private parse_json method
  • Rewrite the original parse_magic_key method using the extracted parse_json method
  • Implement the new parse method using the extracted parse_json method
  • Add TODOs wherever we took shortcuts and noticed things we need to refactor or optimise later

Note that the API, behaviour, and tests of the original parse_magic_key method are unchanged.

Step 3: Write tests for the new API

Next, we can start writing a few tests for the new API, for example:

#[test]
fn can_parse_json() {
    assert_eq!(parse_json(r#"
    {
        "magic_key": "1"
    }"#.as_bytes()), vec![r#""1""#]);

    assert!(parse_json(r#"
    {
        "foo": "1"
    }"#.as_bytes()).is_empty());
}

#[test]
fn can_parse_files() {
    let tmp_dir = TempDir::new("").unwrap();
    let dir = tmp_dir.as_ref();
    fs::write(dir.join("positive.json"), r#"
    {
        "magic_key": "1"
    }"#).unwrap();
    fs::write(dir.join("negative.json"), r#"
    {
        "foo": "2"
    }"#).unwrap();
    fs::write(dir.join("negative.not-json"), r#"
    {
        "magic_key": "3"
    }"#).unwrap();

    let files = vec![
        dir.join("positive.json"),
        dir.join("negative.json"),
        dir.join("negative.not-json"),
    ];
    let results: Vec<String> = parse(files).collect();
    assert_eq!(results, vec![r#""1""#]);
}

Step 4: Fix the internals

Finally, we can fix all the TODOs we have collected along the way (eg, ensure the parsing logic is correct, add support for other file types). I want to call out in particular that the initial implementation in Step 2 still has the same memory requirements as the original implementation; however, the key point is that the API allows us to make it more efficient later.
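
To illustrate the memory point, here is a minimal sketch of where Step 4 might land; this is an assumption on my part, not the post’s final code. flat_map chains lazy per-file iterators, so each result is produced on demand instead of being collected into a Vec first:

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::path::PathBuf;

pub fn parse(files: impl IntoIterator<Item = PathBuf>) -> impl Iterator<Item = String> {
    files
        .into_iter()
        // Keep the original behaviour of only looking at .json files.
        .filter(|path| path.extension().map_or(false, |ext| ext == "json"))
        .flat_map(|path| {
            // Per-file line iterator; unreadable files or lines simply yield nothing.
            let lines = File::open(&path)
                .map(|file| BufReader::new(file).lines())
                .into_iter()
                .flatten()
                .flatten();
            // Same brittle line-based matching as before, now evaluated lazily.
            lines.filter_map(|line| {
                let parts: Vec<&str> = line.split(':').collect();
                (parts.len() == 2 && parts[0].contains("magic_key"))
                    .then(|| parts[1].trim().to_string())
            })
        })
}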

Note that steps 1–2–3–4 can be done in independent commits or even merge requests. Most of the TODOs in Step 4 can be done in parallel. They can be independently reviewed and our build is green after every step. All previous invocations of the parse methods still work. It’s a very adiabatic way of changing code.

Once you’re done refactoring, you can migrate all API consumers and eventually delete the old API and its implementation.
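
The post doesn’t prescribe a migration mechanism, but one common option in Rust is to deprecate the old entry point and delegate it to the new API, so the two cannot drift apart while consumers move over:

use std::fs;

#[deprecated(note = "use `parse` with an explicit file list instead")]
pub fn parse_magic_key(path: &str) -> Vec<String> {
    // Reproduce the old all-or-nothing behaviour on top of the new API.
    let files = fs::read_dir(path)
        .unwrap()
        .map(|entry| entry.unwrap().path())
        .filter(|p| p.extension().map_or(false, |ext| ext == "json"));
    parse(files).collect()
}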

But why

We have seen how to refactor outside-in, but haven’t discussed why. There is an “internal” and an “external” reason why outside-in refactoring usually beats inside-out refactoring, in my experience.

“Internally”, the new API that we design in Step 1 sets the goal posts for the rest of the refactoring work in the sense that we can be reasonably confident that we’re done refactoring when we have found a sensible implementation for the API. After all, the new API should already encode all existing plus the anticipated new usage patterns (eg, incremental parsing) and constraints (eg, bounded memory) for our library or application.

For example, since we’re asked to produce an impl Iterator<Item=String> rather than a Vec<String>, we can target an implementation that uses bounded memory. This will guide us towards a streaming implementation or an LL parser that mostly reads forwards. In addition, we still have the tests (and maybe benchmarks) for the “old” behaviour that we can exercise in order to avoid regressions.
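
For instance, a consumer can now stream results straight into a sink without ever materialising them; this dump_magic_keys helper is a hypothetical usage sketch, not part of the post’s code:

use std::io::{BufWriter, Write};
use std::path::PathBuf;

// Memory stays bounded no matter how many results the files yield, because
// each String is written out and dropped before the next one is produced.
fn dump_magic_keys(files: Vec<PathBuf>, sink: impl Write) -> std::io::Result<()> {
    let mut out = BufWriter::new(sink);
    for value in parse(files) {
        writeln!(out, "{value}")?;
    }
    out.flush()
}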

The “external” aspect is even more interesting. By writing down the new API first, we can (partially) decouple the refactoring work from the progress of developers who want to use the new API. As soon as the new API is merged, other team members can start playing with the new API. In the best case, they can make independent progress on their part of the code base (using the new API); more often they will have feedback on the API because something isn’t quite right. (NB: That’s agile without an agile coach.)

Conversely, you have almost certainly witnessed the problems of the inside-out approach: “The new API isn’t quite ready yet, because we are first refactoring the internals of the algorithm and the database layout. We’ll let you know when this is done, probably in a week or two [Sure!].”

Conclusion

I hope I have convinced you to start your next refactoring endeavour outside-in, by writing the API first and dealing with the “rest” afterwards. As always in software engineering, this method is not a panacea, but I encourage you to give it a go.

Author: Robert Fink
