Silly micro-optimization: branching vs. dynamic dispatch

Once upon a time, I wrote a PHP script to export some data as CSV. If memory serves, it looked something like this:

<?php

class ExportController
{
    private $fooRepo;

    public function __construct($fooRepo)
    {
        $this->fooRepo = $fooRepo;
    }

    public function get()
    {
        $data = $this->fooRepo->getRecords();
        echo "Time,Column 1,Column 2,Column 3\n";
        foreach ($data as $record)
        {
            printf("%d,%s,%s,%s\n",
                   $record->time, $record->a, $record->b, $record->c);
        }
    }
}

Later on, I needed to add an output variant: instead of simply blasting out the literal time value from the record (a Unix timestamp), I needed to format it in a specific way. So I did it the simple way, and bodged it in with a flag and a condition:

<?php

class ExportController
{
    /* Constants for the different output formats. */
    /* Output record times as Unix timestamps. */
    const FORMAT_TIME_UNIX = 'unix';
    /* Output record times as human-readable, multi-column times. */
    const FORMAT_TIME_HUMAN = 'human';
    /* Default output format is Unix timestamps. */
    private $output_format = self::FORMAT_TIME_UNIX;
    private $fooRepo;

    public function __construct($fooRepo)
    {
        $this->fooRepo = $fooRepo;
    }

    public function get()
    {
        $data = $this->fooRepo->getRecords();

        /* Produce the format-dependent CSV header. */
        if ($this->output_format === self::FORMAT_TIME_HUMAN)
        {
            echo 'Year,Month,Day,Hour,Minute,Second';
        }
        else
        {
            echo 'Time';
        }
        echo ",Column 1,Column 2,Column 3\n";

        foreach ($data as $record)
        {
            /* Produce the format-dependent timestamp. */
            if ($this->output_format === self::FORMAT_TIME_HUMAN)
            {
                echo (new \DateTime('@' . $record->time))->format('Y,m,d,H,i,s');
            }
            else
            {
                echo $record->time;
            }

            printf(",%s,%s,%s\n", $record->a, $record->b, $record->c);
        }
    }
}

Later still, while refactoring another part of this service, I rearranged this code to use a more OOP approach: a formatter interface, and two different implementations. One of those implementations is constructed before output begins, and all formatting operations are delegated to that instance:

<?php

class ExportController
{
    /* Constants for the different output formats. */
    /* Output record times as Unix timestamps. */
    const FORMAT_TIME_UNIX = 'unix';
    /* Output record times as human-readable, multi-column times. */
    const FORMAT_TIME_HUMAN = 'human';
    /* Default output format is Unix timestamps. */
    private $output_format = self::FORMAT_TIME_UNIX;
    private $fooRepo;

    public function __construct($fooRepo)
    {
        $this->fooRepo = $fooRepo;
    }

    public function get()
    {
        $data = $this->fooRepo->getRecords();

        /* Pick the formatter based on the output format. */
        if ($this->output_format === self::FORMAT_TIME_HUMAN)
        {
            $output_formatter = new HumanTimeRecordFormatter();
        }
        else
        {
            $output_formatter = new UnixTimestampRecordFormatter();
        }
        /* Delegate output generation to the formatter, and just print whatever
         * it produces. */
        echo $output_formatter->header();
        foreach ($data as $record)
        {
            echo $output_formatter->format($record);
        }
    }
}

interface RecordFormatter
{
    /* Shared between both formatter implementations. */
    const COLUMNS = ",Column 1,Column 2,Column 3\n";

    /* The header is constant per output format, so the method doesn't need any
     * parameters. */
    public function header();
    public function format($record);
}

class UnixTimestampRecordFormatter implements RecordFormatter 
{
    public function header()
    {
        return 'Time' . RecordFormatter::COLUMNS;
    }

    public function format($record)
    {
        return sprintf("%d,%s,%s,%s\n",
                       $record->time, $record->a, $record->b, $record->c);
    }
}

class HumanTimeRecordFormatter implements RecordFormatter 
{
    public function header()
    {
        return 'Year,Month,Day,Hour,Minute,Second' . RecordFormatter::COLUMNS;
    }

    public function format($record)
    {
        return sprintf("%s,%s,%s,%s\n",
                       (new \DateTime('@' . $record->time))->format('Y,m,d,H,i,s'),
                       $record->a, $record->b, $record->c);
    }
}

I did this for legibility reasons, not performance. But the idea that eliminating a branch might have helped performance did cross my mind. So that’s the idea of this post: to understand the performance impact of replacing a (constant-per-request) branch with a call through a function pointer.
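
The branch condition is constant per request because the output format is fixed before get() ever runs. A rough sketch of what that wiring might look like – the setFormat() method and the 'format' query parameter are hypothetical, purely for illustration:

<?php

/* Hypothetical wiring: choose the output format once, up front, then run the
 * export. Neither setFormat() nor the 'format' parameter exists in the code
 * above; they stand in for whatever the real request plumbing does. */
$controller = new ExportController($fooRepo);
if (($_GET['format'] ?? 'unix') === 'human')
{
    $controller->setFormat(ExportController::FORMAT_TIME_HUMAN);
}
$controller->get();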

First, let’s set the ground rules. I’m only interested in the relative difference in the compute time cost of invoking either of the two formatting operations. This refactoring does make some other performance-impacting choices, like returning strings instead of printing them directly. In PHP, which is a garbage-collected language, this might increase the amount of garbage generated. I don’t know, and I’m not interested in that detail; maybe another day. Same story for the date and time formatting: generating the human-readable timestamp is going to be way more costly than the simple integer-to-string conversion required for the timestamp output. I am, however, interested in a cross-language comparison: how do these patterns weigh against each other in C? In Java? Python?

In short, then, I’m interested in the time difference between:

for i in range(0, BIG_NUMBER):
  if flag:
    op2()
  else:
    op1()

And:

op = op2 if flag else op1
for i in range(0, BIG_NUMBER):
  op()

I’ll provide (identical) implementations of op1 and op2 so that they do some menial, branchless work. I don’t want them to do anything which is likely to take a variable amount of time, but I don’t want them to be no-ops, either, lest they get compiled away or inlined. Simply generating a random number should be sufficient, I think.

Let’s take some bets. Regardless of whether the language sports JIT compilation, the first option should be a best-case for branch prediction. The branch condition is effectively constant, at least for any given instance of the controller. If androids dream of anything other than electric sheep then they probably dream of keeping their processor pipelines full, and this should certainly do that.

But the second case is fully branchless (at least it is as far as my own, high-level code is concerned – the runtime may have different ideas). The cost involved there is the cost of indirection: that terrible, horrible, no good, very bad dynamic dispatch which has caused C++ developers to shake their fists at us flippant whippersnappers for decades.

My bet, personally, is that the second option is going to edge out the first, but only narrowly. Probably less than a percent difference.

Let’s get into it.

All tests are best-of-three and were run on an Ubuntu 18.04 VM with 4 cores of a Ryzen 3700X and 8 GB RAM.

Edit, 2020-08-05 18:39 UTC: Note that the results are given in operations per millisecond, not elapsed time. A higher number is better.
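
For a sense of scale, 100,000 operations/ms works out to 10 ns per iteration (1,000,000 ns in a millisecond, divided by 100,000 operations), and that 10 ns covers the loop bookkeeping, the dispatch under test, and the random number generation itself.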

C

Branching:

#define _POSIX_C_SOURCE 199309L

#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TEST_ITERATIONS 100000000

typedef enum Op
{
    OP1,
    OP2,
    OP_COUNT
} Op;

void __attribute__ ((noinline)) op1()
{
    asm ("");
    rand();
}

void __attribute__ ((noinline)) op2() 
{
    asm ("");
    rand();
}

int main(int argc, char *argv[]) 
{
    if (argc <  2)
    {
        fprintf(stderr, "usage: %s op1|op2\n", argv[0]);
        return 1;
    }

    Op which_op;
    if (strcmp(argv[1], "op2") == 0)
    {
        which_op = OP2;
    }
    else
    {
        which_op = OP1;
    }

    srand(time(NULL));

    struct timespec start, finish;
    if (clock_gettime(CLOCK_MONOTONIC, &start) < 0)
    {
        fprintf(stderr, "error getting clock: %s\n", strerror(errno));
        return 1;
    }

    for (int_fast32_t i = 0; i < TEST_ITERATIONS; ++i)
    {
        if (which_op == OP2)
        {
            op2();
        }
        else
        {
            op1();
        }
    }

    if (clock_gettime(CLOCK_MONOTONIC, &finish) < 0)
    {
        fprintf(stderr, "error getting clock: %s\n", strerror(errno));
        return 1;
    }

    int64_t elapsed = (finish.tv_sec - start.tv_sec) * 1000000000L
                     + (finish.tv_nsec - start.tv_nsec);
    printf("Performed %i operations in %" PRIi64 " ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           elapsed,
           (TEST_ITERATIONS / (double) elapsed) * 1000000);

    return 0;
}

Dynamic dispatch (function pointers):

#define _POSIX_C_SOURCE 199309L

#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TEST_ITERATIONS 100000000

typedef void(*Op)();

void __attribute__ ((noinline)) op1()
{
    asm ("");
    rand();
}

void __attribute__ ((noinline)) op2() 
{
    asm ("");
    rand();
}

int main(int argc, char *argv[]) 
{
    if (argc <  2)
    {
        fprintf(stderr, "usage: %s op1|op2\n", argv[0]);
        return 1;
    }

    Op which_op;
    if (strcmp(argv[1], "op2") == 0)
    {
        which_op = op2;
    }
    else
    {
        which_op = op1;
    }

    srand(time(NULL));

    struct timespec start, finish;
    if (clock_gettime(CLOCK_MONOTONIC, &start) < 0)
    {
        fprintf(stderr, "error getting clock: %s\n", strerror(errno));
        return 1;
    }

    for (int_fast32_t i = 0; i < TEST_ITERATIONS; ++i)
    {
        which_op();
    }

    if (clock_gettime(CLOCK_MONOTONIC, &finish) < 0)
    {
        fprintf(stderr, "error getting clock: %s\n", strerror(errno));
        return 1;
    }

    int64_t elapsed = (finish.tv_sec - start.tv_sec) * 1000000000L
                     + (finish.tv_nsec - start.tv_nsec);
    printf("Performed %i operations in %" PRIi64 " ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           elapsed,
           (TEST_ITERATIONS / (double) elapsed) * 1000000);

    return 0;
}

Both were compiled with the Ubuntu packaging of GCC 7.5.0, and with the flags -Wall -Wextra -pedantic -std=gnu99. Per Compiler Explorer, enabling any optimization level normally caused the function calls to be optimized out (not just inlined – the rand() call gets optimized away entirely), so I originally left optimization off. I’ve since annotated the functions so that they can’t be inlined (thanks, Ameisen on reddit), which is what makes the -O2 and -Os rows below meaningful. Keep in mind that I’m most interested in relative comparisons within the same language, so these numbers should still be reasonably comparable.

Version                   op1 (ops/ms)         op2 (ops/ms)
branching                 132,747.0            132,903.7
function pointer          141,583.4 (+6.7%)    142,129.2 (+6.9%)
branching (-O2)           157,462.4            158,095.2
function pointer (-O2)    158,309.8 (+0.5%)    158,369.1 (+0.2%)
branching (-Os)           152,897.6            146,882.1
function pointer (-Os)    158,408.8 (+3.6%)    152,162.8 (+3.5%)

While I’m not surprised that there’s a very slight (and consistent) difference between op1 and op2 for the branching version, the function pointer version should be identical. Comparing between those two measurements ought to give some idea of the noise level of the samples.

Java

Branching:

import java.util.Random;

public class Branching
{
    private static final int TEST_ITERATIONS = 100000000;
    private final Random random = new Random();

    private static enum Op
    {
        OP1,
        OP2
    }

    private void op1()
    {
        this.random.nextInt();
    }

    private void op2()
    {
        this.random.nextInt();
    }

    public void run(String[] args)
    {
        if (args.length < 1)
        {
            System.err.printf("usage: java Branching op1|op2\n");
            System.exit(1);
        }

        Op whichOp;
        if (args[0].equals("op2"))
        {
            whichOp = Op.OP2;
        }
        else
        {
            whichOp = Op.OP1;
        }

        long start = System.nanoTime();
        for (int i = 0; i < TEST_ITERATIONS; ++i)
        {
            if (whichOp == Op.OP2)
            {
                op2();
            }
            else
            {
                op1();
            }
        }
        long finish = System.nanoTime();
        long elapsed = finish - start;
        System.out.printf("Performed %d operations in %d ns: %.1f "
                        + "operations/ms.\n",
                          TEST_ITERATIONS,
                          elapsed,
                          (TEST_ITERATIONS / (double) elapsed) * 1000000);
    }

    public static void main(String[] args)
    {
        new Branching().run(args);
    }
}

Dynamic dispatch (interfaces):

import java.util.Random;

public class DD
{
    private static final int TEST_ITERATIONS = 100000000;

    private static interface Op
    {
        void op();
    }

    private static class Op1 implements Op
    {
        private final Random random = new Random();

        public void op()
        {
            this.random.nextInt();
        }
    }

    private static class Op2 implements Op
    {
        private final Random random = new Random();

        public void op()
        {
            this.random.nextInt();
        }
    }

    public void run(String[] args)
    {
        if (args.length < 1)
        {
            System.err.printf("usage: java DD op1|op2\n");
            System.exit(1);
        }

        Op whichOp;
        if (args[0].equals("op2"))
        {
            whichOp = new Op2();
        }
        else
        {
            whichOp = new Op1();
        }

        long start = System.nanoTime();
        for (int i = 0; i < TEST_ITERATIONS; ++i)
        {
            whichOp.op();
        }
        long finish = System.nanoTime();
        long elapsed = finish - start;
        System.out.printf("Performed %d operations in %d ns: %.1f "
                        + "operations/ms.\n",
                          TEST_ITERATIONS,
                          elapsed,
                          (TEST_ITERATIONS / (double) elapsed) * 1000000);
    }

    public static void main(String[] args)
    {
        new DD().run(args);
    }
}

Both were compiled and run on OpenJDK 1.8.0_252. Note the minor hoop-jumping to ensure that both versions use a Random stored in an instance field. This is because calling nextInt() on a Random held in a static field (as might be the obvious approach for the branching version) seems to slow things down by a few k-ops/ms.

Version             op1 (ops/ms)         op2 (ops/ms)
branching           258,962.2            250,437.0
dynamic dispatch    274,835.7 (+6.1%)    261,796.9 (+4.5%)

As an aside, I’m astonished at how much faster Java generates random numbers than glibc. Even compiling the C versions with -O2 only brings them up to ~160k-ops/ms – about 60% of the Java versions. I think all four versions are relatively idiomatic. Big JIT energy, I guess. Something to delve into later.

Python

Branching:

#! /usr/bin/env python3

import random
import sys
import time

TEST_ITERATIONS = 10000000

def op1():
    random.random()

def op2():
    random.random()

def main(argv):
    if len(argv) < 2:
        print('usage: ./branching.py op1|op2', file=sys.stderr)
        return 1

    if argv[1] == 'op2':
        which_op = 2
    else:
        which_op = 1

    random.seed()

    start = time.time()
    for _ in range(0, TEST_ITERATIONS):
        if which_op == 2:
            op2()
        else:
            op1()
    finish = time.time()
    elapsed = finish - start
    print(f"Performed {TEST_ITERATIONS} operations in {elapsed:.3f} s: {TEST_ITERATIONS / elapsed / 1000:.1f} operations/ms.")

if __name__ == '__main__':
    sys.exit(main(sys.argv))

Dynamic dispatch (first-class functions):

#! /usr/bin/env python3

import random
import sys
import time

TEST_ITERATIONS = 10000000

def op1():
    random.random()

def op2():
    random.random()

def main(argv):
    if len(argv) < 2:
        print('usage: ./dd.py op1|op2', file=sys.stderr)
        return 1

    if argv[1] == 'op2':
        which_op = op2
    else:
        which_op = op1

    random.seed()

    start = time.time()
    for _ in range(0, TEST_ITERATIONS):
        which_op()
    finish = time.time()
    elapsed = finish - start
    print(f"Performed {TEST_ITERATIONS} operations in {elapsed:.3f} s: {TEST_ITERATIONS / elapsed / 1000:.1f} operations/ms.")

if __name__ == '__main__':
    sys.exit(main(sys.argv))

Python’s dynamic typing and first-class functions lend themselves well to this style of call. This is the first case we’ve had where the dynamic dispatch version is unambiguously simpler to implement than the branching one.

Both versions were run under Python 3.6.9.

Version             op1 (ops/ms)         op2 (ops/ms)
branching           10,875.2             10,390.2
dynamic dispatch    13,321.3 (+22.5%)    13,291.7 (+27.9%)

Dynamic dispatch is unambiguously faster, by a huge margin. In fact, this branching version is my second attempt. My initial naïve take branched on an enum, the same as in the C and Java versions (sketched below). However, just looking up the enum values on each loop iteration carried a massive performance hit. Regardless of whether I defined the enum values manually or with enum.auto(), the branching version barely cleared 6 k-ops/ms. This leads me to believe that the performance impact was caused by the field lookup itself, and not by the difference in comparing integers versus object identity. Maybe I’m being unfair to my own evaluation by making this optimization, but it also feels like it might be a bit incendiary to claim a ~120% efficiency improvement in Python by using dynamic dispatch over branching.
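
For reference, the enum-based attempt was shaped roughly like this – a sketch, not the exact script I ran (the run() helper is just for illustration):

import enum
import random

TEST_ITERATIONS = 10000000

class Op(enum.Enum):
    OP1 = enum.auto()
    OP2 = enum.auto()

def op1():
    random.random()

def op2():
    random.random()

def run(which_op):
    for _ in range(0, TEST_ITERATIONS):
        # Op.OP2 is an attribute lookup on the Op class on every single
        # iteration, which seems to be where the extra cost comes from.
        if which_op is Op.OP2:
            op2()
        else:
            op1()

run(Op.OP1)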

This also really brings into focus what people mean when they say that Python is a slow language. This is a full order of magnitude slower than what the straightforward, unoptimized C implementation accomplished. I wasn’t expecting that for something as simple as this. There was a really neat guessing game on this very point which did the rounds a couple months ago, but I guess I haven’t really internalized what it demonstrates.

PHP

Finally, we’re back where we started.

Branching:

#! /usr/bin/env php
<?php

define('TEST_ITERATIONS', 10000000);

define('OP1', 1);
define('OP2', 2);

function op1()
{
    rand();
}

function op2() 
{
    rand();
}

function main($argv)
{
    if (count($argv) < 2)
    {
        fprintf(STDERR, "usage: ./branching.php op1|op2\n");
        return 1;
    }

    if ($argv[1] === 'op2')
    {
        $which_op = OP2;
    }
    else
    {
        $which_op = OP1;
    }

    $start = hrtime(true);
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        if ($which_op === OP2)
        {
            op2();
        }
        else
        {
            op1();
        }
    }
    $finish = hrtime(true);
    $elapsed = $finish - $start;
    printf("Performed %i operations in %ld ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           $elapsed,
           (TEST_ITERATIONS / (double) $elapsed) * 1000000);

    return 0;
}

if (php_sapi_name() === 'cli')
{
    exit(main($argv));
}

Dynamic dispatch (callables/call_user_func()):

#! /usr/bin/env php
<?php

define('TEST_ITERATIONS', 10000000);

function op1()
{
    rand();
}

function op2() 
{
    rand();
}

function main($argv)
{
    if (count($argv) < 2)
    {
        fprintf(STDERR, "usage: ./dd.php op1|op2\n");
        return 1;
    }

    if ($argv[1] === 'op2')
    {
        $which_op = 'op2';
    }
    else
    {
        $which_op = 'op1';
    }

    $start = hrtime(true);
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        call_user_func($which_op);
    }
    $finish = hrtime(true);
    $elapsed = $finish - $start;
    printf("Performed %i operations in %ld ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           $elapsed,
           (TEST_ITERATIONS / (double) $elapsed) * 1000000);

    return 0;
}

if (php_sapi_name() === 'cli')
{
    exit(main($argv));
}

PHP does have first-class functions, but only anonymous functions/closures get that treatment. You can, however, invoke a regular named function dynamically by passing its name to call_user_func().

Ubuntu 18.04 packages PHP 7.2, but I wanted to use 7.3 or higher so that I could use hrtime. Consequently, I installed PHP 7.4 from Ondřej Surý’s PPA.

Version             op1 (ops/ms)         op2 (ops/ms)
branching           39,911.0             36,988.2
dynamic dispatch    26,283.4 (-34.1%)    26,383.9 (-28.7%)

Wow – that is an awful result! I bet the overhead is due to having to look up the function by name every time. (Like I said in the introduction, it’s only my code which is branchless – in this case, the runtime will be performing many branches indeed.) Variable functions are about another 1 k-ops/ms slower (sketched just below), which makes sense: in addition to looking up the function by name, the interpreter also needs to figure out what type the variable holds and how to invoke it. Let’s try again, but we’ll implement op1 and op2 as anonymous functions, as invocable classes, and as classes implementing an interface (à la Java).
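
Before getting to those, here’s what the variable-function version’s loop looks like, for reference – the rest of the script is identical to the call_user_func() one above:

    /* $which_op still holds the string 'op1' or 'op2'; a variable function
     * call makes PHP resolve that name and dispatch the call on every
     * iteration. */
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        $which_op();
    }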

Dynamic dispatch (anonymous functions):

#! /usr/bin/env php
<?php

define('TEST_ITERATIONS', 10000000);

$op1 = function ()
{
    rand();
};

$op2 = function () 
{
    rand();
};

function main($argv)
{
    global $op1, $op2;

    if (count($argv) < 2)
    {
        fprintf(STDERR, "usage: ./dd-anon.php op1|op2\n");
        return 1;
    }

    if ($argv[1] === 'op2')
    {
        $which_op = $op2;
    }
    else
    {
        $which_op = $op1;
    }

    $start = hrtime(true);
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        $which_op();
    }
    $finish = hrtime(true);
    $elapsed = $finish - $start;
    printf("Performed %i operations in %ld ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           $elapsed,
           (TEST_ITERATIONS / (double) $elapsed) * 1000000);

    return 0;
}

if (php_sapi_name() === 'cli')
{
    exit(main($argv));
}

Dynamic dispatch (invocable classes):

#! /usr/bin/env php
<?php

define('TEST_ITERATIONS', 10000000);

class Op1
{
    function __invoke()
    {
        rand();
    }
};

class Op2 
{
    function __invoke()
    {
        rand();
    }
};

function main($argv)
{
    if (count($argv) < 2)
    {
        fprintf(STDERR, "usage: ./dd-invoke.php op1|op2\n");
        return 1;
    }

    if ($argv[1] === 'op2')
    {
        $which_op = new Op2();
    }
    else
    {
        $which_op = new Op1();
    }

    $start = hrtime(true);
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        $which_op();
    }
    $finish = hrtime(true);
    $elapsed = $finish - $start;
    printf("Performed %i operations in %ld ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           $elapsed,
           (TEST_ITERATIONS / (double) $elapsed) * 1000000);

    return 0;
}

if (php_sapi_name() === 'cli')
{
    exit(main($argv));
}

Dynamic dispatch (interfaces):

#! /usr/bin/env php
<?php

define('TEST_ITERATIONS', 10000000);

interface Op
{
    function op();
}

class Op1 implements Op
{
    function op()
    {
        rand();
    }
};

class Op2 implements Op 
{
    function op()
    {
        rand();
    }
};

function main($argv)
{
    if (count($argv) < 2)
    {
        fprintf(STDERR, "usage: ./dd-interface.php op1|op2\n");
        return 1;
    }

    if ($argv[1] === 'op2')
    {
        $which_op = new Op2();
    }
    else
    {
        $which_op = new Op1();
    }

    $start = hrtime(true);
    for ($i = 0; $i < TEST_ITERATIONS; ++$i)
    {
        $which_op->op();
    }
    $finish = hrtime(true);
    $elapsed = $finish - $start;
    printf("Performed %i operations in %ld ns: %0.1lf operations/ms.\n",
           TEST_ITERATIONS,
           $elapsed,
           (TEST_ITERATIONS / (double) $elapsed) * 1000000);

    return 0;
}

if (php_sapi_name() === 'cli')
{
    exit(main($argv));
}

Version                                   op1 (ops/ms)         op2 (ops/ms)
branching                                 39,911.0             36,988.2
dynamic dispatch (call_user_func())       26,283.4 (-34.1%)    26,383.9 (-28.7%)
dynamic dispatch (anonymous functions)    35,520.7 (-11.0%)    35,405.3 (-4.2%)
dynamic dispatch (invocable classes)      34,471.4 (-13.6%)    33,950.5 (-8.2%)
dynamic dispatch (interfaces)             45,319.1 (+13.6%)    45,310.2 (+22.5%)

Wild. PHP has always been a little notorious for its performance pitfalls, but I would never have guessed that the happy path for performance in this respect was so narrow – nor that the difference would be this pronounced. I’m pretty pleased to find that the original code which brought me down this rabbit hole uses the best-performing option here (dispatch through an interface), but that’s only by fluke.

Summary

In all cases, the performance gap was wider than I expected. I’ve always heard that branching is slow, but I’ve also always been told the same of indirection, and I didn’t have much gut feel for the relative difference. I wouldn’t put too much stock in any of my absolute numbers; performance of each version in each language varied by as much as 1-2 k-ops/ms every time I ran them (except for the PHP ones, which were remarkably stable by comparison, and usually varied by under 1 k-ops/ms). Still, the difference is measurable, and in all cases well outside of the parts-per-thousand range I’d expected to find.

Thanks for reading along. See you next time, and remember to benchmark your assumptions!

Comments

  1. if ($this->output_format === self::FORMAT_TIME_HUMAN)
         foreach …
             print fmt 1
     else
         foreach …
             print fmt 2

    • Yeah, certainly valid where you can invert the program structure like that. In the case of my original project, I did do it that way for a little while, but there was per-record processing involved in the inner loop, and I felt like it reduced readability to have two all-but-identical copies of the for-loops.
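
      For concreteness, that inverted structure would look roughly like this against the export example (a sketch, not the actual service code):

      if ($this->output_format === self::FORMAT_TIME_HUMAN)
      {
          echo "Year,Month,Day,Hour,Minute,Second,Column 1,Column 2,Column 3\n";
          foreach ($data as $record)
          {
              /* Any per-record processing has to be duplicated here... */
              printf("%s,%s,%s,%s\n",
                     (new \DateTime('@' . $record->time))->format('Y,m,d,H,i,s'),
                     $record->a, $record->b, $record->c);
          }
      }
      else
      {
          echo "Time,Column 1,Column 2,Column 3\n";
          foreach ($data as $record)
          {
              /* ...and here, which is what hurt readability. */
              printf("%d,%s,%s,%s\n",
                     $record->time, $record->a, $record->b, $record->c);
          }
      }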