Abstractions do not only cost resources (RAM, ROM, runtime); they can also make expressing a solution more complicated than necessary. In this blog post, we will look at the seemingly trivial problem of averaging two unsigned integers as an example. We will examine solutions in C and C++, before looking at actual hardware features that help us express the solution much more directly. This will highlight the importance of understanding the limitations of programming languages and abstractions in general, as well as the benefit of digging deeper than the language sometimes allows.
The problem formulation is simple: average two unsigned integers in a programming language with fixed-width integers, like C or C++. We do not care about rounding, so if we want the average of 25 and 12, both 18 and 19 are fine as answers.
Why is this seemingly trivial problem relevant?
All kinds of algorithms need to average integers, for example binary search.
Of course, you could just use C++20’s std::midpoint, or any big integer library of your choice.¹ This blog post looks at the task from a different perspective: how a higher abstraction level sometimes hides features available in the levels below.
This is a typical computer science exercise: it’s about the boundary cases. Let’s say we have a processor based on a 32-bit ARM architecture (ARM32), i.e. 32-bit native integers. What happens if we try to find the average of 2^31 − 1 and 2^31 + 1? The mathematical result is 2^31. But blindly adding these numbers will overflow, and in unsigned arithmetic we’ll get the sum modulo 2^32. That value is of no particular use to us without post-processing. So let’s do it right from the start and avoid overflows altogether.
If you search for solutions to this problem online, you will find many, ranging from dividing first, then adding and fixing up the result, e.g. (a / 2) + (b / 2) + (a & b & 1), to calculating the sign and magnitude of the distance from a to b, as libstdc++’s std::midpoint does.
But I want to draw your attention to one particular solution.
The solution proposed in the binary search blog post mentioned above assumes that you only have positive numbers in a signed integer. Then you can simply do the straightforward thing: ((uint32_t) a + (uint32_t) b) / 2.
Why does this work?
Because a and b have at most 31 bits each (positive numbers inside a signed integer type). Adding them as unsigned integers gives us a (31+1)-bit number, i.e. one that fits into 32 bits.
Can we extend this idea to the full 32-bit range? Yes, we can – at least in assembly. A 32-bit addition in hardware actually generates a 33-bit result. The extra bit (bit 32, counting from zero) is called the “carry bit”. This bit signals unsigned overflow, and it can also be used to perform wider arithmetic, here 33-bit. Many instruction sets have operations to shift the carry bit into a register. Considering that a right shift equals a division by two, this is the perfect instruction for our problem.
The following is a solution for our ARM32 processor (uint32_t inputs in r0 and r1, result in r0; the r2 register is used for the lower 32 bits of the intermediary result):
; carry:r2 := r0 + r1 (33-bit result, bit 32 in “carry”)
add r2, r0, r1
; r0 := (carry << 31) | (r2 >> 1)
rrx r0, r2
We add both numbers (add), which implicitly sets certain status flags inside the processor, including the carry bit. Then we shift right by 1, which is an unsigned division by 2, while putting the carry into the most significant bit (rrx; cf. “Rotate Right with Extend”, § F5.1.165 in the Arm® Architecture Reference Manual for A-profile architecture).
That’s literally the problem formulation: Add and divide.
And in contrast to the 31-bit solution from above, it works with the full 32-bit number range.
The same idea works on various architectures, including x86-64.
C/C++ compilers like GCC and Clang are well known for pattern-matching functionality that optimizes complex expressions, or even several statements, into more concise machine code. A prototypical example is the recognition of popcount.
Can a clever C/C++ compiler save our integer-average day?
No, unfortunately it cannot.
If we add two signed integers and the result overflows, our program is not well-defined anymore.
If we add two unsigned integers and the result overflows, the compiler must assume we meant to have the result modulo 2^32 (in the case of 32-bit unsigned integers), because unsigned arithmetic has a wide contract in C and C++: wraparound is defined behavior.
Abstractions are great.
They let us express concepts without referring to implementation-level details like registers (r0, r1, r2).
But we know that our machine uses two’s complement, its adders generate a 33-bit result, and it can shift in the carry flag.
Why can we not make use of it?
Because all this information gets lost when abstracting away the hardware in high-level languages.
GCC and Clang do their best to hide pain points by recognizing algorithmic formulations that have a simple counterpart in assembly (like popcount).
But they can only do this under certain circumstances.
In this blog post we saw a case they cannot fix for us: the language specification simply demands otherwise.
In fact, in our example, a high-level language makes our program less concise.
There are many more programs that could benefit from expressing parts of them in assembly, because the hardware has specific support for them.
Some examples for ARM32: cyclic redundancy checks, converting byte orders (reversing the bytes in a word), dealing with bit-fields, or saturating integers (clamping them to the range −2^N to 2^N − 1).
This phenomenon is not limited to the abstractions that programming languages do over assembly.
The point of abstractions is to detach from the actual implementation.
When you design an abstraction you need to consider what expressiveness you take away from the user with this detachment.
One last example concerns fixed-point arithmetic. (For example quantized neural networks make heavy use of fixed-point arithmetic.) We’ll dive into this next time.
¹ External libraries (even the standard library) come at a cost: selecting the right library, integrating it into your project, getting it safety-approved, watching it for critical bug fixes over the lifetime of your product, etc.