POSIX & C Types and Format Strings

ljrk

2021-08-02

When you compile code such as:

#include <stddef.h>
#include <stdio.h>

int main(void)
{
    size_t sz = 0;
    printf("%lu\n", sz);
}

your compiler may or may not warn about the format string being wrong. Depending on your system, chances are your compiler wants you to use a “%llu” format string instead, since size_t is an unsigned long long there. On your off-the-shelf AMD64 Linux, it’s just an unsigned long. Why is that, and how can we work around this?

C Integral Types & Typedefs

The integral types of C are

char, short, int, long, long long,

as well as their unsigned variants and _Bool (as long as I didn’t miss any). The signed ones have the following format strings:

%hhd, %hd, %d, %ld, %lld

Everything else is a typedef (or a non-integral type), that is, there exists some header with:

typedef long off_t;

and another with

typedef unsigned long size_t;

etc., for beauty’s sake. These typedefs aren’t part of the core language of C, but some of them, like size_t, are part of the C standard, which encompasses the C standard library (with printf, fread, …) as well. Other typedefs such as off_t aren’t part of C, but of the POSIX interface [1].

The C library function printf recognizes additional format strings for those types, such as

%zu (size_t), %p (void*), %s (char*), ...

However, the exact type behind size_t (e.g., unsigned long or unsigned long long) is implementation-defined. The standard simply demands that it be an unsigned type big enough to hold the size of any object. Given different hardware and operating systems, this may vary quite a bit. Why should a size_t be 64 bit on a small 16 bit CPU?

Depending on the platform, we have different typedefs for size_t. This means that on AMD64 Linux %zu is in fact equivalent to %lu, and the compiler won’t warn you if you use the latter. However, as soon as you switch to a platform where size_t is a different type, that code has a mismatched format string, which is undefined behavior. That’s why we use “%zu” instead: it matches size_t on every platform.

However, some types, such as off_t, aren’t part of the C standard, and so there is no dedicated format string for them. This means we need to find out how big off_t is in order to print it (POSIX at least requires it to be a signed integer type, so that much is fixed).

The Programming Environment: getconf

To query your system, you can ask the getconf(1) utility to return the available programming environments for a given configuration in a specific standard. The configurations of interest here are documented in c99(1):

$ cat test-off-t.sh
#!/bin/sh

for name in _POSIX_V7_ILP32_OFF32 \
            _POSIX_V7_ILP32_OFFBIG \
            _POSIX_V7_LP64_OFF64 \
            _POSIX_V7_LPBIG_OFFBIG; do 
    printf "%s: %s\n" "$name" "$(getconf -v POSIX.1-2017 $name)";
done
$ ./test-off-t.sh
_POSIX_V7_ILP32_OFF32: undefined
_POSIX_V7_ILP32_OFFBIG: undefined
_POSIX_V7_LP64_OFF64: 1
_POSIX_V7_LPBIG_OFFBIG: undefined
$

So, on my Linux there’s only one available configuration, and that is a 64 bit off_t. Suppose my system supported both a 32 bit off_t (OFF32) and a larger one (OFFBIG) in the ILP32 environment; then I could query the compiler flags needed to switch to one or the other using

$ getconf -v POSIX.1-2017 POSIX_V7_ILP32_OFF32_CFLAGS

and

$ getconf -v POSIX.1-2017 POSIX_V7_ILP32_OFFBIG_CFLAGS

There are not many systems that are that configurable out there, with Solaris probably being one of the few exceptions.

Compile-Time Configuration

However, most of the time we don’t want to change our system’s size of off_t for a specific piece of code but simply query it. While using getconf for this is fine, we can actually do this within C, since the <unistd.h> header already defines the same information as macros. We can write:

#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>

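// POSIX: a macro defined as -1 means the environment is NOT provided
#if defined(_POSIX_V7_ILP32_OFF32) && _POSIX_V7_ILP32_OFF32 != -1
// 32-bit = int = %d
# define PRIofft "d"
#elif defined(_POSIX_V7_LP64_OFF64) && _POSIX_V7_LP64_OFF64 != -1
// 64-bit = long = %ld
# define PRIofft "ld"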
#else
# error "Unsupported Programming Environment"
#endif


int main(void)
{
    off_t n = 0;
    printf("%" PRIofft "\n", n);
}

We use the trick that adjacent string literals (everything enclosed in "") are concatenated by the compiler right after preprocessing. So, if our off_t is 32 bits wide, the macro PRIofft expands to "d", which yields

printf("%" "d" "\n", n);

which in turn will become

printf("%d\n", n);

But hey, on 16 bit platforms (old 16 bit Windows or DOS, small microcontrollers) an int is just 16 bit and not 32 bit, you cannot use %d for off_t there! This is true, the C standard doesn’t specify that an int is 32 bit either, it just demands that it can hold all values in [-32767,+32767].

Luckily, C also has the header <stdint.h> providing us with fixed-width types like int32_t, and <inttypes.h> providing matching macros like PRId32 that expand to whatever format string we need to print a 32 bit integer.

We can now write:

#include <unistd.h>
#include <sys/types.h>
#include <stdio.h>
#include <inttypes.h>

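// POSIX: a macro defined as -1 means the environment is NOT provided
#if defined(_POSIX_V7_ILP32_OFF32) && _POSIX_V7_ILP32_OFF32 != -1
# define PRIofft PRId32 // 32-bit int
#elif defined(_POSIX_V7_LP64_OFF64) && _POSIX_V7_LP64_OFF64 != -1
# define PRIofft PRId64 // 64-bit int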
#else
# error "Unsupported Programming Environment"
#endif


int main(void)
{
    off_t n = 0;
    printf("%" PRIofft "\n", n);
}

The Classic Alternative

A bit simpler, you could also just cast the value to uintmax_t or intmax_t, respectively, and print it with %ju or %jd. A similar method was traditionally used before C99 came around, introducing fixed-width types and [u]intmax_t.

Before the introduction of these types, unsigned long long and long long int (then a widespread compiler extension, standardized only in C99) could reasonably be assumed to be the largest integer types available, sans other platform-specific extensions [2], so it was quite safe to cast any variable to these and print it that way. In the rare case that a platform defined, say, off_t to be larger than a long long int, one would have to work around this using classic preprocessor macros.

But Why?

Other languages such as Java simply chose to have an int be the same width on every system. This has obvious upsides, but also some drawbacks. A CPU that simply cannot provide integers that big can’t be used with this language [3]. Further, with C, an int is usually chosen to be of a size that’s efficient to use on the system at hand. With a fixed-width int this isn’t possible anymore.

Even more, C doesn’t even demand that a byte must be 8 bit, which is highly useful for those people programming low-level audio DSPs (or old IBM mainframes). In fact, C doesn’t guarantee much more than a char being one byte (however many bits that is, but at least 8 [4]), and that a short must be at least as big as a char, and so on.

POSIX goes a bit further by including all of the C standard, but also demanding absolutely ridiculous things like CHAR_BIT = 8 (while also providing us with open, read, write, getaddrinfo, …).

But even with POSIX, as we can see, many gaps are left to be filled, to allow a variety of implementations. By now, many of the knobs that POSIX allowed to be configured have more or less converged to a few sane (or not so sane) de facto “standards”, and many programming languages nowadays primarily use fixed-width types, as our many different CPU architectures (AMD64, PPC64, AARCH64, RISC-V64, …) have mostly agreed on some good behavior of integer widths, so maybe the need for non-fixed-width integers has indeed gone.

The Legacy, though, will live on forever.


  1. In fact, any type ending with _t is either C or POSIX, or someone messed up, since POSIX reserves all types ending with _t. If you plan to create your own type but want to stay POSIX compatible, don’t use _t.

  2. Yes, intmax_t can be larger than a long long int, and, in theory, off_t can be as well. It’s just unlikely.

  3. Well, actually it can (just like Haskell can provide infinite-width Integers), but it also will be infinitely slow.

  4. If you happen to feel a big desire to find out how many bits are in one byte on your platform, you can look at the CHAR_BIT macro from <limits.h>.