Unsized integer types and signedness

2016-07-16

Historically, C had many arithmetic types of platform-dependent sizes, and even more typedefs thereof. Initially it was intended that int would be of the machine word size that is the most natural for the architecture.[1] This is why int was assumed by default when a type was omitted. Integer promotions guarantee that all smaller integers are promoted to int before any calculations are performed.

C did not specify any fixed-width types, but at the same time allowed direct raw memory manipulation. Consequently, when people wanted types of set sizes (to guarantee a specific memory layout or for some other reason), they ended up using the unsized types as if they were fixed to whatever sizes they had on their concrete implementation. E.g. it was common to assume that int is 16-bit when developing for a 16-bit architecture.

However, hardware quickly evolved and registers got bigger. For backward compatibility reasons, implementations kept the sizes of the arithmetic types mostly unchanged. Therefore, when long was no longer sufficient, long long was introduced to utilize the newer hardware capacity, even though long could have fulfilled that role. This, in turn, required patching in additional I/O library support. In C++ it introduced even more bloat in the form of extra num_put and iostream overloads.

At the end of the day we are left with this horrible lot of types whose sizes differ from one platform to another, and there is no rule for choosing one type over another!

Type (widths in bits)	ISO minimum	Windows	FreeBSD amd64
char	8	8	8
short	16	16	16
int	16	32	32
long	32	32	64
long long	64	64	64
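To fill in the same row set for your own platform, something like the following should do. It is only a sketch: width_of and print_widths are names I made up for illustration, and the width is computed from sizeof and CHAR_BIT, which assumes the types have no padding bits (true on mainstream platforms).

```cpp
#include <climits>
#include <cstdio>

// Width in bits of T's object representation on this implementation.
// Assumes no padding bits.
template <typename T>
constexpr int width_of() {
	return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// The "ISO minimum" column must hold on every conforming implementation.
static_assert(width_of<char>() >= 8, "char");
static_assert(width_of<short>() >= 16, "short");
static_assert(width_of<int>() >= 16, "int");
static_assert(width_of<long>() >= 32, "long");
static_assert(width_of<long long>() >= 64, "long long");

// Prints one "type<TAB>width" row per built-in type.
void print_widths() {
	std::printf("char\t%d\n", width_of<char>());
	std::printf("short\t%d\n", width_of<short>());
	std::printf("int\t%d\n", width_of<int>());
	std::printf("long\t%d\n", width_of<long>());
	std::printf("long long\t%d\n", width_of<long long>());
}
```

Calling print_widths() from main prints the column for the implementation at hand.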

In retrospect this was a mistake. As a result, typedefs are most often introduced and used in place of the built-in types. Here is a summary of the typedefs available in the C standard alone:[2]

Type	Description
ptrdiff_t	Result of subtracting two pointers.
size_t	Result of `sizeof` operator.
intN_t	Width of exactly N bits.
int_leastN_t	The smallest type with a width of at least N bits.
int_fastN_t	The fastest type with a width of at least N bits.
intptr_t	Pointers can be converted to it and back.
intmax_t	The largest integer type. Used by the preprocessor.

Theoretically it should be easy to choose one of these types in every context based on its description. In reality, however, calculations quickly start mixing values coming from different parts of the program, and thereby implicitly mix different types. It seems that this plethora of typedefs is hard to use correctly, so eventually it fails to serve its original purpose.
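A classic instance of such mixing is comparing a signed count against a size. The sketch below uses a hypothetical helper, fits, and assumes a mainstream platform where size_t is at least as wide as int, so the usual arithmetic conversions turn the int into a size_t before the comparison.

```cpp
#include <cstddef>

// Hypothetical helper: does `count` fit below `capacity`?
// `count` is implicitly converted to size_t before comparing,
// so a negative count becomes a huge positive value.
constexpr bool fits(int count, std::size_t capacity) {
	return count < capacity;
}
```

Under these assumptions fits(3, 16) is true as expected, but fits(-1, 16) is false, because -1 converts to SIZE_MAX.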

This is madness.

One size fits all

Almost.

size_t and ptrdiff_t are closely tied to the machine pointer type.[3] size_t holds the sizes of objects, whereas ptrdiff_t holds the differences of addresses within objects. At first glance it might look like ptrdiff_t should be one bit wider than size_t. However, this is rarely the case: typical implementations have size_t and ptrdiff_t of the same size. In fact, the standard does not restrict the relation between the two in any way, and subtracting pointers that are more than PTRDIFF_MAX apart is already undefined, even if both point into the same array. Hence defining ptrdiff_t to be the signed version of size_t is consistent with the standard. Under that definition, PTRDIFF_MAX is always equal to SIZE_MAX/2.[4][5]
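Whether a given implementation already follows the "ptrdiff_t is the signed size_t" model can be checked at compile time. This is a sketch with a made-up function name; the property is what I observe on common platforms, not something the standard promises.

```cpp
#include <cstddef>
#include <cstdint>

// True when ptrdiff_t looks like the signed counterpart of size_t:
// same size, and PTRDIFF_MAX == SIZE_MAX / 2.
constexpr bool ptrdiff_is_signed_size() {
	return sizeof(std::ptrdiff_t) == sizeof(std::size_t)
	    && static_cast<std::size_t>(PTRDIFF_MAX) == SIZE_MAX / 2;
}
```

A static_assert(ptrdiff_is_signed_size(), "...") in a project header documents the assumption and fails loudly on a platform that breaks it.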

Pointers can be converted to intptr_t and back. This is useful mainly for address alignment. It is practically the same as ptrdiff_t, unless you are on a segmented architecture, which I hope you are not. In any case it is an optionally supported type, which cannot be relied on when targeting overly exotic architectures.
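For the alignment use case, a minimal sketch could look as follows. align_up is a hypothetical helper, it uses the unsigned sibling uintptr_t, and it assumes a flat address space where the pointer-to-integer mapping is the plain address (the mapping itself is implementation-defined).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds p up to the next multiple of `alignment`,
// which must be a nonzero power of two.
inline void* align_up(void* p, std::size_t alignment) {
	assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
	const std::uintptr_t mask = alignment - 1;
	const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
	// Add the mask, then clear the low bits.
	return reinterpret_cast<void*>((addr + mask) & ~mask);
}
```

An already aligned pointer is returned unchanged, since adding the mask and clearing the low bits cancels out.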

int_leastN_t is needed only if one believes they will ever port to an architecture where CHAR_BIT is larger than eight, which means it won’t even be POSIX compatible.

int_fastN_t is similarly unnecessary: all practically used architectures today can load registers zero- or sign-extended at no cost. So intN_t should be used if memory is at a premium, and ptrdiff_t otherwise. ptrdiff_t is used all over the place for address calculations, thus it is reasonable to expect that any architecture handles it well.

The only legit use of intmax_t is for formatting.[6] Taking into account that the implementation might support exact types larger than the native machine word by means of extended precision arithmetic, it is acceptable for intmax_t to denote the largest supported type, accessible with intN_t, for formatting purposes.[7]
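As an illustration of the formatting use, here is a sketch of a generic formatter that widens any signed integer to intmax_t and prints it with the printf length modifier j. format_signed is a hypothetical name, and the 64-byte buffer assumes intmax_t is no wider than 128 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats any signed integer type through a single code path
// by widening it to intmax_t first.
template <typename T>
std::string format_signed(T value) {
	char buf[64];
	const int n = std::snprintf(buf, sizeof buf, "%jd",
	                            static_cast<std::intmax_t>(value));
	// snprintf returns the length it wanted to write; check it fit.
	assert(n > 0 && static_cast<std::size_t>(n) < sizeof buf);
	return std::string(buf, static_cast<std::size_t>(n));
}
```

The point of the design is that one conversion specifier covers every signed type not wider than intmax_t, instead of one specifier per built-in type.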

Taking a pragmatic stance, we conclude that size_t, ptrdiff_t and intmax_t are the only unsized types needed. It is tempting to scrap all the built-in integer types (except for char) and redefine size_t and ptrdiff_t to be synonyms of unsigned and int respectively. This would be a significant language simplification.

Integer promotions

C promotes small integral operands to int before doing calculations. This is logical, since not all processors can do arithmetic on integers smaller than the machine word. However, x86-family processors can do that, and even for those that cannot (or where for some reason it is faster to do otherwise), today’s compilers are smart enough to understand that the intermediate narrowing in this code is unnecessary:

int8_t f(int8_t a, int8_t b, int8_t c) {
	return int8_t(a*b)*c;
}

This generates the same code as:

int8_t f(int8_t a, int8_t b, int8_t c) {
	return a*b*c;
}

By defining the result of arithmetic operators to be of the same type as both operands[8] we spare ourselves some surprises (e.g. when indexing char a[0x10000] as a[i + j] with uint16_t i, j, or other unexpected results). But most importantly, we get rid of an existing dangerous misfeature of the language: implicit narrowing conversions! They are needed today in order for the second version of f to compile, but they would no longer be needed following the modification.
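The a[i + j] surprise can be made concrete. In the sketch below (function names are mine, and int is assumed to be wider than 16 bits), the first function shows what the language does today, and the second emulates the proposed same-type rule with an explicit cast.

```cpp
#include <cstdint>

// Today: both operands are promoted to int, so the sum does not wrap
// at 16 bits and may exceed 0xFFFF.
constexpr long promoted_sum(std::uint16_t i, std::uint16_t j) {
	return i + j;
}

// Emulation of the proposed rule: the result has the operands' type
// and therefore wraps modulo 2^16.
constexpr std::uint16_t same_type_sum(std::uint16_t i, std::uint16_t j) {
	return static_cast<std::uint16_t>(i + j);
}
```

With i = 0xFFFF and j = 1 the promoted sum is 0x10000, one past the end of a 0x10000-element array, whereas the same-type sum is 0.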

Signedness of char

In C, char, signed char and unsigned char are distinct types, and the signedness of char is unspecified. When it happens to be signed (which is frequently the case), this, together with the integer promotion rules, has disastrous consequences: a char, or an int initialized directly from a char, is far too frequently used as an index into a 256-entry table. A signed char is frequently justified by the fact that plain int is a synonym for signed int. This argument does not hold water, however, because plain char is still a distinct type from signed char, unlike in the int case.
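Here is a sketch of the pitfall and of the usual workaround. Both function names are hypothetical, and CHAR_BIT is assumed to be 8.

```cpp
#include <climits>
#include <cstddef>

// The naive conversion: negative for bytes >= 0x80 when char is signed.
constexpr int naive_index(char c) {
	return c;
}

// Going through unsigned char first yields 0..255
// regardless of the signedness of plain char.
constexpr std::size_t table_index(char c) {
	return static_cast<unsigned char>(c);
}
```

On an implementation with a signed char, naive_index('\xE9') is -23, which indexes before the start of the table, while table_index('\xE9') is 233 everywhere.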

Is signedness needed?

I’ve heard a more radical opinion: since all modern hardware uses two’s-complement, the signedness does not have to be encoded in the type at all. Instead it should be a property of the relational operators, so that we will have signed and unsigned comparisons in the language but no signed and unsigned integer types.

Although this approach is interesting, relational operators are not the only operations that care about sign. In addition, there would have to be signed and unsigned division (including the right shift operator), and all extending conversions would need to be explicitly annotated. I think that this is too impractical in a language a level higher than assembly.
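To see how many operations would need a signed and an unsigned flavour, take the 32-bit pattern 0xFFFFFFF8, which is -8 in two’s complement. The sketch below uses hypothetical function names, one pair per operation. I leave right shift out of the code because shifting a negative value right was implementation-defined before C++20.

```cpp
#include <cstdint>

// Division: same bit pattern, different quotient.
constexpr std::int32_t  signed_half(std::int32_t x)    { return x / 2; }
constexpr std::uint32_t unsigned_half(std::uint32_t x) { return x / 2; }

// Comparison: same bit patterns, different ordering.
constexpr bool signed_less(std::int32_t a, std::int32_t b)     { return a < b; }
constexpr bool unsigned_less(std::uint32_t a, std::uint32_t b) { return a < b; }

// Extending conversion: sign extension versus zero extension.
constexpr std::int64_t  sign_extend(std::int32_t x)  { return x; }
constexpr std::uint64_t zero_extend(std::uint32_t x) { return x; }
```

Without signedness in the types, every call site would have to pick one function from each pair explicitly.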

Wide and Unicode characters

On a related note, the inclusion of wchar_t, char16_t and char32_t in the C++ standard is a mistake. UTF-8 is the only legitimate encoding on planet Earth. For the purpose of conversions to/from other encodings an exact-sized integer would do. The use of these extra types for overloading is extremely marginal, questionable and unjustifiable. For further details on UTF-8 read the UTF-8 Everywhere manifesto.

Footnotes

[1] C literally says ‘natural’ rather than ‘efficient’. Considering the complexity of modern architectures, it might no longer be the most efficient one. E.g. working with bytes might benefit from lower memory bandwidth usage and the possibility of better utilization of vectorization.
[2] I omit the unsigned equivalents for brevity.
[3] Or its linear part if we consider segmented architectures with a non-huge memory model.
[4] There is a restriction that applies to 16-bit architectures. ptrdiff_t must be big enough to hold the whole [-65535, 65535] range whereas a [0, 65536) range for size_t would suffice. Following the line of thought in this article, this restriction can be lifted. E.g. near pointer subtraction on 8086 won’t have a meaningful sign if the object is larger than 32767 bytes. However this is OK. Firstly, because there is no reason to provide stronger guarantees on 16-bit machines than on others. Secondly, the situation with pointers isn’t special, as it is the same with all modulo 2^N arithmetic.
[5] As for non two’s complement architectures, they have been extinct for more than 20 years already. However, even if they were not, addresses are likely to be positive quantities there, thus making pointer differences already fit within the same-sized size_t and ptrdiff_t model.
[6] Though unlike C, C++ num_put does not support formatting numbers larger than long long.
[7] Yet this is not always the case. Although there is __int128_t with Clang on my machine, intmax_t is still only 64-bit.
[8] If they are of different types, they will be promoted following rules similar to those that already apply for the case when the operands are larger than int.