Unsized integer types and signedness
2016-07-16
Historically C had many arithmetic types of platform-dependent sizes, and even more typedefs thereof. Initially the intent was that `int` would be of the machine word size most natural for the architecture.[1] This is why, when a type was omitted, `int` was assumed by default. Integer promotions guarantee that all smaller integers are first promoted to `int` before any calculations are performed.
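For instance (a minimal sketch, assuming the usual 8-bit `char` and 32-bit `int`), the product below is computed at `int` width and only then narrowed back:

```c++
#include <cstdio>

int main() {
    unsigned char a = 200, b = 2;
    // Both operands are promoted to int, so the product is 400,
    // not 400 % 256 == 144 as pure 8-bit arithmetic would give.
    if (a * b > 255)
        std::puts("the multiplication happened at int width");
    unsigned char c = a * b;   // narrowing back to 8 bits: c == 144
    std::printf("c == %d\n", c);
    return 0;
}
```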
C did not specify any fixed-width types, but at the same time allowed direct raw memory manipulation. Consequently, when one wanted types of set sizes (to guarantee a specific memory layout, or for some other reason), people ended up using the unsized types as if they were fixed to whatever they were on their concrete implementation. E.g. it was common to assume that `int` is 16-bit when developing for a 16-bit architecture.
However, the hardware quickly evolved and the registers got bigger. For backward compatibility reasons, the implementations kept the sizes of the arithmetic types mostly unchanged. Therefore, when `long` was not sufficient anymore, `long long` was introduced to utilize the newer hardware capacity, even though `long` could have fulfilled that role. This, in turn, required patching in additional I/O library support. In C++ it introduced even more bloat in the form of extra `num_put` and `iostream` overloads.
At the end of the day we are left with this horrible lot of types whose sizes all differ across circumstances, and there is no rule to choose one type over the other!
Type | ISO minimum (bits) | Windows | FreeBSD amd64 |
---|---|---|---|
`char` | 8 | 8 | 8 |
`short` | 16 | 16 | 16 |
`int` | 16 | 32 | 32 |
`long` | 32 | 32 | 64 |
`long long` | 64 | 64 | 64 |
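To see where a given platform falls in this table, it is enough to print the widths directly (a trivial sketch):

```c++
#include <climits>
#include <cstdio>

int main() {
    // Widths in bits of the built-in integer types on the current platform.
    std::printf("char      %zu\n", sizeof(char) * CHAR_BIT);
    std::printf("short     %zu\n", sizeof(short) * CHAR_BIT);
    std::printf("int       %zu\n", sizeof(int) * CHAR_BIT);
    std::printf("long      %zu\n", sizeof(long) * CHAR_BIT);
    std::printf("long long %zu\n", sizeof(long long) * CHAR_BIT);
    return 0;
}
```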
In retrospect this was a mistake. As a result, most of the time typedefs are introduced and used in place of the built-in types. Here is a summary of the typedefs available just in the C standard:[2]
Type | Description |
---|---|
`ptrdiff_t` | Result of subtracting two pointers. |
`size_t` | Result of the `sizeof` operator. |
`intN_t` | Width of exactly N bits. |
`int_leastN_t` | The smallest type with a width of at least N bits. |
`int_fastN_t` | The fastest type with a width of at least N bits. |
`intptr_t` | Pointers can be converted to it and back. |
`intmax_t` | The largest integer type. Used by the preprocessor. |
Theoretically it should be easy to choose one of these types in every context based on their description. However, in reality the calculations quickly start mixing values coming from different parts of the program, implicitly mixing different types. The plethora of typedefs turns out to be hard to use right, so eventually they fail to serve their original purpose.
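Here is a hypothetical sketch of how quickly such mixing goes wrong in ordinary code (the helper function and its misuse are made up for illustration):

```c++
#include <cstddef>
#include <vector>

// Hypothetical helper: bytes remaining after a prefix of `used` elements.
std::size_t bytes_left(const std::vector<int>& v, int used) {
    // v.size() is size_t, `used` is int: the subtraction silently converts
    // `used` to an unsigned type, and if `used` exceeds v.size() the result
    // wraps around to an enormous positive value.
    return (v.size() - used) * sizeof(int);
}

int main() {
    std::vector<int> v(3);
    return bytes_left(v, 5) != 0;   // "succeeds", returning a huge byte count
}
```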
This is madness.
One size fits all
Almost.
`size_t` and `ptrdiff_t` are closely tied to the machine pointer type.[3] `size_t` holds sizes of objects, whereas `ptrdiff_t` holds differences of addresses within objects. At first glance it might look like `ptrdiff_t` should be just one bit larger than `size_t`. However, this is rarely the case, as typical implementations have `size_t` and `ptrdiff_t` of the same size. In fact the standard does not restrict the relation between the two in any way, and subtraction of pointers which are further than `PTRDIFF_MAX` apart is already undefined, even if both point into the same array. Hence defining `ptrdiff_t` to be the signed version of `size_t` is consistent with the standard. Accordingly, `PTRDIFF_MAX` will always be equal to `SIZE_MAX/2`.[4][5]
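On mainstream flat-memory platforms this relation can even be checked directly; the sketch below asserts a property of a particular implementation, not something the standard guarantees:

```c++
#include <cstddef>
#include <cstdint>

// Holds on typical implementations where size_t and ptrdiff_t have the same
// width; the standard itself does not require either assertion.
static_assert(sizeof(std::size_t) == sizeof(std::ptrdiff_t),
              "size_t and ptrdiff_t assumed to be the same width");
static_assert(PTRDIFF_MAX == SIZE_MAX / 2,
              "ptrdiff_t behaves as the signed version of size_t");
```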
Pointers can be converted to `intptr_t` and back. This is useful mainly for address alignment. It is practically the same as `ptrdiff_t`, unless you are on a segmented architecture, which I hope you are not. In any case it is an optionally supported type which cannot be relied on for overly exotic architectures.
`int_leastN_t` is needed only if one believes they will ever port to an architecture where `CHAR_BIT` is larger than eight, which means it will not even be POSIX compatible.
`int_fastN_t` is similarly unnecessary: all practically used architectures today can load registers zero- or sign-extended at no cost. So `intN_t` should be used if memory is at a premium, or `ptrdiff_t` otherwise. `ptrdiff_t` is used all over the place for address calculations, thus it is reasonable to expect that any architecture handles it reasonably well.
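In other words, the default suggested here is exact-width types where the in-memory layout matters and `ptrdiff_t` for whatever indexes or counts; a sketch:

```c++
#include <cstddef>
#include <cstdint>

// Exact-width types where the memory layout matters...
struct Pixel {
    std::uint8_t r, g, b, a;
};

// ...and ptrdiff_t for indices and counts in the computation itself.
std::int32_t total_red(const Pixel* pixels, std::ptrdiff_t count) {
    std::int32_t sum = 0;
    for (std::ptrdiff_t i = 0; i < count; ++i)
        sum += pixels[i].r;
    return sum;
}
```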
The only legit use of `intmax_t` is for formatting.[6] Taking into account that the implementation might support exact types larger than the native machine word by means of extended-precision arithmetic, it is acceptable for `intmax_t` to denote the largest supported type, accessible with `intN_t`, for formatting purposes.[7]
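For instance, the portable way to print a value of an arbitrary integer type is to go through `intmax_t` and the `j` length modifier:

```c++
#include <cstdint>
#include <cstdio>

int main() {
    long long bytes = 1234567890;        // any integer type would do here
    // %jd expects exactly an intmax_t, so the cast keeps the call portable.
    std::printf("%jd bytes\n", static_cast<std::intmax_t>(bytes));
    return 0;
}
```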
Taking a pragmatic stance, we conclude that `size_t`, `ptrdiff_t` and `intmax_t` are the only needed unsized types. It is tempting to scrap all the built-in integer types (except for `char`) and redefine `size_t` and `ptrdiff_t` to be synonyms of `unsigned` and `int` respectively. This is a significant language simplification.
Integer promotions
C promotes small integral operands to `int` before doing calculations. This is logical, since not all processors can do arithmetic on integers smaller than the machine word. However, processors of the x86 line can, and even for those that cannot (or for which it is for some reason faster to do otherwise), today's compilers are smart enough to understand that the intermediate narrowing in this code is unnecessary:
```c++
int8_t f(int8_t a, int8_t b, int8_t c) { return int8_t(a*b)*c; }
```

This generates the same code as:

```c++
int8_t f(int8_t a, int8_t b, int8_t c) { return a*b*c; }
```
By defining the result of arithmetic operators to be of the same type as both operands,[8] we save ourselves some surprises (e.g. when indexing `char a[0x10000]` as `a[i + j]` with `uint16_t i, j`, or other unexpected results). But most importantly, we get rid of an existing dangerous misfeature in the language: implicit narrowing conversions! They are needed today in order for the second version of `f` to pass compilation, but they will not be needed following the modification.
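To make the indexing example concrete (a sketch assuming a 32-bit `int`):

```c++
#include <cstdint>

char a[0x10000];

char get(std::uint16_t i, std::uint16_t j) {
    // With a 32-bit int both operands are promoted, so i + j can reach
    // 0x1FFFE: the access runs past the array instead of wrapping mod 0x10000.
    return a[i + j];
    // Under the rules proposed above, i + j would stay a uint16_t and wrap,
    // keeping the index inside the array.
}
```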
Signedness of char
In C, `char`, `signed char` and `unsigned char` are distinct types, and the signedness of `char` is unspecified. When it happens to be signed (which is frequently the case), this, together with the integer promotion rules, results in disastrous consequences. A `char`, or an `int` initialized directly from a `char`, is far too frequently used as an index into a 256-entry table. A signed `char` is frequently justified by the fact that plain `int` is a synonym for `signed int`. This argument does not hold water, however, because plain `char` is still a type distinct from `signed char`, unlike in the `int` case.
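The classic way to get bitten looks roughly like this sketch:

```c++
int occurrences[256];

void count(const char* s) {
    for (; *s; ++s)
        // If plain char is signed, any byte above 0x7F (e.g. a UTF-8
        // continuation byte) becomes a negative index: undefined behaviour
        // instead of a count. The fix is occurrences[(unsigned char)*s].
        ++occurrences[*s];
}
```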
Is signedness needed?
I’ve heard a more radical opinion: since all modern hardware uses two's complement, signedness does not have to be encoded in the type at all. Instead it should be a property of the relational operators, so that we would have signed and unsigned comparisons in the language, but no signed and unsigned integer types.
Although this approach is interesting, relational operators are not the only operations that care about the sign. In addition there would be signed and unsigned division (including the right shift operator), and all extending conversions would need to be explicitly annotated. I think that this is too impractical in a language a level higher than assembly.
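To see how many operations the sign actually affects, take one two's-complement bit pattern and interpret it both ways (a sketch; the conversion and the right shift rely on the conventional implementation-defined behaviour):

```c++
#include <cstdint>
#include <cstdio>

int main() {
    std::uint32_t u = 0xFFFFFFF6u;               // the bit pattern of -10
    std::int32_t  s = static_cast<std::int32_t>(u);

    // Addition, subtraction and multiplication give the same bits either way,
    // but comparison, division and right shift all depend on the sign:
    std::printf("%d\n", u > 5u);                 // 1: unsigned compare
    std::printf("%d\n", s > 5);                  // 0: signed compare
    std::printf("%u\n", u / 2u);                 // 2147483643
    std::printf("%d\n", s / 2);                  // -5
    std::printf("%d\n", s >> 1);                 // -5 (arithmetic shift)
    return 0;
}
```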
Wide and Unicode characters
On a related note, the inclusion of `wchar_t`, `char16_t` and `char32_t` in the C++ standard is a mistake. UTF-8 is the only legitimate encoding on planet Earth. For the purpose of conversions to/from other encodings an exact-sized integer would do. The use of these extra types for overloading is extremely marginal, questionable and unjustifiable. For further details on UTF-8 read the UTF-8 Everywhere manifesto.
Footnotes
1. C literally says ‘natural’ rather than ‘efficient’. Considering the complexity of modern architectures, it might no longer be the most efficient one. E.g. working with bytes might benefit from lower memory bandwidth usage and the possibility of better utilization of vectorization.
2. I omit the unsigned equivalents for brevity.
3. Or its linear part, if we consider segmented architectures with a non-huge memory model.
4. There is a restriction that applies to 16-bit architectures: `ptrdiff_t` must be big enough to hold the whole [-65535, 65535] range, whereas a [0, 65536) range would suffice for `size_t`. Following the line of thought in this article, this restriction can be lifted. E.g. near-pointer subtraction on the 8086 won't have a meaningful sign if the object is larger than 32767 bytes. However, this is OK. Firstly, because there is no reason to provide stronger guarantees on 16-bit machines than on others. Secondly, the situation with pointers isn't special, as it is the same with all modulo-2^N arithmetic.
5. As for non-two's-complement architectures, they have been extinct for more than 20 years already. However, even if they were not, addresses are likely to be positive quantities there, thus making pointer differences already fit within the same-sized `size_t` and `ptrdiff_t` model.
6. Though, unlike C, the C++ `num_put` does not support formatting numbers larger than `long long`.
7. Yet this is not always the case. Although there is `__int128_t` with Clang on my machine, `intmax_t` is still only 64-bit.
8. If they are of different types, they will be promoted following rules similar to those that already apply for the case when the operands are larger than `int`.