Signed quantities are a good default, and are easier to deal with when doing subtractions and mixing integers of different widths. (And "integers" includes pointers here, so it's very hard not to have different widths.)
However unsigned integers are still very useful, I'd say essential, in low-level programming. For example when doing buffer management and memory allocation.
- bitwise operations
- modular arithmetic implemented with just ++, -- (ring buffers, e.g. TCP sequence numbers)
- using the full range of an 8-bit, 16-bit, 32-bit datatype (quite common)
- splitting a positive quantity into two smaller quantities, e.g. using a 16-bit index as 8-bit major index plus 8-bit minor index.
etc.
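A minimal sketch of two of the items above, in C. The `ring` struct, its capacity, and all function names are made up for illustration; the only thing relied on is that unsigned arithmetic in C wraps modulo 2^N.

```c
#include <stdint.h>

#define RING_CAPACITY 8u  /* assumption: a power of two, so masking works */

/* Hypothetical ring buffer: head and tail are free-running counters that
   are allowed to wrap; only their difference is meaningful. */
struct ring {
    uint32_t head;                 /* total items ever written */
    uint32_t tail;                 /* total items ever read */
    uint8_t  data[RING_CAPACITY];
};

/* Number of stored items. Still correct after head has wrapped past
   UINT32_MAX, because unsigned subtraction is modulo 2^32. */
static uint32_t ring_count(const struct ring *r) {
    return r->head - r->tail;
}

static int ring_push(struct ring *r, uint8_t v) {
    if (ring_count(r) == RING_CAPACITY) return 0;      /* full */
    r->data[r->head++ & (RING_CAPACITY - 1u)] = v;
    return 1;
}

static int ring_pop(struct ring *r, uint8_t *out) {
    if (ring_count(r) == 0u) return 0;                 /* empty */
    *out = r->data[r->tail++ & (RING_CAPACITY - 1u)];
    return 1;
}

/* Splitting a 16-bit index into an 8-bit major and an 8-bit minor part. */
static uint8_t index_major(uint16_t idx) { return (uint8_t)(idx >> 8); }
static uint8_t index_minor(uint16_t idx) { return (uint8_t)(idx & 0xFFu); }
```

With signed counters the same code would need explicit handling for the moment the counter overflows; here the wrap is part of the design.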
Don't forget that the signed vs unsigned integer distinction is in some sense artificial. Machines have you put the distinction in the CPU instructions themselves; they don't track a "signed" property as part of values. And it can make sense to use the same value in different ways. However, C and many other languages decided to put a tag on the type, so operator syntax can be agnostic to signedness, and the compiler will choose the appropriate CPU instruction.
> However, C and many other languages decided to put a tag on the type, so operator syntax can be agnostic to signedness, and the compiler will choose the appropriate CPU instruction.
It mostly comes up with widening conversions (signed numbers must extend the sign bit, unsigned numbers set the extra bits to zero), unsigned/signed divide (and multiply, in case of a widened result) and greater than/less than comparisons (and of course geq/leq). (With signed comparison, A is less than B if by starting from INT_MIN (included) and iteratively incrementing you reach A before B. With unsigned comparison, A is less than B if by starting from 0 (included) and iteratively incrementing you reach A before B. This way of phrasing comparison as range inclusion is convenient, since it works around the wrapping concern in a rather clean way.)
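To make that concrete, here is a small C sketch (function names are mine, purely illustrative) where the same bit patterns go through the signed and the unsigned flavour of widening, comparison and division. The casts from signed to unsigned are well defined in C (reduction modulo 2^N), so this does not rely on implementation-defined behavior.

```c
#include <stdint.h>

/* Widening: the same 8-bit pattern 0xFF extends differently. */
static int32_t  widen_signed(int8_t v)    { return v; }  /* sign-extends */
static uint32_t widen_unsigned(uint8_t v) { return v; }  /* zero-extends */

/* Comparison: same 32-bit patterns, different ordering. */
static int less_signed(int32_t a, int32_t b) {
    return a < b;
}
static int less_unsigned(int32_t a, int32_t b) {
    return (uint32_t)a < (uint32_t)b;   /* reinterpret modulo 2^32 */
}

/* Division: the compiler has to pick a signed or an unsigned divide. */
static int32_t div_signed(int32_t a, int32_t b) {
    return a / b;
}
static uint32_t div_unsigned(int32_t a, int32_t b) {
    return (uint32_t)a / (uint32_t)b;
}
```

For example, `less_signed(-1, 1)` is true while `less_unsigned(-1, 1)` is false, because the pattern for -1 is 0xFFFFFFFF when read as unsigned.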
Systems programmers love to hate on unsigned integers. Generations have been infected with the Java world model that integers have to be pretend number lines centered on zero. Guess what, you still have boundary conditions to deal with. There are times when you really really need to use the full word range without negative values. This happens more often with low-level programming and machines with small word sizes, something fewer people are engaged in. It doesn't need to be the default. Ada has them sequestered as modular types, but they're available to use when needed.
Java doesn't have unsigned as primitive types, because James Gosling did a series of interviews at Sun among "expert" C devs, and all got the C language rules for unsigned arithmetic wrong.
Yes, I miss them in Java as primitives; however, there are utility methods for unsigned arithmetic that get it right.
The way he conducted those interviews, and the conclusions he drew from them, may have been flawed, because the situation now is that C has unsigned types and Java mostly does not.
And despite all pitfalls especially around mixing signed and unsigned in C, unsigned types are very useful, I'd in fact say that for low-level programming they are essential.
I find it the opposite. Unsigned integers are intuitive, while signed integers are unintuitive and cause a lot of tricky bugs, especially in languages where signed overflow is undefined behavior.
It's pretty rare to have values that can be negative but are always integers. At least in the work I do. The most common cases I encounter are approximations of something related to log probability, such as various scores in dynamic programming and graph algorithms.
Most of the time, when you deal with integers, you need special handling to avoid negative values. Once you get used to thinking about unsigned integers, you quickly develop robust ways of avoiding situations where the values would be negative.
Why does an unsigned type for sizes or indices fare worse than a signed type? When do I want the -247th element in an array? When do I have a block that is -10 bytes in size?
In Java, unsigned arithmetic is available through an API and, as you said, it is pretty much only needed when marshalling to certain wire protocols or for FFI. Built-in unsigned types are useful primarily for bitfields or similar tiny types with up to 6 bits or so.
> Systems programmers love to hate on unsigned integers
I don't see this hate in Rust. I think this is a big thing in the C-related languages, and that the author has chosen to pretend that's the same for any "systems language" but it is not.
> There are times when you really really need to use the full word range without negative values.
There are a few of those, but that is the niche case. Certainly when we're talking about 64-bit size types. And if you want to cater to smaller size types, then just template over the size type. Or, OK, some other trick if it's C rather than C++.
Sometimes (and very often in some scenarios/industries, e.g. HPC for graphics and simulation, with indices for things like points, vertices, primvars, voxels, etc.) you also want the datatype to be compact for memory/cache performance reasons, because you're storing millions of them and need random addressing (so you can't really bit-pack to, say, 36 bits, at least not without overhead compared to native types, which are really needed for maximum speed without any branching).
Losing half the range to make them signed when you only care about positive values 95% of the time (and in the rare case when you do any modulo on top of them you can cast, or write wrappers for that), is just a bad trade-off.
Yes, you've still then only doubled the range to 2^32, and you'll still hit it at some point, but that extra bit can make a lot of difference from a memory/cache efficiency standpoint without jumping to 64-bit.
So very often uint32_t is a very good sweet spot for size: int32_t is sometimes too small, and (u)int64_t is generally not needed and too wasteful.
>If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts. With C’s loose semantics, the problem is largely swept under the rug, but for Rust it meant that you’d regularly need to cast back and forth when dealing with sizes.
TBH I've had very little struggle with this at all. As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next. Needing casting then becomes a very clear sign that you're mixing sources and there be dragons, back up and fix the types or stop using the wrong variable. It's a low-cost early bug detector.
Implicitly casting between integer types though... yeah, that's an absolute freaking nightmare.
> As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next.
Part of me feels like direct numeric array indexing is one of the last holdouts of a low-level operation screaming for some standardized higher-level abstraction. I'm not saying to get rid of the ability to index directly, but if the error-resistant design here is to use numeric array indices as though they were opaque handles, maybe we just need to start building support for opaque handles into our languages, rather than just handing out numeric indices and using thoughts and prayers to stop people from doing math on them.
For analogy, it's like how standardizing on iterators means that Rust code rarely needs to index directly, in contrast with C, where the design of for-loops encourages indexing. But Rust could still benefit from opaque handles to take care of those in-between cases where iterators are too limiting and yet where numeric indices are more powerful than needed.
> Part of me feels like direct numeric array indexing is one of the last holdouts of a low-level operation screaming for some standardized higher-level abstraction.
Maybe this isn't what you're suggesting, but it's already possible to make an interface that prevents callers from doing math on indices in Rust — just return a struct that has a private member for the index. The caller can pass the value back at which point you can unwrap it and do index arithmetic.
You do need to store those if they're totally opaque though, e.g. how do you represent a range without holding N tokens? Often I like it, and it allows changing the underlying storage to be e.g. generational with ~no changes, but it kinda can't be enforced for runtime-cost reasons.
Using a unique type per array instance though, that I quite like, and in index-heavy code I often build that explicitly (e.g. in Go) because it makes writing the code trivially correct in many cases. Indices are only very rarely shared between arrays, and exceptions can and should look different because they need careful scrutiny compared to "this is definitely intended for that array".
1. use signed integers for everything except bit-wise operations and modulo math (e.g. "almost always signed")
2. make implicit sign conversion an error via `-Werror -Wsign-conversion`
The problem with making sizes and indices unsigned (even if they can't be negative) is that you might want to add negative offsets, and that either requires explicit casting in languages without implicit signed/unsigned conversion (i.e. additional hassle and reduced readability), or is a footgun area in languages with implicit sign conversion.
It's not really signed vs unsigned that's the issue, IMO. It's (mostly, in C) undefined behavior and implicit conversions?
I'm not sure Go is saner just because len is an int. Well, maybe, depending on how you look at it. Defining len to be signed int, means the largest valid len is half your address space, which also means half of all possible indexes are always invalid; which makes some things easier.
But it's really that integer arithmetic is not undefined behavior regardless of signedness, that bounds are checked, and that even indexing your slice with an int64 on a 32-bit CPU does the full correct bounds check. In fact, you can use any integer type as an index.
Given all of the above, whether you index with a uint or an int actually makes no difference. In that case, the bounds check is a single unsigned <len compare (despite the fact that len is signed).
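The single-compare trick is language independent, so here is a sketch of it in C rather than Go. This only illustrates the idea; it is not a claim about what any particular compiler emits, and the function names are invented.

```c
#include <stdint.h>

/* The obvious two-comparison bounds check for a signed index. */
static int in_bounds_two(int64_t i, int64_t len) {
    return i >= 0 && i < len;
}

/* Single-comparison form. Assumes len >= 0. A negative i converts to a
   value of at least 2^63, which is larger than any non-negative len,
   so it is rejected by the same compare. */
static int in_bounds_one(int64_t i, int64_t len) {
    return (uint64_t)i < (uint64_t)len;
}
```

The two functions agree for every `i` as long as `len` is non-negative, which is exactly what a signed length capped at half the address space gives you.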
What's really painful, is trying to handle a full 32-bit address space with 32-bit addresses and sizes, like in Wasm; you need 33-bit math. So in a sense, limiting sizes to 31-bit (signed) does help. But at the language level, IMO, the rest matters more.
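A rough C sketch of the "33-bit math" point, assuming a flat 32-bit address space of exactly 2^32 bytes (the function names are hypothetical). The end of a range can be 2^32 itself, which does not fit in 32 bits, so the check is done in a wider type.

```c
#include <stdint.h>

/* What a pure 32-bit computation gives: the end address wraps. */
static uint32_t end_wrapped(uint32_t addr, uint32_t size) {
    return addr + size;   /* e.g. 0xFFFFFFFF + 2 wraps to 1 */
}

/* Range check done with one extra bit of headroom (here: 64-bit math). */
static int range_ok(uint32_t addr, uint32_t size) {
    uint64_t end = (uint64_t)addr + (uint64_t)size;
    return end <= (UINT64_C(1) << 32);
}
```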
For signed overflow we have sanitizers, and for conversions we have compiler warnings in C. Bounds checking can also be done with sanitizers (but is a bit more tricky). So no, I do not think the undefined behavior is really a big problem. In fact, it helps us find the problem because every overflow can be considered a programming error.
Errors due to unsigned wraparound are a much bigger issue, because they lead to subtle issues where neither automatic warnings nor sanitizers help, exactly because the behavior is well-defined and no automatic tool can tell whether it is intended or wrong.
Just this week I had a C compiler silently delete an entire function call because of UB (infinite loop without side effects). Took me a day to figure out. So that's a problem for me.
I don't think I've ever had a hard-to-debug issue in Go because of signed/unsigned wraparound. Particularly a memory issue.
If anything, and there I guess I agree with the article, I wish Go had implicit conversions to wider types: to make the problematic ones stand out.
I guess the reason it doesn't is that they're different named types, which would be a problem when you create a named type for the purpose of forcing explicit type conversions. But maybe the default ones could implicitly implement a numeric tower, where exact conversions can be implicit.
> Error due to unsigned wraparound are a much bigger issue
This is a type design mistake. The unsigned integers should not wrap by default. It makes absolute sense, given all the constraints and the fact that it's doing New Jersey "implementation simplicity dominates" design, that K&R C only provides a wrapping unsigned type, but that's an excuse for K&R C, which is a 1970s programming language.
The excuse gets shakier and shakier the further you move past that. C3 even named these types differently, so they're certainly under no obligation to provide the wrapping unsigned integers as if that's just magically what you mean. In most cases it's not what you mean. The excuse given in the article is way too thin.
Rust's Wrapping<u32> is the same thing as the wrapping 32-bit unsigned integer in C or C++ today, but most people don't use it because they do not actually want the wrapping 32-bit unsigned integer. This is a "spelling matters" ergonomics class again like the choice to name the brutally fast but unstable general comparison sort [T]::sort_unstable whereas both C and C++ leave the noob who didn't know about sort stability to find out for themselves because they name this just "sort" and you get to keep both halves when you break things...
> But what about the range? While it’s true that you get twice the range, surprisingly often the code in the range above signed-int max is quite bug-ridden. Any code doing something like (2U * index) / 2U in this range will have quite the surprise coming.
Alas, (2S * signed_index) / 2S will similarly result in surprises the moment the signed_index hits half the signed-int max. There's no free lunch when trying to cheat the integer ranges.
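For the unsigned half of this, a small C sketch (assuming 32-bit `unsigned int`; function names are made up). The signed variant is deliberately not executed here, since in C the overflowing multiplication would be undefined behavior.

```c
#include <stdint.h>

/* (2U * index) / 2U in 32 bits: the multiplication wraps modulo 2^32
   before the division, so large indices come back as small ones. */
static uint32_t double_then_halve(uint32_t index) {
    return (uint32_t)(2u * index) / 2u;
}

/* Same expression with the intermediate widened to 64 bits. */
static uint32_t double_then_halve_wide(uint32_t index) {
    return (uint32_t)((UINT64_C(2) * index) / 2u);
}
```

For instance `double_then_halve(0x80000001u)` yields 1, a plausible-looking but wrong value, while the widened version returns the original index.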
The difference is that in the unsigned case you get a seemingly plausible value, and in the signed case you get a negative value which you can be sure is wrong. This is the problem.
In some languages, the signed version is undefined behavior. You may get a negative value, INT_MAX / 2, or an error. Or the compiler may detect the undefined behavior, which according to the standard cannot happen, and mutilate your code in unexpected ways.
There is a really convincing set of arguments against this idea by Robert Seacord[1]. I used to be in the signed size camp, but I've come around to preferring unsigned as much as possible because it's much easier to reason about. I think there are far more footguns than people realize when it comes to signed integers.
In my reading, what Stroustrup is saying is that, given other problems in C/C++, signed sizes are less bad than unsigned but both have clear and significant deficiencies. A new language doesn't have to inherit all of these deficiencies.
No. He says that signed/unsigned arithmetic is a universal problem. And in the context of std::span, using signed arithmetic is the correct choice rather than shoehorning in size_t to make it more cosmetically consistent with the rest of the STL.
> The former is easier to define, but has the downside of essentially “silencing warnings”. Let’s say the code was originally written to cast an u16 to u32, but later the variable type changes from u16 to u64 and the cast is now actually silently truncating things. Here we have casts becoming a sort of “silence all warnings”.
Well … we even mention Rust in the paragraph right before this. In Rust, you can widen a u16 to a u32 this way:
let bigger: u32 = x.into();
or
let bigger = u32::from(x);
The conversion `from` is infallible, because a u16 always fits in a u32. There is no `from(u64) -> u32`, because as the article notes, that would truncate, so if we did change the type to u64, the code would now fail to compile. (And we'd be forced to figure out what we want to do here.)
(There are fallible conversions, too, in the form of try_from, that can do u64 → u32, but will return an error if the conversion fails.)
Similarly, for,
for (uint x = 10; x >= 0; x--) // Infinite loop!
This is why I think implicit wrapping is a bad idea in language design. Even Rust went down the wrong path (in my mind) there, and I think has worked back towards something safer in recent years. But Rust provides a decent example here too; this is pseudo-code:
for (uint x = 10; x.is_some(); x = x.checked_sub(1))
Here `checked_sub` returns `None` instead of wrapping, providing us a means to detect the stopping point. So, something like that. (Though you'd probably also want to destructure the option into the uint for use inside the loop.) Of course, higher-level stuff always wins out here, I think, and in Rust you wouldn't write the above; instead something like,
for x in (0..=10).rev()
(And even then, if we need indexes; usually, one would prefer to iterate through a slice or something like that. The higher-level concept of iterators usually dispenses with most or all uses of indexes, and in the rare cases when needed, most languages provide something like `enumerate` to get them from the iterator.)
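For completeness, if you are stuck with C and an unsigned counter, one common way to write the countdown loop quoted above without the infinite loop is to test before decrementing. A sketch, with a hypothetical helper that just records the indices it visits:

```c
#include <stddef.h>

/* Visits n-1, n-2, ..., 0 and stores them in out (which must have room
   for n entries). Returns how many indices were visited.
   The condition `i-- > 0` compares first and decrements afterwards, so
   the body sees i = n-1 down to 0 and the loop exits when i was 0
   before the decrement. */
static size_t reverse_indices(size_t n, size_t *out) {
    size_t count = 0;
    for (size_t i = n; i-- > 0; ) {
        out[count++] = i;
    }
    return count;
}
```

To cover 10 down to 0 inclusive, as in the quoted loop, you would call it with `n = 11`. It also does nothing for `n = 0`, where a naive `i = n - 1` start would wrap.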
I might be a contrarian in that I actually like using unsigned integers for sizes and indexes. In my experience, most of their pitfalls can be avoided by treating any subtraction involving them as a `reinterpret_cast`: i.e.
* Do your utmost to rewrite the code in order to avoid doing that (e.g. reordering inequalities to transform subtractions into additions).
* If not possible, think very hard about any possible edge case: you most certainly need an additional `if` to deal with those.
* When analyzing other people's code during troubleshooting or merge reviews, assume any formula involving an unsigned integer and a minus sign is wrong.
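A sketch of what those rules look like in practice, using a made-up "does this sub-range fit in the buffer" check in C. The function names and the scenario are illustrative only.

```c
#include <stddef.h>

/* Suspicious: unsigned subtraction. If len > size, size - len wraps to
   a huge value and the check passes when it should fail. */
static int fits_buggy(size_t offset, size_t len, size_t size) {
    return offset <= size - len;
}

/* Guarded: the extra condition makes the subtraction safe. Rewriting it
   as offset + len <= size also removes the minus sign, but then the
   addition can wrap, so it is only valid if the sum cannot overflow. */
static int fits_guarded(size_t offset, size_t len, size_t size) {
    return len <= size && offset <= size - len;
}
```

`fits_buggy(0, 5, 4)` returns true even though 5 bytes cannot fit in a 4-byte buffer; `fits_guarded(0, 5, 4)` returns false.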
I am personally moving in the opposite direction. I haven't meaningfully used a signed integer in years, and I see signed integers as being for more niche use-cases. I mainly only use signed types when I want to do a "signed shift right". If there were a >>> operator in Zig I wouldn't even think of signed integers.
Given your examples, I think you'd have fewer issues if you were working with unsigned integers exclusively. Although I'm curious about what other code you were referencing with this:
"But seeing how each change both made the code easier to reason about and more correct, I couldn’t deny the evidence."
With regards to modulo, in Zig if you try to use it with a signed integer it will tell you to specify whether you want `@mod` or `@rem` semantics. In my case, I'd almost never write `x % 2`, I'd write `x & 1`. I do use unsigned division but I'd pretty much never write code that would emit the `div` instruction.
I'm not saying you're wrong though! Everyone has a different mind. If you attain higher correctness and understandability through using signed integers, that's great. I'm just saying I'm in the opposite camp.
Zig also differentiates between the wrapping and non-wrapping operators. The for loop example would toss a runtime error when the index underflowed in most compiler modes.
The if statement won't work since Zig would force a cast.
The tricky wrap sucks unless you use a power of 2. Then the Zig type can match (u4, u5, u7, etc.) and you would use wrapping arithmetic operators. And on smaller CPUs you NEED to use a power of 2 because division and mod are expensive.
I know language designers have a lot of trade-offs to consider... But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.
The potential bugs listed would be prevented if, e.g., "x--" didn't compile without explicitly supplying a case for x==0, or by using some more verbose methods like "decrement_with_wrap".
The trade-off is lack of C-like concise code, but more safe and explicit.
> But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.
Except that's not quite what unsigned types do. They are not (just) numbers that will always be >= 0, but numbers where the value of `1 - 2` is > 1 and depends on the type. This is not an accident but how these types are intended to behave because what they express is that you want modular arithmetic, not non-negative integers.
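To see that the value of `1 - 2` really depends on the type, here is a tiny C sketch using the fixed-width types (the helper names are mine). Note that for the 8-bit and 16-bit cases C first promotes the operands to `int`, so the wrap happens when the result is converted back.

```c
#include <stdint.h>

/* a - b is computed in int after promotion; the conversion back to the
   narrow unsigned type reduces it modulo 2^8 or 2^16. */
static uint8_t  sub_u8(uint8_t a, uint8_t b)    { return (uint8_t)(a - b); }
static uint16_t sub_u16(uint16_t a, uint16_t b) { return (uint16_t)(a - b); }

/* Returning uint32_t reduces the difference modulo 2^32. */
static uint32_t sub_u32(uint32_t a, uint32_t b) { return a - b; }
```

So `1 - 2` is 255, 65535 or 4294967295 depending on which of these you call.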
> e.g. "x--" won't compile without explicitly supplying a case for x==0
If you want non-negative types (which, again, is not what unsigned types are for) you also run into difficulties with `x - y`. It's not so simple.
There are many useful constraints that you might think it's "better to have a type that reflects that" - what about variables that can only ever be even? - but it's often easier said than done.
That's true for signed numbers too though? `int_min - 2 > int_min`
I agree they're a bit more error-prone in practice, but I suspect a huge part of that is because people are so used to signed numbers because they're usually the default (and thus most examples assume signed, if they handle extreme values correctly at all (much example code does not)). And, legitimately, zero is a more commonly-encountered value... but that can push errors to occur sooner, which is generally a desirable thing.
If you have "uint x" and "uint y", then for "x - y", the programmer should explicitly write two cases (a) no underflow, i.e. x >= y, and (b) underflow, x < y. The syntax for that... that is an open question.
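C has no syntax for this, but the two cases can at least be spelled out in a helper. A sketch with invented names: one variant reports the underflow to the caller, the other saturates at zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Case (a) x >= y: writes x - y to *out and returns true.
   Case (b) x <  y: leaves *out untouched and returns false. */
static bool checked_sub_u32(uint32_t x, uint32_t y, uint32_t *out) {
    if (x < y) {
        return false;
    }
    *out = x - y;
    return true;
}

/* Alternative policy: clamp at zero instead of reporting an error. */
static uint32_t saturating_sub_u32(uint32_t x, uint32_t y) {
    return x >= y ? x - y : 0u;
}
```

The caller is then forced to decide what `x < y` means at each call site, which is the point being made; whether that is worth the verbosity is the open question.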
> what about variables that can only ever be even
Yes, maybe you should have an "EvenInt" type, if that is important. Maybe you should be able to declare a variable to be 7...13, just like a "uint8" can declare something 0...255. Of course, the type-checker can get complicated, and perhaps simply fail to type-check some things. But, having compile-time constraints to what you know your variables will be is good, IMHO.
This is true, which means that a language has to be designed from the ground up to deal with these problems or there will always be inscrutable bugs due to misuse of arithmetic results. A simple example in a C-like language would be that the following function would not compile:
unsigned foo(unsigned a, unsigned b) { return a - b; }
but this would:
unsigned foo(unsigned a, unsigned b) {
auto c = a - b;
return c >= 0 ? c : 0;
}
Assuming 32-bit unsigned and int, the type of c should be computed as the range [-0xffffffff, 0xffffffff], which is different from int [-0x80000000, 0x7fffffff]. Subtle things like this are why I think it is generally a mistake to type annotate the result of a numerical calculation when the compiler can compute it precisely for you.
Note that in Zig, unsigned integers have the same semantics as signed integers on overflow (trap or wrap or UB).
You also have operators providing wrapping.
That is the correct solution.
I think it should be like in Pascal, where you have ranges as types, and then you can declare that the indices of a collection fall within such a range (and, very nicely, you can make it an enum):
> If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts.
I don’t really get this claim. Indexing should just look up the element corresponding to the value provided. It’s easy to come up with semantics that are intuitive and sound, even if signed integers or ones smaller than size_t are used.
Indexing does that, but the indices must vary in a certain range, whose limits are frequently determined by using something like "sizeof(array)/sizeof(element)" which is an unsigned number.
This is especially inconvenient in C, where there exist extremely dangerous legacy implicit casts between signed integers and unsigned integers, which have a great probability of generating incorrect values.
Because the index is typically a signed integer, comparing it with an unsigned limit without using explicit casts is likely to cause bugs. Using explicit casts of smaller unsigned integers towards bigger signed integers results in correct code, but it is cumbersome.
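A C sketch of the comparison problem (helper names are hypothetical). On common platforms, where `size_t` is at least as wide as `int`, writing `i < n` with `int i` and `size_t n` converts `i` to `size_t` first. The first function spells that conversion out explicitly so the behavior is visible and compiles without sign-compare warnings.

```c
#include <stddef.h>

/* What `i < n` effectively evaluates after C's implicit conversion:
   a negative i becomes a huge unsigned value, so -1 < n is false. */
static int less_than_converted(int i, size_t n) {
    return (size_t)i < n;
}

/* The mathematically intended comparison, with the sign handled first. */
static int less_than_math(int i, size_t n) {
    return i < 0 || (size_t)i < n;
}
```

For example, a loop such as `for (int i = -1; i < n; i++)` with an unsigned `n` never runs its body, which is the kind of bug the explicit version avoids.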
These problems are avoided, as said in TFA, by making "sizeof" and the like have 64-bit signed integer values instead of unsigned values.
Well-chosen implicit conversions are good for a programming language, because they reduce unnecessary verbosity, but the implicit integer conversions of C are just wrong, and they are by far the worst mistake of C, much worse than any other C feature.
Other C features are criticized because they may be misused by inexperienced or careless programmers, but most of the implicit integer conversions are just incorrect. There is no way of using them correctly. Only the conversions from a smaller signed integer to a bigger signed integer are correct.
Mixed signedness conversions have always been wrong and the conversions between unsigned integers have been made wrong by the change in the C standard that has decided that the unsigned integers are integer residues modulo 2^N and they are not non-negative integers.
For modular integers, the only correct conversions are from bigger numbers to smaller numbers, i.e. the opposite of the implicit conversions of C. The implicit conversions of C unsigned numbers would have been correct for non-negative integers, but in the current C standard there are no such numbers.
The current C standard is inconsistent, because the meaning of sizeof is of a non-negative integer and this is also true for the conversions between unsigned numbers, but all the arithmetic operations with unsigned numbers are defined to be operations with integer residues, not operations with non-negative numbers.
The hardware of most processors implements at least 3 kinds of arithmetic operations: operations with signed integers, operations with non-negative integers and operations with integer residues.
Any decent programming language should define distinct types for these kinds of numbers, otherwise the only way to use completely the processor hardware is to use assembly language. Because C does not do this, you have to use at least inline assembly, if not separate assembly source files, for implementing operations with big numbers.
My comment was a bit tongue in cheek. Obviously it is a hard problem. But in a profession where we work with machines that literally were made to crunch numbers, and where abstraction is something we deal with daily, why can’t we have a performant abstraction for doing arbitrary calculations? The answer is that to be performant it must be solved in hardware, which would cost more than the hardware we have.
So in fact it is not just telling me it’s a hard problem, it’s telling me that the cost-benefit is still not there. It’s like it’s just not a very important problem (in an economic sense). And that is what surprises me, given that computers were made to do arbitrary calculations.
I don't get it. Is this a parody of poor design decisions?
Sure, it's possible to write bugs in C. And if you really want to, you can disable the compiler warnings which flag tautologous comparisons and mixed-sign comparisons (a common reason for doing this is to avoid spurious warnings in generic-type code).
But, uhh, "people can deliberately write bugs" has got to be the weakest justification I've ever seen for changing a language feature -- especially one as fundamental as "sizes of objects can't be negative".
The C language does not have any data type that has the property "can't be negative".
Signed integers can be negative. The so-called "unsigned" integers of C are integer residues modulo 2^N, which are neither positive nor negative, i.e. these concepts are not applicable to "unsigned" integers.
An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".
So any sizeof value in C is negative (while also being positive).
In contradiction with what you say, the change described in TFA, by making sizes 64-bit signed integers, is the only method to guarantee that the sizes are non-negative in a language that does not have dedicated non-negative integers.
Other programming languages have non-negative integers, but C and C++ and many languages derived from them do not have such integers.
The arithmetic operations with non-negative integers differ from the arithmetic operations of C. On overflows and underflows, they either generate exceptions or have saturating behavior.
> An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".
This can be disproven by the fact that dividing by `unsigned e = 1U` is well defined and always yields the starting number. If the unsigned numbers were really modular numbers as you suggest, division could not be defined.
Leaving aside the fact that, yes, unsigned integer types are definitely not negative -- my point wasn't about types at all. Objects cannot take up a negative number of bytes of memory!
It seems like they've identified common bug patterns in C that would have been ameliorated by using signed, but come to the wrong conclusion that signed is the correct answer rather than that C is poorly designed for making the broken code the easy option.
Fix the language. Don't hack around it by using the wrong type.
I hate using languages that only have signed integers. Using integers that can’t be negative fits many problems nicely and avoids the edge case of having to check for negative.
You are perfectly right, but neither C nor C++ nor many more recent languages derived from them have non-negative integers.
The so-called "unsigned" integers of C are integer residues, where each value can be interpreted either as both positive and negative or as neither positive nor negative. In any case no "unsigned" value can be said to be non-negative.
You have to go back to languages not contaminated by C, like Ada, to find true non-negative integers among the primitive data types.
In C++, it is possible to define a non-negative integer type, which can have good performance if you implement its operations in assembly language.
However I am not aware of an open-source library including such a type.
I really appreciate your comments in this thread adrian_b. Could you point me at a brief summary of how Ada (or Pascal?) non-negative ints work? What is a compile error, what is a guaranteed run-time error, etc.
It's not "can't be negative", it's just that the semantics for negativity is wrapping around.
And - yes, there are very important use cases for unsigned/modulo-2^n/wraparound values. But sizes of data structures are generally _not_ one of those use cases. The fact that the size is non-negative does not mean that the type should be unsigned. You should still be able to, say, subtract sizes and get a signed value which may be negative.
That’s definitely not true. Unsigned ints have no “negativity” semantic. Wrapping around is what happens when you decrement the minimum value of any integer type, including signed types. Regardless of the type you use to represent an integer value that cannot legally be negative, you will have to take care not to allow your program to return values lower than zero for things like indices or sizes.
Systems programmers love to hate on unsigned integers. Generations have been infected with the Java world model that integers have to be pretend number lines centered on zero. Guess what, you still have boundary conditions to deal with. There are times when you really really need to use the full word range without negative values. This happens more often with low level programming and machines with small word sizes, something fewer people are engaged in. It doesn't need to be the default. Ada has them sequestered as modular types, but they are available to use when needed.
Java doesn't have unsigned as primitive types, because James Gosling did a series of interviews at Sun among "expert" C devs, and all got the C language rules for unsigned arithmetic wrong.
Yes, I miss them in Java as primitives; however, there are utility methods for unsigned arithmetic that get it right.
The way he conducted those interviews, and the conclusions he drew from them, may have been flawed. Because the situation now is that C has unsigned types and Java mostly does not.
And despite all pitfalls especially around mixing signed and unsigned in C, unsigned types are very useful, I'd in fact say that for low-level programming they are essential.
Java has char as an unsigned 16-bit integer type. They should have made byte unsigned as well.
Having them available is not the issue, using them for sizes and indices is what causes a lot of tricky bugs.
I find it the opposite. Unsigned integers are intuitive, while signed integers are unintuitive and cause a lot of tricky bugs. Especially in languages where signed overflow is undefined behavior.
It's pretty rare to have values that can be negative but are always integers. At least in the work I do. The most common cases I encounter are approximations of something related to log probability. Such as various scores in dynamic programming and graph algorithms.
Most of the time, when you deal with integers, you need special handling to avoid negative values. Once you get used to thinking about unsigned integers, you quickly develop robust ways of avoiding situations where the values would be negative.
Why does an unsigned type for sizes or indices fare worse than a signed type? When do I want the -247th element in an array? When do I have a block that is -10 bytes in size?
In Java, unsigned arithmetic is available through an API and, as you said, it is pretty much only needed when marshalling to certain wire protocols or for FFI. Built-in unsigned types are useful primarily for bitfields or similar tiny types with up to 6 bits or so.
I miss them for doing bit juggling like file headers or networking packets.
However I do concede writing a few helper methods isn't that much of a burden.
> Systems programmers love to hate on unsigned integers
I don't see this hate in Rust. I think this is a big thing in the C-related languages, and that the author has chosen to pretend that's the same for any "systems language" but it is not.
> There are times when you really really need to use the full word range without negative values.
There are a few of those, but that is the niche case. Certainly when we're talking about 64-bit size types. And if you want to cater to smaller size types, then just template over the size type. Or, OK, some other trick if it's C rather than C++.
Sometimes (and very often in some scenarios/industries, e.g. HPC for graphics and simulation with indices for things like points, vertices, primvars, voxels, etc.) you also want the datatype to be compact for memory/cache performance reasons, because you're storing millions of them and need random addressing (so you can't really bit-pack to, say, 36 bits, at least not without overhead compared to native types, which are really needed for maximum speed without any branching).
Losing half the range to make them signed when you only care about positive values 95% of the time (and in the rare case when you do any modulo on top of them you can cast, or write wrappers for that), is just a bad trade-off.
Yes, you've still then only doubled the range to 2^32, and you'll still hit it at some point, but that extra bit can make a lot of difference from a memory/cache efficiency standpoint without jumping to 64-bit.
So very often uint32_t is a very good sweet spot for size: int32_t is sometimes too small, and (u)int64_t is generally not needed and too wasteful.
>If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts. With C’s loose semantics, the problem is largely swept under the rug, but for Rust it meant that you’d regularly need to cast back and forth when dealing with sizes.
TBH I've had very little struggle with this at all. As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next. Needing casting then becomes a very clear sign that you're mixing sources and there be dragons, back up and fix the types or stop using the wrong variable. It's a low-cost early bug detector.
Implicitly casting between integer types though... yeah, that's an absolute freaking nightmare.
> As long as you keep your values and types separate, the unsigned type that you got a number from originally feeds just fine into the unsigned type that you send it to next.
Part of me feels like direct numeric array indexing is one of the last holdouts of a low-level operation screaming for some standardized higher-level abstraction. I'm not saying to get rid of the ability to index directly, but if the error-resistant design here is to use numeric array indices as though they were opaque handles, maybe we just need to start building support for opaque handles into our languages, rather than just handing out numeric indices and using thoughts and prayers to stop people from doing math on them.
For analogy, it's like how standardizing on iterators means that Rust code rarely needs to index directly, in contrast with C, where the design of for-loops encourages indexing. But Rust could still benefit from opaque handles to take care of those in-between cases where iterators are too limiting and yet where numeric indices are more powerful than needed.
> Part of me feels like direct numeric array indexing is one of the last holdouts of a low-level operation screaming for some standardized higher-level abstraction.
This paragraph reminds me a bit of Dex: https://arxiv.org/abs/2104.05372
Maybe this isn't what you're suggesting, but it's already possible to make an interface that prevents callers from doing math on indices in Rust — just return a struct that has a private member for the index. The caller can pass the value back at which point you can unwrap it and do index arithmetic.
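A minimal sketch of that private-member idea, assuming a hypothetical `arena` module with made-up `Arena` and `Handle` types (none of these names come from the thread or from any library). The index field is private to the module, so callers can store and pass the handle back but cannot do arithmetic on it:

```rust
// Hypothetical names throughout (`arena`, `Arena`, `Handle`); a sketch, not a library API.
mod arena {
    /// Opaque handle: the index field is private, so code outside this module
    /// cannot read it or do math on it.
    #[derive(Clone, Copy, Debug, PartialEq, Eq)]
    pub struct Handle(usize);

    pub struct Arena<T> {
        items: Vec<T>,
    }

    impl<T> Arena<T> {
        pub fn new() -> Self {
            Arena { items: Vec::new() }
        }

        /// Stores a value and hands back a handle that only this module can unwrap.
        pub fn insert(&mut self, value: T) -> Handle {
            self.items.push(value);
            Handle(self.items.len() - 1)
        }

        /// Unwraps the handle internally; any index arithmetic stays in here.
        pub fn get(&self, handle: Handle) -> Option<&T> {
            self.items.get(handle.0)
        }
    }
}
```

Outside the module, something like `handle.0 + 1` does not compile because the field is private, which is the whole point of the design.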
You do need to store those if they're totally opaque though, e.g. how do you represent a range without holding N tokens? Often I like it, and it allows changing the underlying storage to be e.g. generational with ~no changes, but it kinda can't be enforced for runtime-cost reasons.
Using a unique type per array instance though, that I quite like, and in index-heavy code I often build that explicitly (e.g. in Go) because it makes writing the code trivially correct in many cases. Indices are only very rarely shared between arrays, and exceptions can and should look different because they need careful scrutiny compared to "this is definitely intended for that array".
Finally a language doing the right thing :)
My two rules of thumb for C code are:
1. use signed integers for everything except bit-wise operations and modulo math (i.e. "almost always signed")
2. make implicit sign conversion an error via `-Werror -Wsign-conversion`
The problem with making sizes and indices unsigned (even if they can't be negative) is that you might want to add negative offsets, and that either requires explicit casting in languages without implicit signed/unsigned conversion (i.e. additional hassle and reduced readability), or is a footgun area in languages with implicit sign conversion.
But a blog doing the wrong thing. Who decided that light grey on white was a great way to present text?
For anyone else struggling to read it, Ctrl-A will make it legible.
It's not really signed vs unsigned that's the issue, IMO. It's (mostly, in C) undefined behavior and implicit conversions?
I'm not sure Go is saner just because len is an int. Well, maybe, depending on how you look at it. Defining len to be signed int, means the largest valid len is half your address space, which also means half of all possible indexes are always invalid; which makes some things easier.
But it's really that integer arithmetic is not undefined behavior regardless of signedness, that bounds are checked, and that even indexing your slice with an int64 on a 32-bit CPU does the full correct bounds check. In fact, you can use any integer type as an index.
Given all of the above, whether you index with a uint or an int actually makes no difference. In that case, the bounds check is a single unsigned `< len` compare (despite the fact that len is signed).
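The single-compare bounds check can be sketched like this. It is written in Rust rather than Go, the function name `in_bounds` is made up for illustration, and it assumes `len` is never negative (as is true for a slice length):

```rust
/// Returns true when `i` is a valid index for a sequence of length `len`.
/// Both values are signed, as in Go; `len` is assumed to be non-negative.
fn in_bounds(i: i64, len: i64) -> bool {
    // Reinterpreting a negative `i` as unsigned gives a value of at least 2^63,
    // which is larger than any non-negative `len`. One unsigned compare
    // therefore rejects both "too large" and "negative".
    (i as u64) < (len as u64)
}
```

So `in_bounds(-1, 10)` is rejected by the same comparison that rejects `in_bounds(10, 10)`, without a separate `i >= 0` test.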
What's really painful, is trying to handle a full 32-bit address space with 32-bit addresses and sizes, like in Wasm; you need 33-bit math. So in a sense, limiting sizes to 31-bit (signed) does help. But at the language level, IMO, the rest matters more.
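To illustrate the 33-bit point, here is a rough sketch of a range check over a 32-bit address space. The function names and the `mem_size` parameter are assumptions made for the example, not anything from Wasm itself:

```rust
/// Checks whether the byte range [addr, addr + size) fits inside a memory of
/// `mem_size` bytes. `addr + size` can need 33 bits, so both operands are
/// widened to u64 before adding.
fn range_fits(addr: u32, size: u32, mem_size: u64) -> bool {
    u64::from(addr) + u64::from(size) <= mem_size
}

/// The same check done purely in 32 bits, with the wraparound made explicit.
/// A range that runs past the end of the address space wraps to a small
/// number and is wrongly accepted.
fn range_fits_naive(addr: u32, size: u32, mem_size: u32) -> bool {
    addr.wrapping_add(size) <= mem_size
}
```

Note that a full 4 GiB memory size (2^32) does not even fit in a `u32`, which is why the widened version takes `mem_size` as `u64`.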
For signed overflow we have sanitizers, and for conversions C compilers have warnings. Bounds checking can also be done with sanitizers (but is a bit more tricky). So no, I do not think the undefined behavior is really a big problem. In fact, it helps us find the problem because every overflow can be considered a programming error.
Errors due to unsigned wraparound are a much bigger issue, because they lead to subtle issues where neither automatic warnings nor sanitizers help, exactly because it is well-defined and no automatic tool can tell whether the behavior is intended or wrong.
Do you always run with those sanitizers in place?
Just this week I had a C compiler silently delete an entire function call because of UB (infinite loop without side effects). Took me a day to figure out. So that's a problem for me.
I don't think I've ever had a hard to debug issue in Go because of signed/unsigned wrap around. Particularly a memory issue.
If anything, and there I guess I agree with the article, I wish Go had implicit conversions to wider types: to make the problematic ones stand out.
I guess the reason it doesn't is that they're different named types, which would be a problem when you create a named type for the purpose of forcing explicit type conversions. But maybe the default ones could implicitly implement a numeric tower, where exact conversions can be implicit.
> Errors due to unsigned wraparound are a much bigger issue
This is a type design mistake. The unsigned integers should not wrap by default. Given all the constraints, and the fact that it's doing New Jersey "implementation simplicity dominates" design, it makes absolute sense that K&R C only provides a wrapping unsigned type. But that's an excuse for K&R C, which is a 1970s programming language.
The excuse gets shakier and shakier the further you move past that. C3 even named these types differently, so they're certainly under no obligation to provide the wrapping unsigned integers as if that's just magically what you mean. In most cases it's not what you mean. The excuse given in the article is way too thin.
Rust's Wrapping<u32> is the same thing as the wrapping 32-bit unsigned integer in C or C++ today, but most people don't use it because they do not actually want the wrapping 32-bit unsigned integer. This is a "spelling matters" ergonomics class again like the choice to name the brutally fast but unstable general comparison sort [T]::sort_unstable whereas both C and C++ leave the noob who didn't know about sort stability to find out for themselves because they name this just "sort" and you get to keep both halves when you break things...
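As a small sketch of what opting into wrapping looks like, here is `std::num::Wrapping` used for a sequence-number style counter, where wraparound is the intended behavior. The function names `next_seq` and `seq_distance` are made up for the example:

```rust
use std::num::Wrapping;

/// Advances a 32-bit sequence number. Wrapping past u32::MAX is intended
/// here, and the type says so.
fn next_seq(seq: Wrapping<u32>) -> Wrapping<u32> {
    seq + Wrapping(1)
}

/// Number of steps from `from` to `to`, computed modulo 2^32.
fn seq_distance(from: Wrapping<u32>, to: Wrapping<u32>) -> u32 {
    (to - from).0
}
```

With a plain `u32`, the same `seq + 1` at `u32::MAX` panics in a default debug build, so the wrapping behavior has to be asked for by name.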
> But what about the range? While it’s true that you get twice the range, surprisingly often the code in the range above signed-int max is quite bug-ridden. Any code doing something like (2U * index) / 2U in this range will have quite the surprise coming.
Alas, (2S * signed_index) / 2S will similarly result in surprises the moment the signed_index hits half the signed-int max. There's no free lunch when trying to cheat the integer ranges.
The difference is that in the unsigned case you get a seemingly plausible value, and in the signed case you get a negative value which you can be sure is wrong. This is the problem.
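A sketch of the two cases being compared, assuming 32-bit types. Wrapping is made explicit with `wrapping_mul` so the example models two's-complement wraparound in both cases; in C the signed version would be undefined behavior rather than a guaranteed negative result. The function names are made up:

```rust
/// Unsigned: the intermediate product wraps, and the final value still looks
/// like a perfectly valid index.
fn double_then_halve_unsigned(index: u32) -> u32 {
    index.wrapping_mul(2) / 2
}

/// Signed, with two's-complement wrapping: the intermediate product goes
/// negative, so the result is visibly out of range.
fn double_then_halve_signed(index: i32) -> i32 {
    index.wrapping_mul(2) / 2
}
```

For an index of 3,000,000,000 the unsigned version yields 852,516,352, which nothing downstream will flag, while a signed index of 1,500,000,000 yields -647,483,648.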
In some languages, the signed version is undefined behavior. You may get a negative value, INT_MAX / 2, or an error. Or the compiler may detect the undefined behavior, which according to the standard cannot happen, and mutilate your code in unexpected ways.
Bjarne agrees.
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p14...
There is a really convincing set of arguments against this idea by Robert Seacord[1]. I used to be in the signed size camp, but I've come around to preferring unsigned as much as possible because it's much easier to reason about. I think there are far more footguns than people realize when it comes to signed integers.
[1] https://www.youtube.com/watch?v=82jVpEmAEV4
In my reading, what Stroustrup is saying is that, given other problems in C/C++, signed sizes are less bad than unsigned ones, but both have clear and significant deficiencies. A new language doesn't have to inherit all of these deficiencies.
No. He says that signed/unsigned arithmetic is a universal problem. And in the context of std::span, using signed arithmetic is the correct choice rather than shoehorning in size_t to make it more cosmetically consistent with the rest of the STL.
> The former is easier to define, but has the downside of essentially “silencing warnings”. Let’s say the code was originally written to cast an u16 to u32, but later the variable type changes from u16 to u64 and the cast is now actually silently truncating things. Here we have casts becoming a sort of “silence all warnings”.
Well … we even mention Rust in the paragraph right before this. In Rust, you can widen a u16 to a u32 this way:
or
The conversion `from` is infallible, because a u16 always fits in a u32. There is no `from(u64) -> u32`, because as the article notes, that would truncate, so if we did change the type to u64, the code would now fail to compile. (And we'd be forced to figure out what we want to do here.)
(There are fallible conversions, too, in the form of try_from, that can do u64 → u32, but will return an error if the conversion fails.)
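The snippets in this comment did not survive on this page; as a stand-in, here is a minimal sketch of the infallible and fallible conversions being described. `From`, `Into` and `TryFrom` are the standard library traits; the wrapper function names `widen` and `narrow` are made up:

```rust
use std::convert::TryFrom;

/// Infallible widening: this compiles only because every u16 fits in a u32.
/// If the parameter type were changed to u64, `u32::from` would stop compiling.
fn widen(small: u16) -> u32 {
    u32::from(small) // equivalently: small.into()
}

/// Fallible narrowing: returns None when the value does not fit in a u32.
fn narrow(big: u64) -> Option<u32> {
    u32::try_from(big).ok()
}
```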
Similarly, for,
This is why I think implicit wrapping is a bad idea in language design. Even Rust went down the wrong path (in my mind) there, and I think has worked back towards something safer in recent years. But Rust provides a decent example here too; this is pseudo-code:
Where `checked_sub` returns `None` instead of wrapping, providing us a means to detect the stopping point. So, something like that. (Though you'd probably also want to destructure the option into the uint for use inside the loop.) Of course, higher-level stuff always wins out here, I think, and in Rust you wouldn't write the above; instead something like,
(And even then, if we need indexes; usually, one would prefer to iterate through a slice or something like that. The higher-level concept of iterators usually dispenses with most or all uses of indexes, and in the rare cases when needed, most languages provide something like `enumerate` to get them from the iterator.)
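The original snippets are missing here, so the following is only a guess at the shape of what is being described: a countdown over unsigned indices that uses `checked_sub` to detect the stopping point, next to the higher-level reversed-range form. Both function names are made up:

```rust
/// Visits indices len-1 down to 0, using checked_sub to detect the stopping
/// point instead of relying on a `>= 0` test that can never fail for unsigned.
fn indices_descending_checked(len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut next = len.checked_sub(1); // None when len == 0
    while let Some(i) = next {
        out.push(i);
        next = i.checked_sub(1); // None after i == 0, which ends the loop
    }
    out
}

/// The higher-level form: a reversed range, with no manual decrement at all.
fn indices_descending_range(len: usize) -> Vec<usize> {
    (0..len).rev().collect()
}
```

Both produce the same sequence, including the empty one for `len == 0`, which is the case that usually breaks a hand-written unsigned countdown.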
I might be a contrarian in that I actually like using unsigned integers for sizes and indexes. In my experience, most of their pitfalls can be avoided by treating any subtraction involving them as a `reinterpret_cast`, i.e.:
* Do your utmost to rewrite the code in order to avoid doing that (e.g. reordering inequalities to transform subtractions into additions).
* If not possible, think very hard about any possible edge case: you most certainly need an additional `if` to deal with those.
* When analyzing other people's code during troubleshooting or merge reviews, assume any formula involving an unsigned integer and a minus sign is wrong.
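A small sketch of the "turn the subtraction into an addition" rewrite from the first point, using a made-up "is there an element after index `i`" check. The buggy version uses `wrapping_sub` to model what an unsigned `len - 1` does in C when `len` is 0:

```rust
/// Subtracting form: `len - 1` wraps to usize::MAX when len == 0, so the
/// check wrongly succeeds for an empty sequence.
fn has_next_subtracting(i: usize, len: usize) -> bool {
    i < len.wrapping_sub(1)
}

/// Rearranged form: the subtraction moved to the other side as an addition.
/// Assumes `i` is not usize::MAX, which holds for any real index.
fn has_next_adding(i: usize, len: usize) -> bool {
    i + 1 < len
}
```

The two agree for every non-empty length; they differ only at `len == 0`, which is the edge case the rewrite removes.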
I am personally moving in the opposite direction. I haven't meaningfully used a signed integer in years, and I see signed integers as being for more niche use-cases. I mainly use signed types only when I want to do a "signed shift right". If there was a >>> operator in Zig I wouldn't even think of signed integers.
Given your examples, I think you'd have fewer issues if you were working with unsigned integers exclusively. Although I'm curious about what other code you were referencing with this: "But seeing how each change both made the code easier to reason about and more correct, I couldn’t deny the evidence."
With regards to modulo, in Zig if you try to use it with a signed integer it will tell you to specify whether you want `@mod` or `@rem` semantics. In my case, I'd almost never write `x % 2`, I'd write `x & 1`. I do use unsigned division but I'd pretty much never write code that would emit the `div` instruction.
I'm not saying you're wrong though! Everyone has a different mind. If you attain higher correctness and understandability through using signed integers, that's great. I'm just saying I'm in the opposite camp.
Zig also differentiates between the wrapping and non-wrapping operators. The for loop example would toss a runtime error when the index underflowed in most compiler modes.
The if statement won't work since Zig would force a cast.
The tricky wrap sucks unless you use a power of 2. Then the Zig type can match (u4, u5, u7, etc.) and you would use wrapping arithmetic operators. And on smaller CPUs you NEED to use a power of 2 because division and mod are expensive.
I know language designers have a lot of trade-offs to consider... But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.
The potential bugs listed would be prevented if, e.g., "x--" did not compile without explicitly supplying a case for x==0, or if you had to use some more verbose method like "decrement_with_wrap".
The trade-off is losing C-like concise code, but the result is safer and more explicit.
> But I would say if you know a value will logically always be >= 0, better to have a type that reflects that.
Except that's not quite what unsigned types do. They are not (just) numbers that will always be >= 0, but numbers where the value of `1 - 2` is > 1 and depends on the type. This is not an accident but how these types are intended to behave because what they express is that you want modular arithmetic, not non-negative integers.
> e.g. "x--" won't compile without explicitly supplying a case for x==0
If you want non-negative types (which, again, is not what unsigned types are for) you also run into difficulties with `x - y`. It's not so simple.
There are many useful constraints that you might think it's "better to have a type that reflects that" - what about variables that can only ever be even? - but it's often easier said than done.
That's true for signed numbers too though? `int_min - 2 > int_min`
I agree they're a bit more error-prone in practice, but I suspect a huge part of that is because people are so used to signed numbers because they're usually the default (and thus most examples assume signed, if they handle extreme values correctly at all (much example code does not)). And, legitimately, zero is a more commonly-encountered value... but that can push errors to occur sooner, which is generally a desirable thing.
> you also run into difficulties with `x - y`.
If you have "uint x" and "uint y", then for "x - y", the programmer should explicitly write two cases (a) no underflow, i.e. x >= y, and (b) underflow, x < y. The syntax for that... that is an open question.
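One existing spelling of those two cases, as a sketch: Rust's `checked_sub` returns an `Option`, and a `match` on it forces both branches to be written. The `Difference` enum and the `subtract` function are made up for the example:

```rust
/// Hypothetical result type that keeps the two cases apart.
#[derive(Debug, PartialEq)]
enum Difference {
    /// Case (a): x >= y, the ordinary result.
    NonNegative(u32),
    /// Case (b): x < y, reporting by how much.
    Underflow { shortfall: u32 },
}

fn subtract(x: u32, y: u32) -> Difference {
    match x.checked_sub(y) {
        Some(d) => Difference::NonNegative(d),
        // Here x < y is known, so y - x cannot underflow.
        None => Difference::Underflow { shortfall: y - x },
    }
}
```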
> what about variables that can only ever be even
Yes, maybe you should have an "EvenInt" type, if that is important. Maybe you should be able to declare a variable to be 7...13, just like a "uint8" can declare something 0...255. Of course, the type-checker can get complicated, and perhaps simply fail to type-check some things. But, having compile-time constraints to what you know your variables will be is good, IMHO.
This is true, which means that a language has to be designed from the ground up to deal with these problems or there will always be inscrutable bugs due to misuse of arithmetic results. A simple example in a C-like language would be that the following function would not compile:
but this would:
Assuming 32 bit unsigned and int, the type of c should be computed as the range [-0xffffffff, 0xffffffff], which is different from int [-0x80000000, 0x7fffffff]. Subtle things like this are why I think it is generally a mistake to type annotate the result of a numerical calculation when the compiler can compute it precisely for you.
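The range argument can be checked with a sketch that does by hand what such a compiler would do automatically: compute the difference of two 32-bit unsigned values in a type wide enough for every result. The function names are made up:

```rust
use std::convert::TryFrom;

/// Difference of two u32 values with no wrapping. The mathematical range is
/// [-0xffff_ffff, 0xffff_ffff], which needs 33 bits in two's complement,
/// so i64 is the narrowest built-in type that holds it.
fn exact_difference(a: u32, b: u32) -> i64 {
    i64::from(a) - i64::from(b)
}

/// True when that exact difference would also fit in a 32-bit signed int.
fn difference_fits_i32(a: u32, b: u32) -> bool {
    i32::try_from(exact_difference(a, b)).is_ok()
}
```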
Note that in Zig, unsigned integers have the same semantics as signed integers on overflow (trap or wrap or UB). You also have operators providing wrapping. That is the correct solution.
I think it should be like in Pascal, where you have size ranges as types, and then you can declare that this collection falls in this range (and, very nicely, you can make it an enum):
https://www.freepascal.org/docs-html/ref/refsu4.html
And none of those "amenities" existed anymore in Oberon.
Is the text on this page really #bbbdc3 on #ffffff? How is anyone supposed to be able to read that?
Weirdly, you have to turn on javascript for the text color to change...
For me it's #353841 on #ffffff which meets WCAG AAA standards for accessible text.
So his compiler cannot detect the unsigned overflows and instead chooses to call it a user mistake!
Sizes and indices of course need to be unsigned, and any self respecting compiler should warn about dangerous usage.
> If sizes are unsigned, like in C, C++, Rust and Zig – then it follows that anything involving indexing into data will need to either be all unsigned or require casts.
I don’t really get this claim. Indexing should just look up the element corresponding to the value provided. It’s easy to come up with semantics that are intuitive and sound, even if signed integers or ones smaller than size_t are used.
Indexing does that, but the indices must vary in a certain range, whose limits are frequently determined by using something like "sizeof(array)/sizeof(element)" which is an unsigned number.
This is especially inconvenient in C, where there exist extremely dangerous legacy implicit casts between signed integers and unsigned integers, which have a great probability of generating incorrect values.
Because the index is typically a signed integer, comparing it with an unsigned limit without using explicit casts is likely to cause bugs. Using explicit casts of smaller unsigned integers towards bigger signed integers results in correct code, but it is cumbersome.
These problems are avoided as said in TFA, by making "sizeof" and the like to have 64-bit signed integer values, instead of unsigned values.
Well chosen implicit conversions are good for a programming language, by reducing unnecessary verbosity, but the implicit integer conversions of C are just wrong and they are by far the worst mistake of C, much worse than any other C feature.
Other C features are criticized because they may be misused by inexperienced or careless programmers, but most of the implicit integer conversions are just incorrect. There is no way of using them correctly. Only the conversions from a smaller signed integer to a bigger signed integer are correct.
Mixed signedness conversions have always been wrong and the conversions between unsigned integers have been made wrong by the change in the C standard that has decided that the unsigned integers are integer residues modulo 2^N and they are not non-negative integers.
For modular integers, the only correct conversions are from bigger numbers to smaller numbers, i.e. the opposite of the implicit conversions of C. The implicit conversions of C unsigned numbers would have been correct for non-negative integers, but in the current C standard there are no such numbers.
The current C standard is inconsistent, because the meaning of sizeof is of a non-negative integer and this is also true for the conversions between unsigned numbers, but all the arithmetic operations with unsigned numbers are defined to be operations with integer residues, not operations with non-negative numbers.
The hardware of most processors implements at least 3 kinds of arithmetic operations: operations with signed integers, operations with non-negative integers and operations with integer residues.
Any decent programming language should define distinct types for these kinds of numbers, otherwise the only way to use completely the processor hardware is to use assembly language. Because C does not do this, you have to use at least inline assembly, if not separate assembly source files, for implementing operations with big numbers.
Not sure what change in the C standard you mean. unsigned was always modulo. Otherwise, use -Wsign-conversion.
I don’t understand how dealing with numbers correctly is not a solved problem in computer engineering by now.
Maybe it's telling you it's a hard problem?
My comment was a bit tongue in cheek. Obviously it is a hard problem. But in a profession where we work with machines that literally were made to crunch numbers, and where abstraction is something we deal with daily, why can’t we have a performant abstraction for doing arbitrary calculations? The answer is that to be performant it must be solved in hardware, which would cost more than the hardware we have.
So in fact it is not just telling me it’s a hard problem, it’s telling me that the cost-benefit is still not there. It’s like it’s just not a very important problem (in an economic sense). And that is what surprises me, given that computers were made to do arbitrary calculations.
I don't get it. Is this a parody of poor design decisions?
Sure, it's possible to write bugs in C. And if you really want to, you can disable the compiler warnings which flag tautologous comparisons and mixed-sign comparisons (a common reason for doing this is to avoid spurious warnings in generic-type code).
But, uhh, "people can deliberately write bugs" has got to be the weakest justification I've ever seen for changing a language feature -- especially one as fundamental as "sizes of objects can't be negative".
The C language does not have any data type that has the property "can't be negative".
Signed integers can be negative. The so-called "unsigned" integers of C are integer residues modulo 2^N, which are neither positive nor negative, i.e. these concepts are not applicable to "unsigned" integers.
An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".
So any sizeof value in C is negative (while also being positive).
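The residue view in the `unsigned short` example can be checked with a short sketch, assuming a 16-bit unsigned type as on most platforms. Rust's `as` conversion to `u16` keeps the low 16 bits, which is the same reduction modulo 2^16 that C applies when converting to a 16-bit `unsigned short`. The function name is made up:

```rust
/// Reduces any integer to its residue modulo 2^16, the way conversion to a
/// 16-bit unsigned type does.
fn to_u16_residue(n: i64) -> u16 {
    n as u16 // keeps the low 16 bits
}
```

Under this conversion 1, 65537 and -65535 all map to the same stored value, which is the sense in which the comment calls them "the same number".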
In contradiction with what you say, the change described in TFA, by making sizes 64-bit signed integers, is the only method to guarantee that the sizes are non-negative in a language that does not have dedicated non-negative integers.
Other programming languages have non-negative integers, but C and C++ and many languages derived from them do not have such integers.
The arithmetic operations with non-negative integers differ from the arithmetic operations of C. On overflows and underflows, they either generate exceptions or have saturating behavior.
> An alternative view is that any C "unsigned" is both positive and negative. For example the unsigned short "1" is the same number as "65537" and as "-65535".
This can be disproven by the fact that dividing by `unsigned e = 1U` is well defined and always yields the starting number. If the unsigned numbers were really modular numbers as you suggest, division could not be defined.
Are you claiming that the following program could possibly print "-1" ?
If not, why?
Leaving aside the fact that, yes, unsigned integer types are definitely not negative -- my point wasn't about types at all. Objects cannot take up a negative number of bytes of memory!
It seems like they've identified common bugs patterns in C that would have been ameliorated by using signed, but come to the wrong conclusion that signed is the correct answer rather than that C is poorly designed for making the broken code the easy option.
Fix the language. Don't hack around it by using the wrong type.
This is already fixed in C via `_BitInt` types and disabling implicit integer sign casting.
I hate using languages that only have signed integers. Using integers that can’t be negative fits many problems nicely and avoids the edge case of having to check for negative.
You are perfectly right, but neither C nor C++ nor many more recent languages derived from them have non-negative integers.
The so-called "unsigned" integers of C are integer residues, where each value can be interpreted either as both positive and negative or as neither positive nor negative. In any case no "unsigned" value can be said to be non-negative.
You have to go back to languages not contaminated by C, like Ada, to find true non-negative integers among the primitive data types.
In C++, it is possible to define a non-negative integer type, which can have good performance if you implement its operations in assembly language.
However I am not aware of an open-source library including such a type.
I really appreciate your comments in this thread adrian_b. Could you point me at a brief summary of how Ada (or Pascal?) non-negative ints work? What is a compile error, what is a guaranteed run-time error, etc.
It's not "can't be negative", it's just that the semantics for negativity is wrapping around.
And - yes, there are very important use cases for unsigned/modulo-2^n/wraparound values. But sizes of data structures are generally _not_ one of those use cases. The fact that the size is non-negative does not mean that the type should be unsigned. You should still be able to, say, subtract sizes and get a signed value which may be negative.
That’s definitely not true. Unsigned ints have no “negativity” semantic. Wrapping around is what happens when you decrement the minimum value of any integer type, including signed types. Regardless of the type you use to represent an integer value that cannot legally be negative, you will have to take care not to allow your program to return values lower than zero for things like indices or sizes.