Tag Archives: strings

Half-baked ideas: C strings with implicit length field

For more half-baked ideas, see the ideas tag.

If you prefer just to see the code, then it’s here.

Chris Siebenmann wrote a couple of interesting articles about C’s null terminated strings and how they pre-date C.

Chris notes an alternative is a length + string representation, as used in Pascal. Although there are libraries for this in C, there are several drawbacks and approximately no one uses them.

However it’s possible to have the best of both worlds: Strings using an implicit length field that takes up no extra storage. These strings are backwards compatible with ordinary C strings — you can literally pass them to legacy functions or cast them to char * — yet the equivalent of a strlen operation is O(1).

There are two ideas here: Firstly, when you use the C malloc function, malloc stashes some extra metadata about your allocation, and with most malloc implementations there is a function to obtain the size of the allocation from a pointer. In glibc, the function is called malloc_usable_size. Note that because of alignment concerns, the amount allocated is usually larger than the amount you originally requested.

The second idea comes from OCaml. OCaml stores strings in a clever internal representation which is both backwards compatible with C (a fancy way to say they are null terminated), and it allows you to get the real length of the string even though OCaml — like C — allocates more than requested for alignment reasons.

So here’s how we do it: When allocating an “implicit length string” (ilenstr) we store extra data in the final byte of the “full” malloced space, in the byte marked B in the diagram below:

| the string              | \0 |   ....     | B  |
<----- malloc we requested ---->
<----------- malloc actually allocated ---------->

If malloc allocated exactly the same amount of space as is used by our string + terminating null, then B is simply the terminating \0:

| the string              | \0 |

If malloc allocated 1 spare byte, we store B = 1:

| the string              | \0 | 1  |

If malloc allocated 4 spare bytes, we store B = 4:

| the string              | \0 |   ....       | 4  |

Getting the true length of the string is simply a matter of asking malloc for the allocated length (ie. calling malloc_usable_size), finding the last byte (B) and subtracting it. So we can get the true string length in an O(1) operation (usually, although this may depend on your malloc implementation).

ilenstr strings can contain \0 characters within the string.

ilenstr strings are also backwards compatible, in that we can pass one to any “legacy” C function, and assuming the string itself doesn’t contain any \0 inside it, everything just works.

Alright. This is terrible. DO NOT USE IT IN PRODUCTION CODE! It breaks all kinds of standards, is unportable etc. There are security issues with allowing \0-containing strings to be passed to legacy functions. Still, it’s a nice idea. With proper cooperation from libc, standards authorities and so on, it could be made to work.

Here is my git repo:



Filed under Uncategorized