Modern x86 CPUs have actual instructions for strcpy that work fairly well. There were several false starts along the way, but the performance is fine now.
They have instructions for memcpy/memmove (i.e. rep movs), not for strcpy.
They also have instructions for strlen (i.e. repne scasb), so you could implement strcpy with very few instructions by finding the length and then copying the string.
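A rough sketch of that "find the length, then copy" idea, assuming GCC or Clang inline assembly on x86-64 (it falls back to the libc calls elsewhere). The names asm_strlen, asm_memcpy and asm_strcpy are made up for illustration, and the NUL search is spelled "repne scasb" here. This only shows how few instructions are needed, not that it is fast:

```c
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#define HAVE_X86_STRING_ASM 1
#else
#define HAVE_X86_STRING_ASM 0
#endif

/* Hypothetical helper: length of s via "repne scasb". */
static size_t asm_strlen(const char *s)
{
#if HAVE_X86_STRING_ASM
    const char *p = s;
    size_t count = (size_t)-1;      /* RCX: upper bound on bytes scanned */
    __asm__ volatile(
        "repne scasb"               /* compare AL with [RDI], advance RDI */
        : "+D"(p), "+c"(count)
        : "a"(0)                    /* AL = 0, the byte we search for */
        : "cc", "memory");
    return (size_t)(p - s) - 1;     /* RDI stops one past the NUL */
#else
    return strlen(s);
#endif
}

/* Hypothetical helper: copy n bytes via "rep movsb". */
static void *asm_memcpy(void *dst, const void *src, size_t n)
{
#if HAVE_X86_STRING_ASM
    void *d = dst;
    __asm__ volatile(
        "rep movsb"                 /* copy RCX bytes from [RSI] to [RDI] */
        : "+D"(d), "+S"(src), "+c"(n)
        :
        : "memory");
    return dst;
#else
    return memcpy(dst, src, n);
#endif
}

/* strcpy built from the two helpers: measure, then copy including the NUL. */
static char *asm_strcpy(char *dst, const char *src)
{
    asm_memcpy(dst, src, asm_strlen(src) + 1);
    return dst;
}
```

Like strcpy itself, this does no bounds checking, so the caller has to guarantee that dst is large enough.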
Running strlen first, then validating the sizes, and then copying with memcpy if possible is actually the recommended way to implement a replacement for strcpy, including in the parent article.
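In plain C that pattern might look roughly like the following. The function name, the return convention and the choice to leave dst untouched on failure are my own assumptions, not something taken from the parent article:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical strcpy replacement: returns 0 on success, or -1 if src
 * (including its terminating NUL) does not fit into dst_size bytes.
 * On failure dst is left unmodified. */
static int copy_string_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len = strlen(src);       /* step 1: find the length */

    if (len >= dst_size)            /* step 2: validate, leaving room for NUL */
        return -1;

    memcpy(dst, src, len + 1);      /* step 3: bulk copy, NUL included */
    return 0;
}
```

The upside is that the actual copy goes through memcpy, so it gets whatever optimized path the libc picks for that size.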
On modern Intel/AMD CPUs, "rep movs" is usually the optimal way to implement memcpy above some threshold of data size, e.g. on older AMD Zen 3 CPUs the threshold was 2 kB. I have not tested more recent CPUs to see if the threshold has diminished.
On the old AMD Zen 3 there was also a size range above 2 kB, at sizes comparable to the L3 cache, where the "rep movs" implementation interacted badly with the cache in some way, and "non-temporal" vector register transfers outperformed "rep movs". Despite that performance bug for certain lengths, using "rep movs" for any size above 2 kB gave good enough performance.
More recent CPUs might be better than that.
Whoops, this proves I’m not really a userspace assembly programmer…
But you can indeed safely read past the end of a buffer if you don’t cross a page boundary and you aren’t bound by the rules of, say, C.
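For illustration, the page-boundary condition can be written as a small check. This is only a sketch: the 4096-byte page size is an assumption (real code would query the platform), and the helper name is hypothetical. It only does address arithmetic; the over-read itself is still outside what C permits, which is why real implementations do that part in assembly:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size; query the OS in real code. */
#define ASSUMED_PAGE_SIZE 4096u

/* Hypothetical helper: true if a load of `width` bytes starting at p
 * stays entirely inside the page that contains p. */
static bool load_stays_in_page(const void *p, size_t width)
{
    uintptr_t offset = (uintptr_t)p & (ASSUMED_PAGE_SIZE - 1);
    return offset + width <= ASSUMED_PAGE_SIZE;
}
```

A vectorized string routine could use a check like this to decide whether a 16-byte or 32-byte load near the end of a string is safe, or whether it should fall back to a byte-at-a-time tail.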
x86-64 has the REP prefix for string operations. When combined with the MOVS instruction, that is pretty much an instruction for strcpy.
The spec and some sanitizers use a scalar loop (because they need to avoid mistakenly detecting UB), but real-world libc implementations seem unlikely to use a scalar loop.
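For reference, the scalar loop in question is roughly the textbook version below (the name scalar_strcpy is just for illustration). It never touches a byte past the terminating NUL, which is the property a sanitizer cares about:

```c
#include <stddef.h>

/* Byte-at-a-time strcpy: reads and writes exactly strlen(src) + 1 bytes. */
static char *scalar_strcpy(char *dst, const char *src)
{
    char *out = dst;

    /* Copy each byte, stopping right after the NUL has been copied. */
    while ((*out++ = *src++) != '\0') {
        /* nothing else to do per byte */
    }
    return dst;
}
```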