Comment by TeMPOraL
4 years ago
You're joking, but now I'm thinking about the XML we parse at work and the library we're using to do it. We parse a lot of it, but I've always had this vague feeling that it takes a bit too long (given the codebase is C++).
The XML library we use is rather well-known, so if someone found a bug like this in it, I'd expect a general performance improvement across the entire industry. The Efficient Market Hypothesis tells me it's unlikely the library has this problem, but then again, that's what I thought about AAA video games, and then the GTA Online thing came out.
> it's unlikely the library has this problem
Any sufficiently-complex library code likely has plenty of problems, often unavoidably so (e.g. trade-offs between best performance and edge cases). Whether they have been found or not is a function of many, many factors.
> Efficient Market Hypothesis
I've lived long enough to be very sceptical about that sort of thing. Markets tend to be efficient in aggregate, maybe, but in any single case they can fail quite brutally. Look at how "dramatic" bugs are overlooked for years and years even in critical pieces of infrastructure like openssl; maybe it happens less for openssl than for most equivalent libraries, but it still happens.
Also, once the "market" for standards moves on, network effects make it very hard to have any meaningful competition. I mean, who writes XML parsers nowadays? Whichever XML lib was winning when JSON "happened" is now likely to stay in control of that particular segment, and the likelihood that top developers will keep reviewing it falls off a cliff. Sprinkle a bit of cargo-cultism on top, and "efficient markets" become almost a cruel joke.
> I've lived long enough to be very sceptical about that sort of thing.
I've also seen this story unfold too many times:
code code code build run fail
> dammit, I could have sworn this was correct?!
think think think GOTO 1
> no way, my code has to be wrong, this can't be the compiler?! it's never the compiler! right?
reduce code to a generic two-liner, build, run, fail
> oh.
open support ticket at compiler vendor
There's a variant / corollary of the Efficient Market Hypothesis here, though.
Let's say the GP's XML library has The GTA Bug, i.e. it uses a quadratic-performance loop when parsing. The bug will go undiscovered until any one consumer of the library a) sees enough performance impact to care, b) has the expertise to profile their application and find that the library is at fault, and c) reports the problem back to the library owner so that it can be fixed. That combination might be unlikely for any single consumer, but since only one consumer needs all three properties, the chance of the bug staying hidden shrinks as the number of library users grows.
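To make that concrete, here's an illustrative C++ sketch of the pattern (my own example, not from any particular library): parsing one big whitespace-separated blob of numbers with sscanf in a loop. The loop looks linear, but many C runtimes effectively strlen() the remaining buffer on every sscanf call, so the total work grows quadratically with input size.

```cpp
// Illustrative only - not any real library, just the accidentally-quadratic
// shape from the GTA Online write-up. Parsing N whitespace-separated numbers
// from one big buffer ends up reading O(N^2) characters, because each sscanf
// call can rescan (strlen) everything that's left.
#include <cstdio>
#include <string>
#include <vector>

std::vector<int> parse_numbers_slow(const std::string& blob) {
    std::vector<int> out;
    const char* p = blob.c_str();
    int value = 0;
    int consumed = 0;
    // " %d%n" skips leading whitespace, reads one int, and reports how many
    // characters were consumed so we can advance the pointer ourselves.
    while (std::sscanf(p, " %d%n", &value, &consumed) == 1) {
        out.push_back(value);
        p += consumed;  // ...but sscanf still doesn't know the remaining length
    }
    return out;
}
```

The linear fix is boring: advance a pointer yourself with strtol or std::from_chars, so nothing has to re-measure the whole remaining buffer per token.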
It's possible. I've personally reduced the time spent reading a huge XML file at the startup of an application I was in charge of by at least 10x, by dropping the library dependency and writing custom code. Having a lot of experience with that kind of code and with performance issues, it was quite a fast change with no negative effects.
The backstory was simple: up to some point the amount of data stored was reasonably small. Then the amount of data grew significantly (a few orders of magnitude), and the startup times became very unpleasant.
There's a lot going on when loading huge XML files. As an example, don't forget all the possible Unicode conversions, and all the allocations of elements in the handling code that are built up just to be discarded, etc.
I don't suggest everybody do this "just because". But if a specific use is known to have very specific assumptions, is in the "hot path" and really dominates (profile first!), and it's known that only a small subset of all XML possibilities will ever be used, then avoiding the heavy libraries can be justified. In that specific case, I knew the XML was practically always read and written only by the application, or by somebody who knew what they were doing, and not something that somebody random would regularly provide from the outside in some arbitrary form. I also knew my change wouldn't break anything for years to come, as I knew for sure that that part of the application was not the "hot spot" of expected future changes.
So it was a win-win: immensely faster application startup, which improved everybody's work, while preserving the "readability" of that file for the infrequent manual editing or inspection (and easy diffs).
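To give a flavour of what such custom code can look like: a hypothetical sketch (not my actual code), valid only under assumptions like the ones above, i.e. machine-written ASCII XML with no attributes, CDATA, entities, or comments.

```cpp
// Hypothetical sketch: a single forward pass over a restricted, machine-written
// XML subset. It hands out views into the original buffer, so there are no
// per-element allocations and no Unicode conversions.
#include <cstddef>
#include <functional>
#include <string_view>

void scan_restricted_xml(
        std::string_view doc,
        const std::function<void(std::string_view tag, std::string_view text)>& on_element) {
    std::size_t pos = 0;
    for (;;) {
        const std::size_t lt = doc.find('<', pos);
        if (lt == std::string_view::npos) return;
        const std::size_t gt = doc.find('>', lt + 1);
        if (gt == std::string_view::npos) return;
        const std::string_view tag = doc.substr(lt + 1, gt - lt - 1);

        const std::size_t next_lt = doc.find('<', gt + 1);
        const std::string_view text = (next_lt == std::string_view::npos)
            ? std::string_view{}
            : doc.substr(gt + 1, next_lt - gt - 1);

        // Report opening tags together with the text that follows them; skip
        // closing tags (</x>) and the declaration (<?xml ...?>).
        if (!tag.empty() && tag.front() != '/' && tag.front() != '?')
            on_element(tag, text);

        pos = gt + 1;
    }
}
```

The point isn't that this is robust XML handling (it isn't), but that when the format is fully under your control, a forward scan like this can replace a general-purpose parser entirely.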
I'm reminded of a 2008 article, Why is D/Tango so fast at parsing XML? [0]
One of the main factors seems to be that a lot of XML parser libraries, even the high-profile ones, did a lot of unnecessary copy operations. D's language features made it easy and safe to avoid unnecessary copying.
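In C++ terms (my own illustration, not the Tango code), the difference the article attributes to D's slices looks roughly like this:

```cpp
// Taking a substring of a std::string allocates and copies; taking a
// std::string_view "slice" just points into the existing buffer, which is
// essentially what D array slices give you for free.
#include <cstddef>
#include <string>
#include <string_view>

std::string copy_token(const std::string& doc, std::size_t off, std::size_t len) {
    return doc.substr(off, len);   // allocation + copy of len bytes
}

std::string_view slice_token(std::string_view doc, std::size_t off, std::size_t len) {
    return doc.substr(off, len);   // no allocation, no copy
}
```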
I wonder what became of that Tango code.
[0] https://web.archive.org/web/20140821164709/http://dotnot.org... , see also reddit discussion where WalterBright makes an appearance, https://old.reddit.com/r/programming/comments/6bt6n/why_is_d...
If you have a lot of nesting in the XML, and it is formatted for human reading (i.e. indented), you may want to consider not doing that. We had a project where we were creating human-readable versions of the XML (mostly for developer convenience) and then parsing it. When we stopped adding all the extra whitespace, the parsing speed increased by a couple of orders of magnitude. (The downside was we no longer had built-in coffee breaks in our development process.)
That's interesting. I can't think of a mechanism by which that would give so much of a performance boost, though - rejecting extra whitespace should be just a matter of a simple forward scan against a small set of characters, shouldn't it?
(Or maybe in your case something was running strlen() a lot during parsing, and just the difference in file size explains the boost?)
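For reference, the kind of forward scan I mean is just this (trivial sketch):

```cpp
// Skipping whitespace should be a cheap linear pass over a handful of
// characters. If dropping the indentation changes parse time by orders of
// magnitude, the suspicion is that the parser does extra per-character work
// (e.g. repeated strlen calls), so cost grows faster than the file size.
#include <cstddef>
#include <string_view>

std::size_t skip_whitespace(std::string_view buf, std::size_t pos) {
    while (pos < buf.size() &&
           (buf[pos] == ' ' || buf[pos] == '\t' ||
            buf[pos] == '\r' || buf[pos] == '\n'))
        ++pos;
    return pos;
}
```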
What about parsing that XML upfront, serialising to some binary format (e.g. CBOR, maybe with nlohmann's JSON library, or Cap'n Proto) and shipping the binary file?
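Something like this untested sketch, assuming the XML is first mapped to a JSON-like tree (the content here is made up; to_cbor/from_cbor are nlohmann/json's real API):

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>
#include <nlohmann/json.hpp>

int main() {
    // Assume the XML has already been flattened into a JSON-like tree.
    nlohmann::json doc;
    doc["version"] = 3;
    doc["items"] = { {{"id", 1}, {"name", "foo"}},
                     {{"id", 2}, {"name", "bar"}} };

    // Offline step: serialise once to CBOR and ship the binary file.
    const std::vector<std::uint8_t> cbor = nlohmann::json::to_cbor(doc);
    {
        std::ofstream out("data.cbor", std::ios::binary);
        out.write(reinterpret_cast<const char*>(cbor.data()),
                  static_cast<std::streamsize>(cbor.size()));
    }

    // Startup step: read and decode the compact blob instead of parsing XML.
    std::ifstream in("data.cbor", std::ios::binary);
    const std::vector<std::uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                                          std::istreambuf_iterator<char>());
    const nlohmann::json loaded = nlohmann::json::from_cbor(bytes);
    return loaded == doc ? 0 : 1;
}
```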
Would be cool if we could do that, but as things stand, enough various people want to occasionally look at these files, in environments where they can't just install specialized tooling and are stuck with notepad.exe (or Notepad++ if it's already available), that we keep it text.
I like binary formats, but we can't afford the increased complexity around supporting a custom binary format, so I'm not pushing for changes here.
I did investigate replacing our pile of XML files with an SQLite database, which would give us a fast and efficient format and let us use existing SQLite database viewers, or hit the file with trivial scripts, so we'd have no complexity from supporting a dedicated tool. However, the data model we use would need such a large overhaul (and related retraining) that we tabled this proposal for now.
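By "trivial scripts" I mean something on this order (hypothetical file, table, and column names):

```cpp
// Reading a hypothetical SQLite file with a few lines of the sqlite3 C API;
// the same query works verbatim in the sqlite3 CLI or any DB viewer.
#include <cstdio>
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open_v2("config.db", &db, SQLITE_OPEN_READONLY, nullptr) != SQLITE_OK)
        return 1;

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT key, value FROM settings ORDER BY key",
                       -1, &stmt, nullptr);
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        std::printf("%s = %s\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)),
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 1)));
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}
```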
I wonder if scanf on the PlayStation was not using strlen in that way. GTA was written for the PS, right?
It also runs on PC.