Comment by marcodave

5 years ago

First month or so at my new employer, a big consultancy firm working for a financial institution. They had a fairly complex distributed monolithic application integrated with Tibco EMS, Oracle DB and distributed XA transactions.

Regularly, but randomly, in production, after the input queues had received a good number of messages (which were then rerouted to other event queues for parallel processing), some DB transactions simply got stuck. Not rolled back, but stuck in limbo; after a while the DB refused new transactions because so many were stuck. Nobody had a clue why it was happening, and it meant regularly restarting the services by hand and re-feeding the failed messages. Users started to get fed up and the project was at risk of failing.

I got into it, and after a couple of weeks of investigation and trial and error with all possible weird flags, it turned out that our version of Tibco EMS had a weird behavior with distributed transactions when the queues got full of messages (queues had a 50MB size limit).

Instead of gracefully rolling back the JMS+JDBC XA transaction, it... kinda exited with an IO error.

Newer versions of Tibco EMS fixed that issue, but there was no way to get ops to install a new version. Since upgrading was out of the question, the actual fix was to enable message compression to limit the size of the messages coming into the queues. It turned out that the XML messages we sent there were up to 1.5MB each (!)
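To put rough numbers on why compression helped, here is a minimal Python sketch. It is not the EMS configuration from the story: the XML payload is made up (`make_sample_xml` and its `<trade>` rows are hypothetical), and it uses plain `zlib` only to show how well repetitive XML tends to shrink. The one figure taken from the story is the arithmetic: a 50MB queue holds only about 33 messages of 1.5MB each.

```python
import zlib

QUEUE_LIMIT_BYTES = 50_000_000   # 50MB queue limit from the story
MESSAGE_BYTES = 1_500_000        # largest XML message from the story


def make_sample_xml(target_bytes: int) -> bytes:
    """Build a repetitive XML payload of roughly target_bytes (made-up data)."""
    row = "<trade><id>{i}</id><currency>EUR</currency><amount>1000.00</amount></trade>"
    parts = ["<trades>"]
    size = len(parts[0])
    i = 0
    while size < target_bytes:
        item = row.format(i=i)
        parts.append(item)
        size += len(item)
        i += 1
    parts.append("</trades>")
    return "".join(parts).encode("utf-8")


# How many uncompressed 1.5MB messages fit before the queue is full?
uncompressed_fit = QUEUE_LIMIT_BYTES // MESSAGE_BYTES  # 33

payload = make_sample_xml(MESSAGE_BYTES)
compressed = zlib.compress(payload)

print(f"messages that fit uncompressed: {uncompressed_fit}")
print(f"original size:   {len(payload)} bytes")
print(f"compressed size: {len(compressed)} bytes")
print(f"ratio:           {len(compressed) / len(payload):.3f}")
```

Real messages will compress less dramatically than this artificial sample, and whether a broker counts compressed or uncompressed bytes against its limit depends on the product, so treat the ratio as illustrative only.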

After discovering that, I basically became a war hero, respected by the client as the "savior of the project". Good times.

Your compression workaround reminded me of an issue I ran into a while back.

My team at work uses a reporting tool for vulnerability assessments and pen-tests; basically you can import a bunch of data files, review them in the web app, and generate a report.

I would run into cases where I couldn't upload one of my data files. The web app is JS-heavy, lots of things going on in the background without much visible feedback. It turns out that the programmers had implemented the upload as this async task with a hard-coded timeout for completion, and they likely wrote it while they had great network speed.

I'm on DSL, and generally, it gets the job done. However, upload speed is only 1Mbit/s, so with a big file, my upload would time out. It's hard-coded, remember, so it didn't matter that the transfer was still making progress when it got clobbered.
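The arithmetic behind that is easy to sketch. The 1Mbit/s uplink is from the comment above; the timeout value and the file sizes are assumptions picked for illustration, since the actual hard-coded timeout isn't known.

```python
def upload_seconds(size_bytes: int, uplink_mbit_per_s: float) -> float:
    """Best-case transfer time, ignoring protocol overhead and latency."""
    bits = size_bytes * 8
    return bits / (uplink_mbit_per_s * 1_000_000)


UPLINK_MBIT_PER_S = 1.0        # from the comment
HYPOTHETICAL_TIMEOUT_S = 60    # assumption: the real value is unknown

for size_mb in (1, 5, 10, 50):
    secs = upload_seconds(size_mb * 1_000_000, UPLINK_MBIT_PER_S)
    verdict = "ok" if secs <= HYPOTHETICAL_TIMEOUT_S else "times out"
    print(f"{size_mb:>3} MB -> {secs:6.1f} s ({verdict})")
```

A 10MB file needs at least 80 seconds at 1Mbit/s, so any fixed timeout tuned on a fast office connection is easy to blow through from a slow uplink.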

It occurred to me that some file formats, like WAR or Office documents, are basically Zip archives under the hood, so I put my large XML file into one, and tried that.... and it worked! Something on the back-end quietly unzipped my upload and imported the file it contained.

Funnier still, when I mentioned it to the devs, this behaviour was not something they expected. It's probably built into a library they use.
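For anyone curious what that trick looks like end to end, here is a minimal Python sketch. The client side (`wrap_in_zip`) is just the standard `zipfile` module; the server side (`read_upload`) is purely a guess at how a library might transparently unpack an upload, not the actual tool's code. The file name `scan.xml` and the sample XML are made up.

```python
import io
import zipfile


def wrap_in_zip(name: str, data: bytes) -> bytes:
    """Client side: put one file into an in-memory Zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    return buf.getvalue()


def read_upload(raw: bytes) -> bytes:
    """Hypothetical back-end: if the upload is a Zip archive, return its
    first member; otherwise return the bytes unchanged."""
    stream = io.BytesIO(raw)
    if zipfile.is_zipfile(stream):
        stream.seek(0)
        with zipfile.ZipFile(stream) as zf:
            names = zf.namelist()
            if names:
                return zf.read(names[0])
    return raw


# Made-up, highly repetitive XML standing in for a large scan export.
xml = b"<report>" + b"<finding severity='high'/>" * 50_000 + b"</report>"
zipped = wrap_in_zip("scan.xml", xml)

print(f"raw upload:    {len(xml)} bytes")
print(f"zipped upload: {len(zipped)} bytes")
print("server sees same XML:", read_upload(zipped) == xml)
```

Fewer bytes on the wire means the upload finishes inside the timeout, and the back-end ends up importing the same XML either way.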