Comment by ruhith

15 hours ago

The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex: it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.
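
To make the find-and-replace failure mode concrete, here is a minimal sketch. The sample body and the `naive_decode` helper are made up for illustration and are not taken from the actual emails or tooling; `quopri` is Python's standard-library quoted-printable codec.

```python
import quopri

# Made-up sample body (an assumption for illustration), quoted-printable encoded:
# "=20" is a space, "=C3=A9" is UTF-8 "é", "=3D" is a literal "=",
# and "=" followed by CRLF is a soft line break that should disappear.
RAW = b"Dinner=20at 5=\r\npm, caf=C3=A9; 2+2=3D4."


def naive_decode(data: bytes) -> bytes:
    """Hypothetical hand-rolled decoder: patches only the escapes someone noticed."""
    # Misses CRLF soft breaks, multi-byte characters, and the escaped "=" itself.
    return data.replace(b"=\n", b"").replace(b"=20", b" ")


def proper_decode(data: bytes) -> str:
    """Decode with the standard library, then interpret the bytes as UTF-8."""
    return quopri.decodestring(data).decode("utf-8")


if __name__ == "__main__":
    print(naive_decode(RAW))   # stray "=" sequences survive
    print(proper_decode(RAW))
```

The naive version looks fine on the inputs it was written against, which is presumably how this kind of thing ships.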

34 comments


lvncelot 14 hours ago

> It's the same class of bug as manually parsing HTML with regex: it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

josefx 13 hours ago
I prefer the question about CPU pipelines that gets explained using a railroad switch as an example. That one does a decent job of answering the question instead of going off on a, how to best put it, mentally deranged one-page rant about regexes, with the lazy throwaway line at the end being the only thing that makes it qualify as an answer at all.
- kapep 13 hours ago
  
  The regex answer is from the very old days of Stack Overflow, before fun was banned. I agree it barely qualifies as an answer, but considering that the question has over 4 million page views (which almost puts it in the top 100 most viewed questions of all time), it has reached a lot of people. The answer probably had much more influence than any serious answer on that topic. So I'd say the author did a good job.
  
- MrGilbert 13 hours ago
  
  For anyone wondering about the railroad switch post: https://stackoverflow.com/questions/11227809/why-is-processi...
  
- bityard 10 hours ago
  
  But--and this is crucial--the one about regexes is hilarious.
  It also comes from a time in Internet culture when humor was appreciated instead of aggressively downvoted.
  
perching_aix 9 hours ago
It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?
Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.
What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.
- somat 8 hours ago
  
  The thing is, even when parsing HTML "correctly" (whatever that is), regexes will still be used. Sure, there will be a bunch of additional structures and mechanisms involved, but you will be identifying tokens via a bunch of regexes.
  So yes, while it is an inspired comedic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places that those poor maligned regular expressions will be used when parsing HTML.
- tiagod 9 hours ago
  
  I think even for single opening tags, as asked, there are impossible edge cases.
  For example, this is perfectly valid XHTML:
  <a href="/" title="<a /> />"></a>
  
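
For anyone who wants to poke at the edge case from the sub-thread above, here is a small sketch comparing a naive tag regex with Python's lenient standard-library `html.parser` on that exact snippet. The `NAIVE_TAG` pattern and the `TagCollector` class are illustrative names, not anything proposed in the thread.

```python
import re
from html.parser import HTMLParser

# The snippet from the comment above: a ">" hiding inside a quoted attribute.
SNIPPET = '<a href="/" title="<a /> />"></a>'

# Naive "tag" pattern (an assumption): everything from "<" up to the first ">".
NAIVE_TAG = re.compile(r"<[^>]+>")


class TagCollector(HTMLParser):
    """Records every start tag together with its parsed attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def parse_start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print(NAIVE_TAG.findall(SNIPPET))   # the first match stops inside the title value
    print(parse_start_tags(SNIPPET))    # one <a> tag with href and title attributes
```

A regex that understands quoted attribute values can get past this particular input; the sketch only shows that the obvious pattern does not.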
bayesnet 13 hours ago
I know this is grumpy, but I’ve never liked this answer. It is a perfect encapsulation of the elitism in the SO community: if you’re new, your questions are closed and your answers are edited and downvoted. Meanwhile this is tolerated only because it’s posted by a member with high rep and username recognition.
- 1718627440 13 hours ago
  
  I think this answer was tolerated when SO wasn't as bad as it is now, and wouldn't be tolerated now from anyone.
  
- throwaway_61235 12 hours ago
  
  As someone who used to write custom crawlers 20 years ago, I can confirm that regular expressions worked great. All my crawlers were custom designed for a page and the sites were mostly generated by some CMS and had consistent HTML. I don't remember having to do many bug fixes that were related to regular expression issues.
  I don't suggest writing generic HTML parsers that work with any site, but for custom crawlers they work great.
  Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.
  
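
The custom-crawler case described above can be sketched roughly like this. The page markup, class names, and `POST_RE` pattern are all hypothetical, standing in for a CMS template whose output is consistent and known in advance; nothing here is meant to survive arbitrary HTML.

```python
import re

# Hypothetical CMS output: every post is rendered by the same template.
PAGE = """
<div class="post"><h2 class="title">First post</h2><span class="date">2005-03-01</span></div>
<div class="post"><h2 class="title">Second post</h2><span class="date">2005-03-09</span></div>
"""

# Pattern tied to that one template; it breaks if the markup changes.
POST_RE = re.compile(
    r'<h2 class="title">(?P<title>[^<]*)</h2>'
    r'<span class="date">(?P<date>\d{4}-\d{2}-\d{2})</span>'
)


def extract_posts(page):
    """Return (title, date) pairs for every post the template pattern matches."""
    return [(m["title"], m["date"]) for m in POST_RE.finditer(page)]


if __name__ == "__main__":
    for title, date in extract_posts(PAGE):
        print(date, title)
```

The trade-off is the one the thread keeps circling: this is cheap and reliable for one known template, and it fails silently the moment the site changes its markup.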
Cthulhu_ 13 hours ago

HE COMES
umanwizard 10 hours ago
Funny how differently people can perceive things. That's my least favorite SO answer of all time, and I cringe every time I see it.
It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.
The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".
- chucksmash 10 hours ago
  
  There's nothing that brings joy into this world quite like the guy waiting around to tell people he doesn't like the thing they like.
- philistine 10 hours ago
  
  The whole argument hinges on one word in your post: arbitrary.
  I parse HTML that I produce myself, in a context where I fully control the output. It works fine, but parsing other people’s HTML is a lesson in humility. I’ve also done that, but I did it as a one-time thing. I parsed a specific point in time, refusing to change that at any point.
  

ErigmolCt 10 hours ago

And because the output still looks mostly readable, nobody questions it until years later when it's suddenly evidence in front of Congress

V__ 14 hours ago

They have top men working on it right now.