Comment by pera

23 days ago

Has anyone tried to rewrite some popular open source project with AI? I imagine modern LLMs can be very effective at license-washing/plagiarizing dependencies. It could be an interesting new benchmark too.

19 comments


gorkaerana 23 days ago

I think it's fair enough to consider porting a subset of rewriting, in which case there are several successful experiments out there:

- JustHTML [1], which in practice [2] is a port of html5ever [3] to Python.

- justjshtml, which is a port of JustHTML to JavaScript :D [4].

- MiniJinja [5] was recently ported to Go [6].

All three projects have one thing in common: comprehensive test suites which were used to guardrail and guide AI.

References:

1. https://github.com/EmilStenstrom/justhtml

2. https://friendlybit.com/python/writing-justhtml-with-coding-...

3. https://github.com/servo/html5ever

4. https://simonwillison.net/2025/Dec/15/porting-justhtml/

5. https://github.com/mitsuhiko/minijinja

6. https://lucumr.pocoo.org/2026/1/14/minijinja-go-port/
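
To make the "test suite as guardrail" point above concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the cases, `run_suite`, `reference_escape`, and `ported_escape` are not taken from JustHTML, html5ever, or MiniJinja. It only shows the general idea of running one shared, implementation-independent set of cases against both an original and a generated port.

```python
import html

# Hypothetical language-neutral cases: (input, expected output).
# A real port would load these from the original project's test fixtures.
CASES = [
    ("a < b", "a &lt; b"),
    ("Tom & Jerry", "Tom &amp; Jerry"),
    ("plain", "plain"),
]


def run_suite(implementation, cases):
    """Run every case against a candidate and return the list of failures."""
    failures = []
    for source, expected in cases:
        actual = implementation(source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures


def reference_escape(text):
    # Stand-in for the original implementation (assumption: stdlib html.escape).
    return html.escape(text, quote=False)


def ported_escape(text):
    # Stand-in for a freshly generated port, with a deliberate bug:
    # "&" is never escaped.
    return text.replace("<", "&lt;").replace(">", "&gt;")


reference_failures = run_suite(reference_escape, CASES)
port_failures = run_suite(ported_escape, CASES)
print(len(reference_failures), len(port_failures))
```

The failure list is what would be fed back to the model (or the human) on each iteration. The suite can only catch what its cases cover, which is why the size and quality of the shared test suite matters so much for this kind of port.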

EmilStenstrom 23 days ago

As the author, it's a stretch to say that JustHTML is a port of html5ever. While you're right that this was part of the initial prompt, the code is very different, which is typically not what counts as a "port". Your mileage may vary.
daxfohl 23 days ago
Interesting, IIUC the transformer architecture / attention mechanism were initially designed for use in the language translation domain. Maybe after peeling back a few layers, that's still all they're really doing.
- nathan_compton 23 days ago
  
  This has long been how I have explained LLMs to non-technical people: text transformation engines. To some extent, many common, tedious activities basically constitute a transformation of text from one well-known form into another (even some kinds of reasoning are this), and so LLMs are very useful. But they just transform text between well-known forms.
  
MrJohz 23 days ago

Note that it's not clear that any of the JustHTML ports were actually ports per se, as in the end they all ended up with very different implementations. Instead, it might just be that an LLM generated roughly the same library several different times.
See https://felix.dognebula.com/art/html-parsers-in-portland.htm...
DonHopkins 23 days ago

More vibe coded browser modules:
V8 => H8 - JavaScript engine that hates code, misunderstands equality, sponsored by Brendan Eich and "Yes on Prop H8".
Expat => Vexpat - An annoying, irritating rewrite of an XML parser.
libxml2 => libxmlpoo - XML parsing, same quality as the spec.
libxslt => libxsalt - XSLT transforms with extra salt in the wound.
Protobuf => Probabuf - Probably serializes correctly, probably not, fuzzy logic.
Cap'n Proto => Crap'n Proto - Zero-copy, zero quality.
cURL => cHURL - Throws requests violently serverward, projectile URLemitting.
SDL => STD - Sexually Transmitted Dependency. It never leaves and spreads bugs to everything you touch.
Servo => Swervo - Drunk, wobbly layout that can't stay on the road.
WebKit => WebShite - British pronunciation, British quality control.
Blink => Blinkered - Only renders pages it agrees with politically.
Taffy => Daffy - Duck typed Flexbox layout that's completely unhinged. "You're dethpicable!"
html5ever => html5never - Servo's HTML parser that never finishes tokenizing.
Skia => SkAI - AI-generated graphics that hallucinates extra pixels and fingers.
FreeType => FreeTypo - Introduces typos during keming and rasterization.
Firefox => Foxfire - Burns through your battery in 12 minutes, while molesting children.
WebGL => WebGLitch - Shader compilation errors as art.
WebGPU => WebGPUke - Makes your GPU physically ill.
SQLite => SQLHeavy - Embedded database, 400MB per query.
Vulkan => Vulcan't - Low-level graphics that can't.
Clang => Clanger - Drops errors loudly at runtime.
libevent => liebevent - Event library that lies about readiness.
Opus => Oops - Audio codec, "oops, your audio's gone."
All modules now available on GitPub:
GitHub => GitPub - Microsoft's vibe control system optimized for the Ballmer Peak. Commit quality peaks at 0.129% BAC, mass reverts at 0.15%.

benhoyt 23 days ago

Not me personally, but a GitHub user wrote a replacement for Go's regexp library that was "up to 3-3000x+ faster than stdlib": https://github.com/coregx/coregex ... at first I was impressed, so started testing it and reporting bugs, but as soon as I ran my own benchmarks, it all fell apart (https://github.com/coregx/coregex/issues/29). After some mostly-bot updates, that issue was closed. But someone else opened a very similar one recently (https://github.com/coregx/coregex/issues/79) -- same deal, "actually, it's slower than the stdlib in my tests". Basically AI slop with poor tests, poor benchmarks, and way oversold. How he's positioning these projects is the problematic bit, I reckon, not the use of AI.

Same user did a similar thing by creating an AWK interpreter written in Go using LLMs: https://github.com/kolkov/uawk -- as the creator of (I think?) the only AWK interpreter written in Go (https://github.com/benhoyt/goawk), I was curious. It turns out that if there's only one item in the training data (GoAWK), AI likes to copy and paste freely from the original. But again, it's poorly tested and poorly benchmarked.

I just don't see how one can get quality like this without being realistic about code review, testing, and benchmarking.

dragonwriter 23 days ago
> up to 3-3000x+ faster than stdlib
Note that this is semantically exactly equivalent to "up to 3000x faster than stdlib" and doesn't claim any particular speedup, since "up to" denotes an upper bound, not a lower bound or expected value. It’s standard misleading-but-not-technically-false marketing language that creates a false impression, because people tend to focus on the number and ignore the "up to".
- Dylan16807 23 days ago
  
  When you say "up to" about a list of data points, it's not just a bound. At least one has to reach that amount or it's a lie.
- arcticbull 23 days ago
  
  With the “up to 3-3000x+” language the plus leaves us with the entire number line.
- supriyo-biswas 23 days ago
  
  Reminds me of https://xkcd.com/870/
- nkrisc 23 days ago
  
  Saying “up to” means that bound is the maximum value of the data set. It may be far from the median value, but it is included (or you’re lying). With any other interpretation the phrase has no meaning whatsoever.
  
- DonHopkins 23 days ago
  
  3000x Faster Optimized Random Number Generator: https://xkcd.com/221/
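
The "up to" disagreement above can be illustrated with a small Python sketch. The speedup numbers are made up for illustration and are not measurements of coregex or Go's stdlib; `summarize` is a hypothetical helper. It shows how a headline "up to" figure (the maximum) can be true while the median case shows only a modest gain and some cases are slower.

```python
from statistics import median

# Hypothetical per-benchmark speedup ratios (baseline time / new time).
# Values below 1.0 mean the new library is slower than the baseline.
speedups = [0.5, 0.75, 1.0, 1.5, 3.0, 3000.0]


def summarize(ratios):
    """Return the headline "up to" figure next to more representative numbers."""
    return {
        "up_to": max(ratios),
        "median": median(ratios),
        "slower_count": sum(1 for r in ratios if r < 1.0),
    }


summary = summarize(speedups)
print(summary)
```

With this made-up data, "up to 3000x faster" is technically accurate because one data point reaches it, yet the median speedup is 1.25x and two of the six benchmarks are slower than the baseline.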
AlexeyBelov 22 days ago

Oh yeah, I recognize this guy. The author of most commits in coregex posted his vibecoded projects to Reddit.
I've looked at his other repos and it's the same shit. The responses are also quite funny. Does he not realize this reads like the worst of AI?
CuriouslyC 23 days ago

To be fair, good benchmarking is hard; most people get it wrong. Scientific training helps.

hedgehog 23 days ago

I used one of the assistants to reverse-engineer a browser-hosted JS game-like app and rewrite it as a desktop Rust app. It required a lot of steering, but it was pretty useful.