Comment by throwup238

1 day ago

I really hope they have, because I've also been experimenting with LLMs to automate searching through old archival handwritten documents. I'm interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow, reading 16th-century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and insider knowledge of the field. It doesn't help that the accounts were often written in the field by semi-literate people who misused lots of words. Even the simplest ones require quite a lot of detective work to decipher, with subtle signals like that pound sign for the sugar loaf.

> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts.

This I'm a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What's the kernel look like?

>I’m interested in the Conquistadors and their extensive accounts of their expeditions, but holy cow reading 16th century handwritten Spanish and translating it at the same time is a nightmare, requiring a ton of expertise and inside field knowledge

Completely off topic, but out of curiosity, where are you reading these documents? As a Spaniard I’m kinda interested.

  • I use the Portal de Archivos Españoles [1] for Spanish colonial documents. Each country has its own archive, but the Spanish one has the most content (35 million digitized pages).

    The hard part is knowing where to look, since most of the images haven't gone through HTR/OCR or indexing, so you have to understand Spanish colonial administration and go through the collections to find stuff.

    [1] https://pares.cultura.gob.es/pares/en/inicio.html
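
    As a rough sketch of what that kind of LLM experimentation can look like (my own illustration, not a tool PARES provides): a vision-capable model can take a raw scan and attempt transcription plus translation in one pass. This assumes the Anthropic Python SDK; the file name, model ID, and prompt are hypothetical, and real use needs heavy checking against expert knowledge:

    ```python
    # Hedged sketch: ask a vision-capable LLM to transcribe an archival scan.
    # The file name, model ID, and prompt are hypothetical placeholders.
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("pares_folio_12r.jpg", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_data}},
                {"type": "text",
                 "text": "Transcribe this 16th-century Spanish manuscript "
                         "page as faithfully as you can, then translate it "
                         "to English. Flag every word you are unsure of."},
            ],
        }],
    )
    print(message.content[0].text)
    ```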

You are right to be skeptical.

There are plenty of so-called Windows (or other) web 'OS' clones.

A couple of these were actually posted on HN this very year.

Here is one example I googled that was also on HN: https://news.ycombinator.com/item?id=44088777

This is not an OS in the sense of emulating a kernel in JavaScript or WASM; it is a web app that looks like the desktop of an OS.

I have seen plenty of such projects; some mimic the Windows UI entirely. You can find them via Google.

So this was definitely in the training data, and it is not as impressive as the blog post or the Twitter thread make it out to be.

The scary thing is that the replies in the Twitter thread show no critical thinking at all and are impressed beyond belief; they think it coded a whole kernel and OS, made an interpreter for it, ported games, etc.

I think this is the reason why some people are so impressed by AI: when you can only judge an app visually, or only by how you interact with it, and you don't have the depth of knowledge to understand it, it works all the way, and AI seems magical beyond comprehension.

But all this is only superficial IMHO.

  • Every time a model is about to be released, a bunch of these hype accounts spin up. I don't know whether they get paid or spring up organically to farm engagement. The last times there was such hype for a model were "strawberry" (o1) and then GPT-5, and both turned out to be meaningful improvements but nowhere near the hype.

    I don't doubt though that new models will be very good at frontend webdev. In fact this is explicitly one of the recent lmarena tasks so all the labs have probably been optimizing for it.

    • My guess is that there are insiders who know about the models and can’t keep their mouths shut. They like being on the inside and leaking.

      1 reply →

  • It's always amusing when "an app like Windows XP" is somehow considered hard or challenging.

    It's literally the most basic HTML/CSS; I'm not sure why it is even included in benchmarks.

    • While it is obviously much easier than creating a real OS, some people have created desktop-manager web apps, with resizable and movable windows and apps such as terminals, notepads, file explorers, etc.

      This is still a challenging task and requires a lot of work to get that far.

    • Those things are LLMs, with text and language at the core of their capabilities. UIs are, notably, not text.

      An LLM being able to build up interfaces that look recognizably like a UI from a real OS? That sure suggests a degree of multimodal understanding.

      1 reply →

My language does not use the Latin alphabet, but it is still written with separate letters. Is there a way to train handwriting recognition on my own handwriting in my own language, such that it will be effective and useful? I mostly need to recognize text in PDF documents generated by writing on an e-ink tablet with an EMR stylus.
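
A hedged sketch of one possible approach (an assumption on my part, not something confirmed in this thread): fine-tune a TrOCR-style model from Hugging Face transformers on line images of your own handwriting paired with transcriptions. Note that the base checkpoint's tokenizer targets English, so a non-Latin script would also need a decoder/tokenizer that covers your alphabet:

```python
# Hedged sketch: fine-tune TrOCR on self-labeled handwriting lines.
# File names and texts are hypothetical; a non-Latin script needs a
# decoder/tokenizer covering your alphabet (the base one targets English).
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# (line image, transcription) pairs you label yourself, e.g. lines cropped
# from the PDFs your e-ink tablet exports
samples = [
    ("line_0001.png", "first transcribed line"),
    ("line_0002.png", "second transcribed line"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for path, text in samples:
    pixel_values = processor(
        images=Image.open(path).convert("RGB"), return_tensors="pt"
    ).pixel_values
    labels = processor.tokenizer(text, return_tensors="pt").input_ids
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```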

> This I'm a lot more skeptical of. The linked Twitter post just looks like something it would replicate via HTML/CSS/JS. What's the kernel look like?

Thanks for this, I was almost convinced and about to re-think my entire perspective and experience with LLMs.

I'd love to find more info on this, but from what I can find it seems to be making webpages that look like those products, and seemingly it can "run Python" or "emulate a game". But writing something that, based on all of GitHub, can approximate an iPhone or an emulator in JavaScript/CSS/HTML is very, very different from writing an OS.

> What's the kernel look like?

Those clones are all HTML/CSS; the same goes for the game clones made by Gemini.

Oh! That's a nice use-case and not too far from stuff I have been playing with! (happily I do not have to deal with handwriting, just bad scans of older newspapers and texts)

I can vouch for the fact that LLMs are great at searching in the original language, summarizing key points to let you know whether a document might be of interest, then providing you with a translation where you need one.

The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

  • > The fun part has been building tools to turn Claude Code and Codex CLI into capable research assistants for that type of project.

    What does that look like? How well does it work?

    I ended up writing a research TUI with my own higher level orchestration (basically have the thing keep working in a loop until a budget has been reached) and document extraction.
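
    That budget-capped loop can be surprisingly small. A minimal sketch, where `call_llm` and `run_tool` are hypothetical wrappers around whatever LLM API and document tools are in play:

    ```python
    # Hedged sketch of a budget-capped agent loop; call_llm and run_tool
    # are hypothetical wrappers, not any real library's API.
    def research(task: str, budget: int = 50) -> str:
        messages = [{"role": "user", "content": task}]
        while budget > 0:
            reply = call_llm(messages)      # one model call against the budget
            budget -= 1
            messages.append(reply.message)
            if not reply.tool_calls:        # no tool use: this is the final report
                return reply.text
            for call in reply.tool_calls:   # search, fetch OCR, extract...
                messages.append({"role": "tool", "content": run_tool(call)})
        return "Budget exhausted; partial findings above."
    ```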

    • I started with a UI that sounds like it was built along the same lines as yours, which had the advantage of letting me enforce a pipeline and an exhaustive search (I don't want the 10 most promising documents, I want all of them).

      But I realized I was not using it much because it was too big and inflexible (plus I kept wanting to stamp out all the bugs, which I do not have time to do on a hobby project). So I ended up extracting it into MCPs (equipped to do full-text search and download OCR from the various databases I care about; see the sketch below) and AGENTS.md files (defining pipelines, as well as patterns for both search behavior and the reporting of results). I also put together a sub-agent for translation (cutting away all tools besides reading and writing files, and giving it some document-specific contextual information).

      That lets me use Claude Code and Codex CLI (which, anecdotally, I have found to be the better of the two for that kind of work; it seems to deal better with longer inputs produced by searches) as the driver, telling them what I am researching and maybe how I would structure the search, then letting them run in the background before checking their report and steering the search based on that.

      It is not perfect (if a search surfaces 300 promising documents, it will not check all of them, and it often misunderstands things due to lacking further context), but I now find myself reaching for it regularly, and I polish out problems one at a time. The next goal is to add more data sources and to maybe unify things further.
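
      For anyone curious what "extracting it into MCPs" can look like: here is a minimal sketch of a tool server using the official MCP Python SDK's FastMCP helper, where the archive backends (`search_pares`, `download_ocr_text`) are hypothetical placeholders:

      ```python
      # Hedged sketch of an MCP tool server (official MCP Python SDK).
      # search_pares / download_ocr_text are hypothetical backend calls.
      from mcp.server.fastmcp import FastMCP

      mcp = FastMCP("archive-search")

      @mcp.tool()
      def search_archive(query: str, max_results: int = 20) -> str:
          """Full-text search over a digitized archive."""
          results = search_pares(query)[:max_results]  # hypothetical backend
          return "\n".join(f"{r.ref}: {r.title}" for r in results)

      @mcp.tool()
      def fetch_ocr(document_ref: str) -> str:
          """Download the OCR text for one document reference."""
          return download_ocr_text(document_ref)  # hypothetical backend

      if __name__ == "__main__":
          mcp.run()  # Claude Code / Codex CLI attach to this over stdio
      ```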

      1 reply →

I'm skeptical that they're actually capable of making something novel. There are thousands of hobby operating systems and video game emulators on GitHub for it to train on, so it's not particularly surprising that it can copy somebody else's homework.

  • I remain confused but still somewhat interested as to a definition of "novel", given how often this idea is wielded in the AI context. How is everyone so good at identifying "novel"?

    For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads as "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

    • Generally, "novel" refers either to something that is new or to a certain type of literature. If the AI is generating something functionally equivalent to a program in its training set (in this case, dozens or even hundreds of such programs), then by definition it cannot be novel.

      33 replies →

    • If the model can map an unseen problem to something in its latent space, solve it there, map back, and deliver an ultimately correct solution, is that novel? Genuine question; "novel" doesn't seem to have a universally accepted definition here.

      1 reply →

      > For example, I can't wrap my head around how a) a human could come up with a piece of writing that inarguably reads as "novel" writing, while b) an AI could be guaranteed to not be able to do the same, under the same standard.

      The secret ingredient is the world outside, and past experiences of the world, which are unique to each human. We stumble onto novelty in the environment. But AI can do that too: AlphaGo's move 37 is an example; much stumbling around leads to discoveries even for AI. The environment is the key.

    • A system of humans creates bona fide novel writing. We don't know which human is responsible for the novelty in homoerotic fanfiction of the Odyssey, but it wasn't a lizard. LLMs don't have this system-of-thinkers bootstrapping effect yet, or if they do, it requires an absolutely enormous boost to get going.

    • Because we know that the human has only read, say, fifty books since they were born and watched a few thousand videos, and there is nothing in them that resembles what they wrote.

  • Doing something novel is incredibly difficult through LLM work alone. Dreaming, hallucinating, might eventually make novelty possible, but it has to be backed up by rock-solid base work. We aren't there yet.

    The working memory it holds is still extremely small compared to what we would need for regular open-ended tasks.

    Yes, there are outliers, and I'm not being specific enough, but I can't type that much right now.

  • I believe they can create a novel instance of a system from a sufficient number of relevant references - i.e. implement a set of already-known features without (much) code duplication. LLMs are certainly capable of this level of generalization due to their huge non-relevant reference set. Whether they can expand beyond that into something truly novel from a feature/functionality standpoint is a whole other, and less well-defined, question. I tend to agree that they are closed systems relative to their corpus. But then, aren't we? I feel like the aperture for true novelty to enter is vanishingly small, and cultures put a premium on it vis-a-vis the arts, technological innovation, etc. Almost every human endeavor is just copying and iterating on prior examples.

    • Almost all of the work in making a new operating system or a Game Boy emulator or something is in characterizing the problem space and defining the solution. How do you know what such-and-such instruction does? What is the ideal way to handle this memory structure here? You know, knowledge you gain from spending time tracking down a specific bug or optimizing a subroutine.

      When I create something, it's an exploratory process. I don't just guess what I am going to do based on my previous step and hope it comes out good on the first try. Let's say I decide to make a car with 5 wheels. I would go through several chassis designs and different engine configurations until I eventually had something that worked well. Maybe some are too weak, some too expensive, some too complicated. Maybe some prototypes get to the physical testing stage while others don't. Finally, I publish this design for other people to work on.

      If you ask the LLM to work on a novel concept it hasn't been trained on, it will usually spit out some nonsense that either doesn't work or works poorly, or it will refuse to provide a specific enough solution. If it has been trained on previous work, it will spit out something that looks similar to the solved problem in its training set.

      These AI systems don't undergo the process of trial and error that would suggest they are creating something novel. Their process of creation is not reactive to the environment; they are just cribbing off of extant solutions they've been trained on.

      1 reply →

    • Here's a thought experiment: if modern machine learning systems existed in the early 20th century, would they have been able to produce an equivalent to the theory of relativity? How about advance our understanding of the universe? Teach us about flight dynamics and take us into space? Invent the Turing machine, Von Neumann architecture, transistors?

      If yes, why aren't we seeing glimpses of such genius today? If we've truly invented artificial intelligence, and on our way to super and general intelligence, why aren't we seeing breakthroughs in all fields of science? Why are state of the art applications of this technology based on pattern recognition and applied statistics?

      Can we explain this by saying that we're only a few years into it, and that it's too early to expect fundamental breakthroughs? And that by 2027, or 2030, or surely by 2040, all of these things will suddenly materialize?

      I have my doubts.

      11 replies →

  • Of course they can come up with something novel. They're called hallucinations when they do, and those can't be in their training data, because they're not true / don't exist. Of course, when they do come up with totally novel hallucinations, suddenly being creative is a bad thing to be "fixed".

I'm surprised people didn't click through to the tweet.

https://x.com/chetaslua/status/1977936585522847768

> I asked it for windows web os as everyone asked me for it and the result is mind blowing , it even has python in terminal and we can play games and run code in it

And of course

> 3D design software, Nintendo emulators

No clue what these refer to, but to be honest it sounds like they've mostly made incremental improvements to one-shotting capabilities. I wouldn't be surprised if Gemini 2.5 Pro could get a Game Boy or NES emulator working well enough to boot Tetris or Mario; while it is a decent chunk of code to get things going, there's an absolute boatload of code on the Internet, and the complexity is lower than you might imagine. (I have written a couple of toy Game Boy emulators from scratch myself.)
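
To make the "complexity is lower than you might imagine" point concrete: the heart of a CPU emulator is a fetch-decode-execute loop over an opcode table. A deliberately toy, 6502-flavored sketch (real emulators add status flags, addressing modes, memory-mapped I/O, and cycle timing):

```python
# Toy 6502-flavored core: fetch-decode-execute over an opcode table.
memory = bytearray(0x10000)
pc, a = 0x8000, 0  # program counter, accumulator

def lda_imm():  # 0xA9: LDA #imm - load accumulator with the next byte
    global pc, a
    a = memory[pc]
    pc += 1

def nop():      # 0xEA: NOP - do nothing
    pass

OPCODES = {0xA9: lda_imm, 0xEA: nop}

def step():
    global pc
    opcode = memory[pc]
    pc += 1
    OPCODES[opcode]()

# tiny "program": LDA #$42, NOP
memory[0x8000:0x8003] = bytes([0xA9, 0x42, 0xEA])
step(); step()
print(hex(a))  # -> 0x42
```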

Don't get me wrong, it is pretty cool that a machine can do this. A lot of work people do today just isn't that novel and if we can find a way to tame AI models to make them trustworthy enough for some tasks it's going to be an easy sell to just throw AI models at certain problems they excel at. I'm sure it's already happening though I think it still mostly isn't happening for code at least in part due to the inherent difficulty of making AI work effectively in existing large codebases.

But I will say that people are a little crazy sometimes. Yes it is very fascinating that an LLM, which is essentially an extremely fancy token predictor, can one-shot a web app that is mostly correct, apparently without any feedback, like being able to actually run the application or even see editor errors, at least as far as we know. This is genuinely really impressive and interesting, and not the aspect that I think anyone seeks to downplay. However, consider this: even as relatively simple as an NES is compared to even moderately newer machines, to make an NES emulator you have to know how an NES works and even have strategies for how to emulate it, which don't necessarily follow from just reading specifications or even NES program disassembly. The existence of many toy NES emulators and a very large amount of documentation for the NES hardware and inner workings on the Internet, as well as the 6502, means that LLMs have a lot of training data to help them out.

I think that these tasks, which are extremely well covered in the training data, give people unrealistic expectations. You could probably pick a simpler machine that an LLM would do significantly worse at, even though a human who knows how to write emulation software could definitely handle it. Not sure what to pick, but let's say SEGA's VMU for the Dreamcast: a very small, simple device, and I reckon there should be information about it online, but it's going to be somewhat limited. You might think, "But that's not fair. It's unlikely to be able to one-shot something like that without mistakes with so much less training data on the subject." Exactly. In the real world, that comes up. Not always, but often. If it didn't, programming would be an incredibly boring job. (For some people, it is, and these LLMs will probably be disrupting that...) That's not to say that AI models can never do things like debug an emulator or even do reverse engineering on their own, but it's increasingly clear that this won't emerge from strapping agents on top of transformers predicting tokens. But since there is a very large portion of work in the world that is not very novel, I can totally understand why everyone is trying to squeeze this model as far as it goes. Gemini and Claude are shockingly competent.

I believe many of the reasons people scoff at AI are fairly valid, even if they don't always come from a rational mindset, and I try to keep my usage of AI relatively tasteful. I don't like AI art, and I personally don't like AI code. I find the push to put AI in everything incredibly annoying, and I worry about the clearly circular AI market and overhyped expectations. I dislike the way AI training has ripped up the Internet, violated people's trust, and led to a more closed Internet. I dislike that sites like Reddit are capitalizing on all of the user-generated content that made them rich in the first place, just to crap on the users who submitted it.

But I think that LLMs are useful, and useful LLMs could definitely be created ethically; it's just that the current AI race has everyone freaking the fuck out. I continue to explore use cases. I find that LLMs have gotten increasingly good at analyzing disassembly, though it varies depending on how well covered the machine is in their training data. I've also found that LLMs can one-shot useful utilities and do a decent job. I had an LLM one-shot a utility to dump the structure of a simple, common file format so I could debug something... It probably only saved me about 15-30 minutes, but still, in that case I truly believe it did save me time, as I didn't spend any time tweaking the result; it did compile, and it did work correctly.

It's going to be troublesome to truly measure how good AI is. If you knew nothing about writing emulators, being able to synthesize an NES emulator that can at least boot a game may seem unbelievable, and to be sure it is obviously a stunning accomplishment from the PoV of scaling up LLMs. But what we're seeing is probably more a reflection of very good knowledge than of very good intelligence. If we didn't have much written online about the NES or emulators at all, it would be truly world-bending to have an AI model figure out everything it needs to know to write one on the fly. Humans can actually do stuff like that, which we know because humans had to do stuff like that. Today, I reckon most people rarely get the chance to show that they are capable of novel thought, because there are so many other humans who had to do the novel thinking before them. Being able to do novel thinking effectively when needed is currently still a big gap between humans and AI, among others.

  • I think Google is going to repeat history with Gemini, in the sense that ChatGPT, Grok, etc. will be like AltaVista, Lycos, etc.

I'm skeptical because my entire identity is basically built around being a software engineer and thinking my IQ and intelligence is higher than other people. If this AI stuff is real then it basically destroys my entire identity so I choose the most convenient conclusion.

Basically we all know that AI is just a stochastic parrot autocomplete. That's all it is. Anyone who doesn't agree with me is of lesser intelligence, and I feel the need to inform them of things that are obvious: AI is not a human, and it does not have emotions. It's just a search engine. Those people who are using AI to code and do things that are indistinguishable from human reasoning are liars. I choose to focus on what AI gets wrong, like hallucinations, while ignoring the things it gets right.

"> Whatever it is, users have reported some truly wild things: it codes fully functioning Windows and Apple OS clones, 3D design software, Nintendo emulators, and productivity suites from single prompts."

Wow I'm doing it way wrong. How do I get the good stuff?

  • You're not.

    I want you to go into the kitchen and bake a cake. Please replace all the flour with baking soda. If it comes out looking limp and lifeless, just decorate it with extra layers of frosting.

    You can make something that looks like a cake but would not be good to eat.

    The cake, sometimes, is a lie. And in this case, so, most likely, are these results... or they are the actual source code of some other project, just regurgitated.

    • We got the results back. You are a horrible person. I’m serious, that’s what it says: “Horrible person.”

      We weren’t even testing for that.

      11 replies →