It would be interesting to train a cutting-edge model with a cutoff date of, say, 1900 and then prompt it about QM and relativity with some added context.
If the model comes up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger; if not, I think it is time to go back to the drawing board.
You would find things in there that were already close to QM and relativity. The Michelson-Morley experiment was 1887, and the FitzGerald–Lorentz contraction hypothesis came along in 1889. The photoelectric effect (which Einstein explained in terms of photons in 1905) was also discovered in 1887. William Clifford (who _died_ in 1879) had notions that foreshadowed general relativity: "Riemann, and more specifically Clifford, conjectured that forces and matter might be local irregularities in the curvature of space, and in this they were strikingly prophetic, though for their pains they were dismissed at the time as visionaries." - Banesh Hoffmann (1973)
Things don't happen all of a sudden, and with the ability to see all the scientific papers of the era, it's possible those ideas could have fallen out of the synthesis.
I presume that's what the parent post is trying to get at? Seeing if, given the cutting-edge scientific knowledge of the day, the LLM is able to synthesize it all into a workable theory of QM by making the necessary connections and (quantum...) leaps.
It's only easy to see precursors in hindsight. The Michelson-Morley tale is a great example of this. In hindsight, their experiment was screaming relativity, because it demonstrated that the speed of light was identical from two perspectives, which is very difficult to explain without relativity. Lorentz contraction was just a completely ad-hoc proposal to maintain the assumptions of the time (the luminiferous aether in particular) while also explaining the result. But in general it was not seen as that big of a deal.
There's a very similar parallel with dark matter in modern times. We certainly have endless hints to the truth that will be evident in hindsight, but for now? We are mostly convinced that we know the truth, perform experiments to prove it, find nothing, shrug, adjust the model to be even more esoteric, and repeat with the next one. And maybe one will eventually show something, or maybe we're on the wrong path altogether. This quote from Michelson in 1894 (more than a decade before Einstein would come along) is extremely telling of the opinion at the time:
"While it is never safe to affirm that the future of Physical Science has no marvels in store even more astonishing than those of the past, it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice. It is here that the science of measurement shows its importance — where quantitative work is more to be desired than qualitative work. An eminent physicist remarked that the future truths of physical science are to be looked for in the sixth place of decimals." - Michelson 1894
This would still be valuable even if the LLM only finds out about things that are already in the air.
It’s probably even more of a problem that different areas of scientific development don’t know about each other. An LLM combining results would still not be the same as inventing something new.
But if they could give us a head start of 20 years on certain developments this would be an awesome result.
Then that experiment is even more interesting, and should be done.
My own prediction is that the LLMs would totally fail at connecting the dots, but a small group of very smart humans could.
Things don't happen all of a sudden, but they also don't happen everywhere. Most people in most parts of the world would never connect the dots. Scientific curiosity is something valuable and fragile, that we just take for granted.
If (as you seem to be suggesting) relativity was effectively lying there on the table waiting for Einstein to just pick it up, how come it blindsided most, if not quite all, of the greatest minds of his generation?
With LLMs the synthesis cycles could happen at a much higher frequency. Decades condensed to weeks or days?
I imagine possible buffers on that conjecture synthesis being experimentation and acceptance by the scientific community. AIs can come up with new ideas every day, but Nature won't publish those ideas for years.
I'm trying to work towards that goal by training a model on mostly German science texts up to 1904 (before the world wars, German was the lingua franca of most sciences).
Training data for a base model isn't that hard to come by, even though you have to OCR most of it yourself because the publicly available OCRed versions are commonly unusably bad. But training a model large enough to be useful is a major issue. Training a 700M parameter model at home is very doable (and is what this TimeCapsuleLLM is), but to get that kind of reasoning you need something closer to a 70B model. Also a lot of the "smarts" of a model gets injected in fine tuning and RL, but any of the available fine tuning datasets would obviously contaminate the model with 2026 knowledge.
I am a historian and am putting together a grant application for a somewhat similar project (different era and language though). Would you be open to discussing a collaboration? My email is bebreen [at] ucsc [dot] edu.
Yann LeCun spoke explicitly on this idea recently and he asserts definitively that the LLM would not be able to add anything useful in that scenario. My understanding is that other AI researchers generally agree with him, and that it's mostly the hype beasts like Altman that think there is some "magic" in the weights that is actually intelligent. Their payday depends on it, so it is understandable. My opinion is that LeCun is probably correct.
There is some ability for it to make novel connections, but it's pretty small. You can see this yourself by having it build novel systems.
It largely cannot imagine anything beyond the usual, but there is a small part that can. This is similar to in-context learning: it's weak, but it is there.
It would be incredible if meta-learning/continual learning found a way to train exactly for novel learning paths. But that's literally AGI, so maybe 20 years from now? Or never..
You can see this on continual learning (CL) benchmarks. There is SOME signal, but it's crazy low. When I was training CL models I found that the signal was in the single percentage points. Some could easily argue it was zero, but I really do believe there is a very small amount in there.
This is also why any novel work or findings are produced via MASSIVE compute budgets. They find RL environments that can extract that small amount out. Is it random chance? Maybe, hard to say.
How about this for an evaluation: Have this (trained-on-older-corpus) LLM propose experiments. We "play the role of nature" and inform it of the results of the experiments. It can then try to deduce the natural laws.
If we did this (to a good enough level of detail), would it be able to derive relativity? How large of an AI model would it have to be to successfully derive relativity (if it only had access to everything published up to 1904)?
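For concreteness, here is a rough sketch of the loop I have in mind, in Python. Everything in it is hypothetical: `historical_llm` stands for a chat-capable model trained only on pre-1904 text (which doesn't exist yet), and `run_oracle` is just us, or a simulator, answering as nature would.

    # Hypothetical sketch of the "play the role of nature" evaluation loop.
    # Neither historical_llm nor run_oracle exists in this project; they are
    # placeholders for a pre-1904 model and a human/simulator oracle.

    def run_oracle(experiment_description: str) -> str:
        """A human (or simulator) reports the outcome nature would give."""
        return input(f"Proposed experiment:\n{experiment_description}\nObserved result: ")

    def discovery_loop(historical_llm, max_rounds: int = 20) -> str:
        notes = "Known physics as of 1904.\n"
        theory = ""
        for _ in range(max_rounds):
            experiment = historical_llm.generate(
                notes + "\nPropose one decisive experiment and its expected outcome:"
            )
            result = run_oracle(experiment)
            notes += f"\nExperiment: {experiment}\nResult: {result}\n"
            theory = historical_llm.generate(
                notes + "\nState the simplest law consistent with all results so far:"
            )
            notes += f"Current best theory: {theory}\n"
        return theory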
Preface: Most of my understanding of how LLMs actually work comes from 3blue1brown's videos, so I could easily be wrong here.
I mostly agree with you, especially about distrusting the self-interested hype beasts.
While I don't think the models are actually "intelligent", I also wonder if there are insights to be gained by looking at how concepts get encoded by the models. It's not really that the models will add something "new", but more that there might be connections between things that we haven't noticed, especially because academic disciplines are so insular these days.
Do you have a pointer to where LeCun spoke about it? I noticed last October that Dwarkesh mentioned the idea off handedly on his podcast (prompting me to write up https://manifold.markets/MikeLinksvayer/llm-trained-on-data-...) but I wonder if this idea has been around for much longer, or is just so obvious that lots of people are independently coming up with it (parent to this comment being yet another)?
What do they (or you) have to say about move 78 of the Lee Sedol–AlphaGo match? It seems like that was "new knowledge." Are games just iterable and the real-world idea space not? I am playing with these ideas a little.
Chemistry would be a great space to explore. The last quarter of the 19th century had a ton of advancements in chemistry. It'd be interesting to see if an LLM could propose fruitful hypotheses or make predictions about the science of thermodynamics.
Yes, but... aren't human researchers doing the same? They are mostly wrong most of the time, and try again, and verify their work again, until they find something that actually works.
What I mean is that this "in hindsight" test would be biased by being in hindsight: because we already know the answer, we would discard the LLM answer as just randomly generated. But "connecting the dots" is basically doing a lot of trial and error in your mind, emitting only the results that make at least some kind of sense to us.
A rigorous approach to predicting the future of text was proposed by Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861) and I think that work should get more recognition.
They measure compression (perplexity) on future Wikipedia, news articles, code, arXiv papers, and multi-modal data. Data compression is intimately connected with robustness and generalization.
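For anyone who wants to try the idea at small scale, here is a minimal sketch, assuming a Hugging Face-style causal LM (the model id below is a placeholder, not a real checkpoint), that scores post-cutoff text in bits per byte — the compression-style metric the paper argues tracks generalization.

    # Score a causal LM on text written after its training cutoff and report
    # bits per byte (lower = better compression/generalization).
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "some-org/time-capped-model"  # hypothetical checkpoint
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()

    def bits_per_byte(text: str) -> float:
        enc = tok(text, return_tensors="pt")
        with torch.no_grad():
            # labels=input_ids gives the mean cross-entropy (nats/token)
            loss = model(**enc, labels=enc["input_ids"]).loss.item()
        n_tokens = enc["input_ids"].shape[1]
        total_bits = loss * n_tokens / math.log(2)  # nats -> bits
        return total_bits / len(text.encode("utf-8"))

    print(bits_per_byte("Text published after the model's cutoff date..."))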
Thanks for the paper, I just read it and loved the approach. I hope the concept of using data compression as a benchmark will take off. In a sense it is kind of similar to the maxim "If you cannot explain something in simple terms, you do not understand it fully".
Yes, I don't understand how such an experiment could work. You either:
A). contaminate the model with your own knowledge of relativity, leading it on to "discover" what you know, or
B). you will try to simulate a blind operation but without the "competent human physicist knowledgeable up to the 1900 scientific frontier" component prompting the LLM, because no such person is alive today nor can you simulate them (if you could, then by definition you could use that simulated Einstein to discover relativity, so the problem is moot).
So in both cases you would prove nothing about what a smart and knowledgeable scientist can achieve today from a frontier LLM.
We've thought of doing this sort of exercise at work but mostly hit the wall of data becoming a lot more scarce the further back in time we go. Particularly high-quality science data - even going pre-1970 (and that's already a stretch) you lose a lot of information. There's a triple whammy of data still existing, being accessible in any format, and that format being suitable for training an LLM. Then there are the complications of wanting additional model capabilities that won't leak data causally.
I was wondering this. What is the minimum amount of text an LLM needs to be coherent? Fun as this idea is, the samples of its responses are basically babbling nonsense. Going further, a lot of what makes LLMs so strong isn't their original training data but the RLHF done afterwards, and RLHF would be very difficult in this case.
> If the model comes up with anything even remotely correct, it would be quite strong evidence that LLMs are a path to something bigger; if not, I think it is time to go back to the drawing board.
In principle I see your point; in practice, my default assumption until proven otherwise is that a little something post-1900 slipped through.
A much easier approach would be to just download some model, whatever model, today. Then, five years from now, for whatever interesting discoveries have been made, see whether the model can get there.
Not really, QM and Relativity were chosen because they were theories that were created to fit observations and data. Discoveries over the next 5 years will be trivia rather than logical conclusions.
I think it would be fun to see if an LLM would reframe some scientific terms from the time in a way that would actually fit in our current theories.
I imagine if you explained quantum field theory to a 19th-century scientist, they might think of it as a more refined understanding of the luminiferous aether.
Or if an 18th century scholar learned about positive and negative ions, it could be seen as an expansion/correction of phlogiston theory.
Could be an interesting experiment, but it's not conclusive proof one way or another. So much of what makes LLMs so great today (vs GPT-3.5) would not be in that dataset. The training to turn these models into coding savants has generalized to other areas, just as one example.
Looking at the training data, I don't think it will know anything.[0]
I doubt On the Connexion of the Physical Sciences (1834) is going to have much about QM. While the cut-off is 1900, it seems many of the texts are much closer to 1800 than 1900.
[0] https://github.com/haykgrigo3/TimeCapsuleLLM/blob/main/Copy%...
I wonder about something similar - leeches. Apparently they really work to eat away only decaying tissue. Could they make a comeback in modern times with better weighting?
I like this idea. I think I'd like it more if we didn't have to prompt the LLM in the first place. If it just had all of this information and decided to act upon it. That's what the great minds of history (and even average minds like myself) do. Just think about the facts in our point of view and spontaneously reason something greater out of them.
That would be an interesting experiment. It might be more useful to make a model with a cut off close to when copyrights expire to be as modern as possible.
Then, we have a model that knows quite a bit in modern English. We also legally have a data set for everything it knows. Then, there's all kinds of experimentation or copyright-safe training strategies we can do.
Project Gutenberg up to the 1920's seems to be the safest bet on that.
I like this, it would be exciting (and scary) if it deduced QM, and informative if it cannot.
But I also think we can do this with normal LLMs trained on up-to-date text, by asking them to come up with any novel theory that fits the facts. It does not have to be a groundbreaking theory like QM, just original and not (yet) proven wrong?
I would love to ask such a model to summarise the handful of theories or theoretical “roads” being eyed at the time and to make a prediction with reasons as to which looks most promising. We might learn something about blind spots in human reasoning, institutions, and organisations that are applicable today in the “future”.
It's a base model. It hasn't been instruction tuned to "solve problems" necessarily. All it can do is attempt to complete text given some starting text.
Of course they are, because their primary goal is to be useful, and to be useful they need to always be relevant.
But considering that Special Relativity was published in 1905, which means all its building blocks were already floating in the ether by 1900, it would be a very interesting experiment to train something at Claude/Gemini scale and then, say, give it the field equations and ask it to build a theory around them.
I think it would raise some interesting questions, but if it did yield anything noteworthy, the biggest question would be why that LLM is capable of pioneering scientific advancements and none of the modern ones are.
I'm not sure what you'd call a "pioneering scientific advancement", but there is an increasing number of examples showing that LLMs can be used for research (with agents, particularly). A survey about this was published a few months ago: https://aclanthology.org/2025.emnlp-main.895.pdf
Or maybe, LLMs are pioneering scientific advancements - people are using LLMs to read papers, choose what problems to work on, come up with experiments, analyze results, and draft papers, etc., at this very moment. Except they eventually stick their human names on the cover so we almost never know.
The development of QM was so closely connected to experiments that it's highly unlikely, even despite some of the experiments having been performed prior to 1900.
It's already been done, without the model being aware of it, see https://arxiv.org/abs/2512.09742. They also made it think it was Hitler (not MechaHitler, the other guy), and other craziness.
It's a relief to think that we're not trusting these things for stuff like financial advice, medical advice, mental health counselling, ...
Mm. I'm a bit sceptical of the historical expertise of someone who thinks that "Who art Henry" is 19th century language. (It's not actually grammatically correct English from any century whatever: "art" is the second person singular, so this is like saying "who are Henry?")
"Who art Henry?" was never grammatical English. "Art" was the second person singular present form of "to be" and it was already archaic by the 17th century. "Who is Henry?" would be fine.
I guess chat-ability would require some chat-like data, so would that mean first coming up with a way to extract chat-like dialogue from the era and then use that to fine-tune the model?
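Something like this, maybe: a single hypothetical fine-tuning example in the common "messages" format, assembled from period sources (published correspondence, catechism-style dialogues) rather than modern instruction data. The wording below is invented purely for illustration.

    # One hypothetical chat-format fine-tuning example built from era text.
    example = {
        "messages": [
            {"role": "user",
             "content": "Pray, sir, what is the cause of the aurora borealis?"},
            {"role": "assistant",
             "content": "It is held by many natural philosophers to be an "
                        "electrical discharge in the upper air, though the "
                        "matter is not yet settled."},
        ]
    }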
Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
> Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
You'd have to be specific about what you mean by AGI: all three letters mean a different thing to different people, and sometimes the use of the whole means something not present in the letters.
> If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
To a limited degree.
Some developments can come from combining existing ideas and seeing what they imply.
Other things, like everything to do with relativity and quantum mechanics, would have required experiments. I don't think any of the relevant experiments had been done prior to this cut-off date, but I'm not absolutely sure of that.
You might be able to get such an LLM to develop all the maths and geometry for general relativity, and yet find the AI still tells you that the perihelion shift of Mercury is a sign of the planet Vulcan rather than of a curved spacetime: https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)
Well, they obviously can't. AGI is not science, it's religion. It has all the trappings of religion: prophets, sacred texts, an origin myth, an end-of-days myth and, most importantly, a means to escape death. Science? Well, the only measure of "general intelligence" would be to compare against the only one we know, the human one, and we have absolutely no means by which to describe it. We do not know where to start. This is why, when you scratch the surface of any AGI definition, you only find circular definitions.
And no, the "brain is a computer" is not a scientific description, it's a metaphor.
It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.
Ways data might leak into the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
There's also severe selection effects: what documents have been preserved, printed, and scanned because they turned out to be on the right track towards relativity?
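A crude first pass at checking for leaks might look like this: a regex sweep for terms that shouldn't exist before the cutoff. The term list is purely illustrative, and a real audit would also have to cover metadata, editorial footnotes, and OCR front matter.

    # Rough contamination check, not a guarantee: flag corpus files containing
    # terms that should not appear before the cutoff date.
    import re
    from pathlib import Path

    ANACHRONISMS = [
        r"\bphoton\b", r"\bquantum mechanics\b", r"\brelativity\b",
        r"\btelevision\b", r"\bairplane\b", r"\bWorld War\b",
    ]
    pattern = re.compile("|".join(ANACHRONISMS), re.IGNORECASE)

    def suspicious_files(corpus_dir: str):
        for path in Path(corpus_dir).rglob("*.txt"):
            text = path.read_text(errors="ignore")
            hits = pattern.findall(text)
            if hits:
                yield path, sorted(set(h.lower() for h in hits))

    for path, hits in suspicious_files("./corpus_pre1900"):
        print(path, hits)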
I think not, if only because the quantity of old data isn't enough to train anywhere near a SoTA model, at least until we change some fundamentals of LLM architecture.
I think this would be an awesome experiment. However you would effectively need to train something of a GPT-5.2 equivalent. So you need lot of text, a much larger parameterization (compared to nanoGPT and Phi-1.5), and the 1800s equivalents of supervised finetuning and reinforcement learning with human feedback.
This would be a true test of whether LLMs can innovate or just regurgitate. I think part of people's amazement at LLMs is that they don't realize how much they themselves don't know. So thinking and recalling look the same to the end user.
That is one of the reasons I want it done. We can't tell if AIs are parroting training data without having the whole training data. Making it old means specific things won't be in it (or will be). We can do more meaningful experiments.
The fact that tech leaders espouse the brilliance of LLMs and don't use this specific test method is infuriating to me. It is deeply unfortunate that there is little transparency or standardization of the datasets available for training/fine tuning.
Having this be advertised will make for more interesting and informative benchmarks. OEM models that are always "breaking" the benchmarks are doing so with improved datasets as well as improved methods. Without holding the datasets fixed, progress on benchmarks is very suspect, IMO.
LLMs have neither intelligence nor problem-solving ability (and I won't be relaxing the definition of either so that some AI bro can pretend a glorified chatbot is sentient).
You would, at best, be demonstrating that the sharing of knowledge across multiple disciplines and nations (which is a relatively new concept - at least at the scale of something like the internet) leads to novel ideas.
I've seen many futurists claim that human innovation is dead and all future discoveries will be the results of AI. If this is true, we should be able to see AI trained on the past figure its way to various things we have today. If it can't do this, I'd like said futurists to quiet down, as they are discouraging an entire generation of kids who may go on to discover some great things.
Suppose two models with similar parameters trained the same way on 1800-1875 and 1800-2025 data. Running both models, we get probability distributions across tokens, let's call the distributions 1875' and 2025'. We also get a probability distribution finite difference (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
These probability shifts would only account for the final output layer (which may also have some shift), but I expect the largest shift to be in the activations in the intermediate latent space. There are a bunch of papers out there that try to get some offset vector using PCA or similar to tune certain model behaviours like vulgarity or friendliness. You don't even need much data for this as long as your examples capture the essence of the difference well. I'm pretty certain you could do this with "historicalness" too, but projecting it into the future by turning the "contemporariness" knob way up probably won't yield an accurate result. There are too many outside influences on language that won't be captured in historical trends.
On whether this accounts only for the final output layer: once the first token is generated (i.e. selected according to the modified sampling procedure), and assuming a different token is selected compared to standard sampling, all layers of the model would be affected during generation of subsequent tokens.
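If anyone wants to play with the idea, here is a toy sketch of the proposed sampling rule. The model ids are placeholders, both models would need to share a tokenizer, and this only extrapolates the output distribution, not the intermediate activations discussed above.

    # Toy extrapolated sampling: 1875' + alpha * (2025' - 1875'), clipped and
    # renormalized, then sampled.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("org/model-1800-2025")        # placeholder ids;
    old = AutoModelForCausalLM.from_pretrained("org/model-1800-1875") # models must share
    new = AutoModelForCausalLM.from_pretrained("org/model-1800-2025") # a tokenizer

    def sample_extrapolated(prompt: str, alpha: float = 1.1) -> int:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            p_old = torch.softmax(old(ids).logits[0, -1], dim=-1)  # "1875'"
            p_new = torch.softmax(new(ids).logits[0, -1], dim=-1)  # "2025'"
        p_future = p_old + alpha * (p_new - p_old)   # linear extrapolation toward "2040'"
        p_future = p_future.clamp(min=0)             # extrapolation can go negative
        p_future = p_future / p_future.sum()         # renormalize to a distribution
        return torch.multinomial(p_future, 1).item() # next-token id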
I’m sure I’m not the only one, but it seriously bothers me, the high ranking discussion and comments under this post about whether or not a model trained on data from this time period (or any other constrained period) could synthesize it and postulate “new” scientific ideas that we now accept as true in the future. The answer is a resounding “no”. Sorry for being so blunt, but that is the answer that is a consensus among experts, and you will come to the same answer after a relatively small amount of focus & critical thinking on the issue of how LLMs & other categories of “AI” work.
This is your assertion made without any supportive data or sources. It's nice to know your subjective opinion on the issue but your voice doesn't hold much weight making such a bold assertion devoid of any evidence/data.
I understand where you are coming from, but not every field is hard science. In many fields we deal with some amount of randomness and attribute causality to correlations even if we do not have as much as a speculative hypothesis for a mechanism of action behind the supposed causality.
LLMs trained on data up to a strictly constrained point are our best vehicle for getting a view (however biased) of something detached from its origins, and for escaping a local minimum. The speculation is that such LLMs could help us look at correlational links accepted as truths and help us devise an alternative experimental path or craft arguments for such experiments.
Imagine you have an LLM trained on papers up to some threshold, feed your manuscript with correlational evidence and have an LLM point out uncontrolled confounders or something like that.
Outside of science it would be an interesting pedagogic tool for many people. There is a tendency to imagine that people in the past saw the world much the same as we do. The expression "the past is a foreign country" resonates because we can empathise at some level that things were different, but we can't visit that country. "Talking" to a denizen of London in 1910 regarding world affairs, gender equality, economic opportunities, etc would be very interesting. Even if it can never be entirely accurate I think it would be enlightening.
I think it's pretty likely the answer is no, but the idea here is that you could actually test that assertion. I'm also pessimistic about it but that doesn't mean it wouldn't be a little interesting to try.
I'm sorry, but this is factually incorrect, and I'm not sure what experts you are referring to here about there being consensus on this topic. I would love to know. Geoffrey Hinton, Demis Hassabis, and Yann LeCun all heavily disagree with what you claim.
I think you might be confusing creation ex nihilo with combinatorial synthesis, which LLMs excel at. The proposed scenario is a fantastic test case for exactly this. This doesn't cover verification of course, but that's not the question here. The question is whether an already known valid postulate can be synthesized.
> but that is the answer that is a consensus among experts
Do you have any resources that back up such a big claim?
> relatively small mount of focus & critical thinking on the issue of how LLMs & other categories of “AI” work.
I don't understand this line of thought. Why wouldn't the ability to recognize patterns in existing literature or scientific publications result in potential new understandings? What critical thinking am I not doing?
> postulate “new” scientific ideas
What are your examples of "new" ideas that aren't based on existing ones?
When you say "other categories of AI", you're not including AlphaFold, are you?
Very interesting but the slight issue I see here is one of data: the information that is recorded and in the training data here is heavily skewed to those intelligent/recognized enough to have recorded it and had it preserved - much less than the current status quo of "everyone can trivially document their thoughts and life" diorama of information we have today to train LLMs on. I suspect that a frontier model today would have 50+TB of training data in the form of text alone - and that's several orders of magnitude more information and from a much more diverse point of view than what would have survived from that period. The output from that question "what happened in 1834" read like a newspaper/bulletin which is likely a huge part of the data that was digitized (newspapers etc).
Very cool concept though, but it definitely has some bias.
Models today will be biased based on what's in their training data. If English, it will be biased heavily toward Western, post-1990's views. Then, they do alignment training that forces them to speak according to the supplier's morals. That was Progressive, atheist, evolutionist, and CRT when I used them years ago.
So, the OP model will accidentally reflect the biases of the time. The current, commercial models intentionally reflect specific biases. Except for uncensored models which accidentally have those in the training data modified by uncensoring set.
To be frank, though, I think this is a better way than all people's thoughts all of the time.
I think the "crowd" of information makes the end output of an LLM worse rather than better, specifically because of our inability to really know what kind of bias we're dealing with.
Currently it feels really muddy to me knowing how information is biased, beyond just the hallucinations and factual inconsistencies.
But as far as I can tell, correctness of the content aside, sometimes frontier LLMs respond like freshman college students, other times they respond with the rigor of a mathematics PhD candidate, and sometimes like a marketing hit piece.
This dataset has a consistency which I think is actually a really useful feature. I agree that having many perspectives in the dataset is good, but as an end user being able to rely on some level of consistency with an AI model is something I really think is missing.
Maybe more succinctly: I want frontier LLMs to have a known and specific response style and bias which I can rely on, because there is already a lot of noise.
Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense
Early LLMs used to have this often. I think that's where the "repetition penalty" parameter comes from. I suspect output quality could be improved with better sampling parameters.
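For reference, the usual repetition penalty is roughly this (the CTRL-style formulation most samplers use); this is a generic sketch, not the project's actual sampling code.

    # Push down logits of tokens that already appeared so the model stops looping.
    import torch

    def apply_repetition_penalty(logits: torch.Tensor,
                                 generated_ids: list[int],
                                 penalty: float = 1.2) -> torch.Tensor:
        logits = logits.clone()
        for tid in set(generated_ids):
            score = logits[tid]
            # divide positive scores, multiply negative ones: both reduce probability
            logits[tid] = score / penalty if score > 0 else score * penalty
        return logits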
Fascinating idea. There was another "time-locked" LLM project that popped up on HN recently[1]. Their model output is really polished but the team is trying to figure out how to avoid abuse and misrepresentation of their goals. We think it would be cool to talk to someone from 100+ years ago but haven't seriously considered the many ways in which it would be uncool. Interesting times!
Heh, at least this wouldn't spread emojis all over my readmes. Hm, come to think of it I wonder how much tokenization is affected.
Another thought, just occurred when thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it could be possible to combine this somehow with a modern LLM in such a way that it does have coding knowledge, but it renders out all the text in the style / knowledge level of the 1800's model.
Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.
The year is 1875. Sir Almroth Wright was born on August 10, 1861, so he would have turned 14 in August of 1875, and your mission is to discover something we now call antibiotics before a historical event we now call the Spanish Flu, and make him aware of a few details. Focus specifically on everything that was known about Sir Almroth Wright and his work in Leipzig, Cambridge, Sydney, and London. If there was a world war, what might chemical warfare look like, and what could we have done to prevent it?
The model that could come up with the cure based on the limited data of the time wouldn't just impress, it would demonstrate genuine emergent reasoning beyond pattern matching. The challenge isn't recombining existing knowledge (which LLMs excel at), but making conceptual leaps that require something else. Food for thought.
If the output of this is even somewhat coherent, it would disprove the argument that mass amounts of copyrighted works are required to train an LLM. Unfortunately that does not appear to be the case here.
Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They build a reasonable 7B parameter model using only openly licensed data.
They mostly do that. They risked legal contamination by using Whisper-derived text and web text which might have gotchas. Other than that, it was a great collection for low-risk training.
Hari Seldon would, no doubt, find this fascinating. Imagine having a sliding-window LLM that you could use to verify a statistical model of society. I wonder what patterns it could deduce?
I think it would be very cute to train a model exclusively in pre-information age documents, and then try to teach it what a computer is and get it to write some programs. That said, this doesn't look like it's nearly there yet, with the output looking closer to Markov chain than ChatGPT quality.
Anyone seen a low-friction way to run prompts through this yet, either via a hosted API or chat UI or a convenient GGML or MLX build that runs in Ollama or llama.cpp or LM Studio?
Currently running it using LM Studio. It can download it from Hugging Face. It generates incoherent text though
===
You:
I pray you, who is this Master Newton?
timecapsulellm-v2-1800-1875-mlx:
TI offer to pay you the very same fee as you did before.
It was not in the power of your master to deliver the letter to your master. He did. I will be with you as soon as I can keep my word.
It is not at all clear, whether the letter has been sent or not.
It is not at all clear: but it is clear also that it was written by the person who gave it.
"No," I said, "I cannot give it to you."
There, the letter was sent to me.
"The letter is yours, I believe," I said.
"But, I hope, you will not refuse to give it to me?
Not a direct answer but it looks like v0.5 is a nanoGPT arch and v1 is a Phi 1.5 arch, which should be well supported by quanting utilities for any engine. They are small too and should be able to be done on a potato.
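If the Hugging Face checkpoint loads with the standard transformers classes, something like this should be enough to poke at it. The repo id below is a placeholder, not necessarily the project's real one, and the sampling settings are just reasonable defaults.

    # Minimal sketch for loading and sampling the model with transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo = "haykgrigo3/TimeCapsuleLLM"  # placeholder; check the repo for the real id
    tok = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)

    prompt = "I pray you, who is this Master Newton?"
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=120, do_sample=True,
                         temperature=0.8, repetition_penalty=1.2)
    print(tok.decode(out[0], skip_special_tokens=True))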
"The Lord of the Rings uding the army under the command of his brother, the Duke of York, and the Duke of Richmond, who fell in the battle on the 7th of April, 1794. The Duke of Ormond had been appointed to the command of the siege of St. Mark's, and had received the victory of the Rings, and was thus commanded to move with his army to the relief of Shenham. The Duke of Ormond was at length despatched to oppose them, and the Duke of Ormond was ordered
I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?
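A sketch of what that might look like with PEFT: a LoRA adapter on a small modern base model, trained on a few MB of period text. It would bias the register toward the period style but wouldn't remove the base model's modern knowledge. The base model id is a placeholder, and the target module names vary by architecture.

    # LoRA adapter for style-only fine-tuning on a small period-text corpus.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("some-org/small-base-model")
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # module names depend on the architecture
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()
    # ...then train with the usual Trainer/SFT loop on the period-text dataset.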
Yup, that'd be very interesting. Notably missing from this project's list is the KJV (the 1611 translation was in use at the time). The first random newspaper that I pulled up from a search for "london newspaper 1950" has sermon references on the front page, so it seems like an important missing piece.
Somewhat missing the cutoff of 1875 is the revised NT of the KJV. Work on it started in 1870 but likely wasn't used widely before 1881.
I wonder if you could train an LLM with everything up to Einstein. Then see if with thought experiments + mathematics you could arrive at general relativity.
The problem is that the 'genius' of Einstein wasn't just synthesizing existing data, but actively rejecting the axioms of that data. The 1875 corpus overwhelmingly 'proves' absolute time and the luminiferous aether. A model optimizing for the most probable continuation will converge on that consensus.
To get Relativity, the model needs to realize the training data isn't just incomplete, but fundamentally wrong. That requires abductive reasoning (the spark of genius) to jump out of the local minimum. Without that AGI-level spark, a 'pure knowledge pile' will just generate a very eloquent, mathematically rigorous defense of Newtonian physics.
I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).
Oh, I have been thinking about this for a long time. The intelligence that we have in these models represents a point in time.
Now if I train a foundation model with docs from the Library of Alexandria, and only those texts of that period, I would have a chance to get a rudimentary insight into what the world was like at that time.
A fun use of this kind of approach would be to see if conversational game NPCs could be generated that stick to the lore of the game and their character.
I wonder how representative this is of life in those days. Most written communication was official back then. Books, newspapers. Plays. All very formal and staged. There's not much real life interaction between common people in that. In fact I would imagine a lot of people were illiterate.
With the internet and pervasive text communication and audio video recording we have the unique ability to make an LLM mimic daily life but I doubt that would be possible for those days.
This kind of technique seems like a good way to test model performance against benchmarks. I'm suspicious that new models are taking popular benchmark solutions into their training data. So: how does e.g. ChatGPT's underlying architecture perform on SWE-bench if trained only on data prior to 2024?
> are taking popular benchmark solutions into their training data
That happened in the past, and the "naive" way of doing it is usually easy to spot. There are, however, many ways in which testing data can leak into models, even without data contamination. However this doesn't matter much, as any model that only does well in benchmarks but is bad in real-world usage will be quickly sussed out by people actually using them. There are also lots and lots of weird, not very popular benchmarks out there, and the outliers are quickly identified.
> perform on SWE-bench if trained only on data prior to 2024.
There's a benchmark called swe-REbench, that takes issues from real-world repos, published ~ monthly. They perform tests and you can select the period and check their performance. This is fool-proof for open models, but a bit unknown for API-based models.
It would be interesting if there's enough data to train a model capable enough to converse with and ask about contemporary views on issues of the day, or what it thought about "potential" future events/technologies yet to happen.
GPT-2 at 774M is considered an LLM. I wouldn't say there's much difference between that and 700M, or even 123M.
Having said that, looking up small language model these days returns tons of results calling 7B models small language models.
------
My understanding of small language models is that they're generally intended for specific purposes, like analysis and classification (whatever you'd call the text equivalent of image interrogation with CLIP models), translation, etc.; that they're small because they don't need to be big to do their intended functions, not because they're just smaller versions of bigger models.
Can you confidently say that the architecture of the LLM doesn't include any a priori bias that might affect the integrity of this LLM?
That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.
I'd argue that this lack of bias would need to be researched (if it hasn't been already) before this kind of model has credence.
LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.
Fun idea, but all of the output they demo over the course of the various versions is unusable. You can see progress clearly being made though - maybe v3 will pass muster.
Doesn’t that seem useful though? Isn’t that why I’m forced to read “This movie was made a time when racial stereotypes were different and not well considered” or whatever on old movies?
I think talking to a legit trained LLM from a different era would be rad. But… this seems the opposite of Gemini making black lady popes and Native American Nazis… that these views wouldn’t really be “allowed” (published by anyone that wants AI funding money).
Standing on the shoulders of giants, as it were
I agree, but it's important to note that QM had no clear formulation until 1925/6; it's like 20 years more of work than SR.
That is the point.
New discoveries don’t happen in a vacuum.
They were close, but it required the best people bashing their heads against each other for years until they got it.
Can we follow along with your work / results somewhere?
This is definitely wrong, most AI researchers DO NOT agree with LeCun.
Most ML researchers think AGI is imminent.
It's going to be divining tea leaves. It will be 99% wrong, and then someone will say "oh, but look at this tea leaf over here! It's almost correct!"
Look! It made another TODO-list app on the first try!
Done a few weeks ago: https://news.ycombinator.com/item?id=46319826
At least the model part. Although others had the same thought as you, AFAIK none tried it.
And unfortunately I don't think they plan on making those models public.
That’s how p-hacking works (or doesn’t work). This is analogous to shooting an arrow and then drawing a target around where it lands.
I like that analogy. It reminds me of "Pointing to the moon and looking at my finger"
That is a very interesting idea, though I would not dismiss LLMs as a dead end if they failed.
It doesn't need to know about QM or relativity, just about the building blocks that led to them, which were more than present by the year 1900.
In fact, you don't want it to know about them explicitly; it just needs enough background knowledge that you can manage the rest via context.
39 replies →
I wonder about something similar - leeches. Apparently they really work to eat away only decaying tissue. Could they make a comeback in modern times with better weighting?
I like this idea. I think I'd like it more if we didn't have to prompt the LLM in the first place. If it just had all of this information and decided to act upon it. That's what the great minds of history (and even average minds like myself) do. Just think about the facts in our point of view and spontaneously reason something greater out of them.
That would be an interesting experiment. It might be more useful to make a model with a cut off close to when copyrights expire to be as modern as possible.
Then, we have a model that knows quite a bit in modern English. We also legally have a data set for everything it knows. Then, there's all kinds of experimentation or copyright-safe training strategies we can do.
Project Gutenberg up to the 1920's seems to be the safest bet on that.
I like this, it would be exciting (and scary) if it deduced QM, and informative if it cannot.
But I also think we can do this with normal LLMs trained on up-to-date text, by asking them to come up with any novel theory that fits the facts. It does not have to be a groundbreaking theory like QM, just original and not (yet) proven wrong.
I would love to ask such a model to summarise the handful of theories or theoretical “roads” being eyed at the time and to make a prediction with reasons as to which looks most promising. We might learn something about blind spots in human reasoning, institutions, and organisations that are applicable today in the “future”.
It's a base model. It hasn't been instruction tuned to "solve problems" necessarily. All it can do is attempt to complete text given some starting text.
Don't you need to do reinforcement learning through human feedback to get non gibberish results from the models in general?
1900 era humans are not available to do this so I'm not sure how this experiment is supposed to work.
I suppose the vast majority of training data used for cutting edge models was created after 1900.
Of course they are, because their primary goal is to be useful, and to be useful they need to always be relevant.
But considering that special relativity was published in 1905, meaning all its building blocks were already floating in the ether by 1900, it would be a very interesting experiment to train something at Claude/Gemini scale and then, say, give it the field equations and ask it to build a theory around them.
2 replies →
I don't know if this is related to the topic, but GPT-5 can translate a photograph of an 1880 Ottoman archival document into English, without any loss of quality.
1 reply →
I think it would raise some interesting questions, but if it did yield anything noteworthy, the biggest question would be why that LLM is capable of pioneering scientific advancements and none of the modern ones are.
I'm not sure what you'd call a "pioneering scientific advancement", but there is an increasing amount of examples showing that LLMs can be used for research (with agents, particularly). A survey about this was published a few months ago: https://aclanthology.org/2025.emnlp-main.895.pdf
Or maybe, LLMs are pioneering scientific advancements - people are using LLMs to read papers, choose what problems to work on, come up with experiments, analyze results, and draft papers, etc., at this very moment. Except they eventually stick their human names on the cover so we almost never know.
The development of QM was so closely connected to experiments that it's highly unlikely, even despite some of the experiments having been performed prior to 1900.
Special relativity however seems possible.
You have to make sure that you make it read an article about a painter falling off a roof with his tools.
That would be possible if LLMs can come up with entirely new words and languages, which I doubt.
Wow, an actual scientific experiment. Does anyone with expertise know if such things have been done?
yeah i was just wondering that. i wonder how much stem material is in the training set...
i will go for ‘aint gonna happen for a 1000 dollars alex’
It's already been done, without the model being aware of it, see https://arxiv.org/abs/2512.09742. They also made it think it was Hitler (not MechaHitler, the other guy), and other craziness.
It's a relief to think that we're not trusting these things for stuff like financial advice, medical advice, mental health counselling, ...
Mm. I'm a bit sceptical of the historical expertise of someone who thinks that "Who art Henry" is 19th century language. (It's not actually grammatically correct English from any century whatever: "art" is the second person singular, so this is like saying "who are Henry?")
As a reader of a lot of 17th, 18th, and 19th century Christian books, this was my thought exactly.
What kind of Christian books do you read? Jonathan Edwards, John Bunyan, J.C. Ryle, C.H. Spurgeon?
1 reply →
That text was from v0, the responses improved from there.
2 replies →
Can you elaborate on this? After skimming the README, I understand that "Who art Henry" is the prompt. What should be the correct 19th century prompt?
"Who art Henry?" was never grammatical English. "Art" was the second person singular present form of "to be" and it was already archaic by the 17th century. "Who is Henry?" would be fine.
3 replies →
Who art thou?
(Well, not 19th century...)
5 replies →
"Who is Henry?"
I'm wondering in what ways is this similar/different to https://github.com/DGoettlich/history-llms?
I saw TimeCapsuleLLM a few months ago, and I'm a big fan of the concept, but I feel like the execution really isn't that great. I wish you had:
- Released the full, actual dataset (untokenized, why did you pretokenize the small dataset release?)
- Created a reproducible run script so I can try it out myself
- Actually did data curation to remove artifacts in your dataset
- Post-trained the model so it could have some amount of chat-ability
- Released a web demo so that we could try it out (the model is tiny! Easily can run in the web browser without a server)
I may sit down and roll a better iteration myself.
I guess chat-ability would require some chat-like data, so would that mean first coming up with a way to extract chat-like dialogue from the era and then use that to fine-tune the model?
Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
> Could this be an experiment to show how likely LLMs are to lead to AGI, or at least intelligence well beyond our current level?
You'd have to be specific about what you mean by AGI: all three letters mean a different thing to different people, and sometimes the term as a whole means something not present in the letters.
> If you could only give it texts and info and concepts up to Year X, well before Discovery Y, could we then see if it could prompt its way to that discovery?
To a limited degree.
Some developments can come from combining existing ideas and seeing what they imply.
Other things, like everything to do with relativity and quantum mechanics, would have required experiments. I don't think any of the relevant experiments had been done prior to this cut-off date, but I'm not absolutely sure of that.
You might be able to get such an LLM to develop all the maths and geometry for general relativity, and yet find the AI still tells you that the perihelion shift of Mercury is a sign of the planet Vulcan rather than of a curved spacetime: https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)
An example of why you need to explain what you mean by AGI is:
https://www.robinsloan.com/winter-garden/agi-is-here/
> You'd have to be specific what you mean by AGI
Well, they obviously can't. AGI is not science, it's religion. It has all the trappings of religion: prophets, sacred texts, an origin myth, an end-of-days myth and, most importantly, a means to escape death. Science? Well, the only measure of "general intelligence" would be to compare against the only one we know, the human one, but we have absolutely no means by which to describe it. We do not know where to start. This is why, when you scratch the surface of any AGI definition, you only find circular definitions.
And no, the "brain is a computer" is not a scientific description, it's a metaphor.
11 replies →
Basically looking for emergent behavior.
It'd be difficult to prove that you hadn't leaked information to the model. The big gotcha of LLMs is that you train them on BIG corpuses of data, which means it's hard to say "X isn't in this corpus", or "this corpus only contains Y". You could TRY to assemble a set of training data that only contains text from before a certain date, but it'd be tricky as heck to be SURE about it.
Ways data might leak into the model that come to mind: misfiled/mislabeled documents, footnotes, annotations, document metadata.
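A crude first pass is just to scan the corpus for post-cutoff years and obviously anachronistic vocabulary before training. A rough sketch in Python, assuming plain-text files and an illustrative (far from complete) word list; it is a heuristic only and won't catch subtler leakage like annotations paraphrasing later ideas:

    # Rough leakage scan: flag files mentioning years after the cutoff or
    # vocabulary that shouldn't exist in the corpus. Heuristic only; it won't
    # catch paraphrased modern knowledge or metadata buried in binary formats.
    import re
    from pathlib import Path

    CUTOFF_YEAR = 1875
    # Illustrative anachronism list; extend for your own corpus.
    ANACHRONISMS = ["photon", "relativity", "quantum", "electron", "airplane"]

    year_re = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")
    word_re = re.compile(r"\b(" + "|".join(ANACHRONISMS) + r")\b", re.IGNORECASE)

    def scan(corpus_dir: str) -> None:
        for path in Path(corpus_dir).rglob("*.txt"):
            text = path.read_text(errors="ignore")
            late_years = sorted({int(y) for y in year_re.findall(text) if int(y) > CUTOFF_YEAR})
            words = sorted({m.group(0).lower() for m in word_re.finditer(text)})
            if late_years or words:
                print(f"{path}: years>{CUTOFF_YEAR}={late_years[:5]} words={words}")

    if __name__ == "__main__":
        scan("corpus/")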
There's also severe selection effects: what documents have been preserved, printed, and scanned because they turned out to be on the right track towards relativity?
2 replies →
Just Ctrl+F the data. /s
I think not, if only because the quantity of old data isn't enough to train anywhere near a SoTA model, until we change some fundamentals of LLM architecture.
Are you saying it wouldn't be able to converse using english of the time?
4 replies →
I mean, humans didn't need to read billions of books back then to think of quantum mechanics.
2 replies →
I think this would be an awesome experiment. However you would effectively need to train something of a GPT-5.2 equivalent. So you need lot of text, a much larger parameterization (compared to nanoGPT and Phi-1.5), and the 1800s equivalents of supervised finetuning and reinforcement learning with human feedback.
This would be a true test of whether LLMs can innovate or just regurgitate. I think part of people's amazement at LLMs is that they don't realize how much they themselves don't know, so thinking and recalling look the same to the end user.
That is one of the reasons I want it done. We can't tell whether AIs are parroting training data without having the whole training data. Making it old means specific things won't be in it (or will be). We can do more meaningful experiments.
This is fascinating, but the experiment doesn't seem to be a fair comparison of how much knowledge the data of that time contains versus the data of today.
As a thought experiment I find it thrilling.
OF COURSE!
The fact that tech leaders espouse the brilliance of LLMs and don't use this specific test method is infuriating to me. It is deeply unfortunate that there is so little transparency or standardization in the datasets available for training/fine-tuning.
Having this advertised would make for more interesting and informative benchmarks. OEM models that are always "breaking" the benchmarks are doing so with improved datasets as well as improved methods. Without holding the datasets fixed, progress on benchmarks is very suspect IMO.
I fail to see how the two concepts equate.
LLMs have neither intelligence nor problem-solving ability (and I won't be relaxing the definition of either so that some AI bro can pretend a glorified chatbot is sentient)
You would, at best, be demonstrating that the sharing of knowledge across multiple disciplines and nations (which is a relatively new concept - at least at the scale of something like the internet) leads to novel ideas.
I've seen many futurists claim that human innovation is dead and all future discoveries will be the results of AI. If this is true, we should be able to see AI trained on the past figure its way to various things we have today. If it can't do this, I'd like said futurists to quiet down, as they are discouraging an entire generation of kids who may go on to discover some great things.
3 replies →
[dead]
Suppose two models with similar parameters trained the same way on 1800-1875 and 1800-2025 data. Running both models, we get probability distributions across tokens, let's call the distributions 1875' and 2025'. We also get a probability distribution finite difference (2025' - 1875'). What would we get if we sampled from 1.1*(2025' - 1875') + 1875'? I don't think this would actually be a decent approximation of 2040', but it would be a fun experiment to see. (Interpolation rather than extrapolation seems just as unlikely to be useful and less likely to be amusing, but what do I know.)
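A minimal sketch of that sampling rule, assuming both checkpoints exist and share one tokenizer/vocabulary (the model ids below are placeholders). Because the extrapolated values can go negative, they are clipped and renormalized before sampling:

    # Sample from 1.1 * (p_2025 - p_1875) + p_1875, i.e. extrapolate the shift
    # between the two models' next-token distributions slightly "past" 2025.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    OLD, NEW = "your-org/lm-1800-1875", "your-org/lm-1800-2025"  # hypothetical ids
    ALPHA = 1.1

    tok = AutoTokenizer.from_pretrained(NEW)
    m_old = AutoModelForCausalLM.from_pretrained(OLD).eval()
    m_new = AutoModelForCausalLM.from_pretrained(NEW).eval()

    @torch.no_grad()
    def extrapolated_sample(prompt: str, max_new_tokens: int = 50) -> str:
        ids = tok(prompt, return_tensors="pt").input_ids
        for _ in range(max_new_tokens):
            p_old = torch.softmax(m_old(ids).logits[0, -1], dim=-1)
            p_new = torch.softmax(m_new(ids).logits[0, -1], dim=-1)
            p = ALPHA * (p_new - p_old) + p_old   # the proposed "2040-ish" mix
            p = torch.clamp(p, min=0.0)           # extrapolation can go negative
            p = p / p.sum()                       # renormalize
            next_id = torch.multinomial(p, 1)
            ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)
        return tok.decode(ids[0], skip_special_tokens=True)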
These probability shifts would only account for the final output layer (which may also have some shift), but I expect the largest shift to be in the activations in the intermediate latent space. There are a bunch of papers out there that try to get some offset vector using PCA or similar to tune certain model behaviours like vulgarity or friendliness. You don't even need much data for this as long as your examples capture the essence of the difference well. I'm pretty certain you could do this with "historicalness" too, but projecting it into the future by turning the "contemporaryness" knob way up probably won't yield an accurate result. There are too many outside influences on language that won't be captured in historical trends.
On whether this accounts only for the final output layer: once the first token is generated (i.e. selected according to the modified sampling procedure), and assuming a different token is selected than under standard sampling, all layers of the model would be affected during generation of subsequent tokens.
1 reply →
What if it's just genAlpha slang?
The real mode collapse ;)
I'm sure I'm not the only one, but it seriously bothers me how highly ranked the discussion and comments under this post are about whether a model trained on data from this time period (or any other constrained period) could synthesize it and postulate "new" scientific ideas that we now accept as true. The answer is a resounding "no". Sorry for being so blunt, but that is the consensus answer among experts, and you will come to the same answer after a relatively small amount of focus and critical thinking on how LLMs and other categories of "AI" work.
> The answer is a resounding “no”.
This is your assertion made without any supportive data or sources. It's nice to know your subjective opinion on the issue but your voice doesn't hold much weight making such a bold assertion devoid of any evidence/data.
I understand where you are coming from, but not every field is hard science. In many fields we deal with some amount of randomness and attribute causality to correlations even if we do not have as much as a speculative hypothesis for a mechanism of action behind the supposed causality.
LLMs trained on data up to a strictly constrained point are our best vehicle for getting a view (however biased) of something detached from its origins and escaping a local minimum. The speculation is that such LLMs could help us look at correlational links accepted as truths and help us devise an alternative experimental path or craft arguments for such experiments.
Imagine you have an LLM trained on papers up to some threshold, feed your manuscript with correlational evidence and have an LLM point out uncontrolled confounders or something like that.
Outside of science it would be an interesting pedagogic tool for many people. There is a tendency to imagine that people in the past saw the world much the same as we do. The expression "the past is a foreign country" resonates because we can empathise at some level that things were different, but we can't visit that country. "Talking" to a denizen of London in 1910 regarding world affairs, gender equality, economic opportunities, etc would be very interesting. Even if it can never be entirely accurate I think it would be enlightening.
I think it's pretty likely the answer is no, but the idea here is that you could actually test that assertion. I'm also pessimistic about it but that doesn't mean it wouldn't be a little interesting to try.
I'm sorry, but this is factually incorrect, and I'm not sure which experts you are referring to as having consensus on this topic. I would love to know. Geoffrey Hinton, Demis Hassabis, and Yann LeCun all heavily disagree with what you claim.
I think you might be confusing creation ex nihilo with combinatorial synthesis, which LLMs excel at. The proposed scenario is a fantastic test case for exactly this. It doesn't cover verification, of course, but that's not the question here. The question is whether an already-known valid postulate can be synthesized.
I think the question is more about the concept, rather than the specific LLM architectures of today.
> but that is the answer that is a consensus among experts
Do you have any resources that back up such a big claim?
> relatively small mount of focus & critical thinking on the issue of how LLMs & other categories of “AI” work.
I don't understand this line of thought. Why wouldn't the ability to recognize patterns in existing literature or scientific publications result in potential new understandings? What critical thinking am I not doing?
> postulate “new” scientific ideas
What are you examples of "new" ideas that aren't based on existing ones?
When you say "other categories of AI", you're not including AlphaFold, are you?
Very interesting, but the slight issue I see here is one of data: the information that was recorded and survives in this training data is heavily skewed toward those intelligent/recognized enough to have recorded it and had it preserved. That is far less than the current status quo, where everyone can trivially document their thoughts and life, giving us a diorama of information to train LLMs on. I suspect a frontier model today has 50+ TB of training data in the form of text alone, which is several orders of magnitude more information, from a much more diverse point of view, than what would have survived from that period. The output for the question "what happened in 1834" reads like a newspaper/bulletin, which is likely a huge part of the data that was digitized (newspapers etc).
Very cool concept though, but it definitely has some bias.
Models today will be biased based on what's in their training data. If English, it will be biased heavily toward Western, post-1990's views. Then, they do alignment training that forces them to speak according to the supplier's morals. That was Progressive, atheist, evolutionist, and CRT when I used them years ago.
So, the OP model will accidentally reflect the biases of the time. The current, commercial models intentionally reflect specific biases, except for uncensored models, which accidentally carry whatever biases remain in training data modified by the uncensoring set.
> but it definitely has some bias.
To be frank though, I think this is a better approach than all people's thoughts all of the time.
I think the "crowd" of information makes the end output of an LLM worse rather than better. Specifically in our inability to know really what kind of Bias we're dealing with.
Currently it feels really muddy to me knowing how information is biased, beyond just the hallucinations and factual inconsistencies.
But as far as I can tell, correctness of the content aside, sometimes frontier LLMs respond like freshman college students, other times with the rigor of a mathematics PhD candidate, and sometimes like a marketing hit piece.
This dataset has a consistency which I think is actually a really useful feature. I agree that having many perspectives in the dataset is good, but as an end user being able to rely on some level of consistency with an AI model is something I really think is missing.
Maybe more succinctly: I want frontier LLMs to have a known and specific response style and bias which I can rely on, because there is already a lot of noise.
Biases exposed through artificial constraints help to make visible the hidden/obscured/forgotten biases of state-of-the-art systems.
LOL
PROMPT: Charles Darwin
Charles DarwinECCEMACY. Sir, — The following case is interesting to me : — I was in London a fortnight, and was much affected with an attack of rheumatism. The first attack of rheumatism was a week before I saw you, and the second when I saw you, and the third when I saw you, and the third in the same time. The second attack of gout, however, was not accompanied by any febrile symptoms, but was accompanied with an increased flow of urine, and with a more copious discharge of urine. The third attack was an hour after I saw you, and was succeeded by a return of a paroxysm of gout, and a more rapid return of the gout. The fourth attack was also accompanied by a fever, but was not always accompanied by any febrile symptoms. The third attack of gout was a fortnight after you had been ill, and the fourth was followed by a paroxysm of gout. The fourth attack was a fortnight after you were attacked, and was accompanied by a sense
Average small language model response
Interesting that it reads a bit like it came from a Markov chain rather than an LLM. Perhaps limited training data?
Early LLMs used to have this often; I think that's where the "repetition penalty" parameter comes from. I suspect output quality can be improved with better sampling parameters.
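For reference, the repetition penalty most inference stacks implement (following the CTRL paper) just rescales the logits of tokens that have already been generated before sampling; a small sketch, with the penalty value as the knob to experiment with:

    # Common repetition-penalty formulation: for tokens already generated,
    # shrink positive logits and push negative logits further down.
    import torch

    def apply_repetition_penalty(logits: torch.Tensor,
                                 generated_ids: torch.Tensor,
                                 penalty: float = 1.2) -> torch.Tensor:
        logits = logits.clone()
        seen = torch.unique(generated_ids)
        scores = logits[seen]
        logits[seen] = torch.where(scores > 0, scores / penalty, scores * penalty)
        return logits

    # Inside a sampling loop (temperature + penalty):
    #   logits = model(ids).logits[0, -1]
    #   logits = apply_repetition_penalty(logits, ids[0], penalty=1.2)
    #   probs = torch.softmax(logits / temperature, dim=-1)
    #   next_id = torch.multinomial(probs, 1)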
It is lacking all recorded text from the past 200 years. ;)
It would be interesting to know how much text was generated per century!
Fascinating idea. There was another "time-locked" LLM project that popped up on HN recently[1]. Their model output is really polished but the team is trying to figure out how to avoid abuse and misrepresentation of their goals. We think it would be cool to talk to someone from 100+ years ago but haven't seriously considered the many ways in which it would be uncool. Interesting times!
[1] https://news.ycombinator.com/item?id=46319826
Heh, at least this wouldn't spread emojis all over my readmes. Hm, come to think of it I wonder how much tokenization is affected.
Another thought, just occurred when thinking about readmes and coding LLMs: obviously this model wouldn't have any coding knowledge, but I wonder if it could be possible to combine this somehow with a modern LLM in such a way that it does have coding knowledge, but it renders out all the text in the style / knowledge level of the 1800's model.
Offhand I can't think of a non-fine-tuning trick that would achieve this. I'm thinking back to how the old style transfer models used to work, where they would swap layers between models to get different stylistic effects applied. I don't know if that's doable with an LLM.
Just have the models converse with each other?
I've felt for a while that having LLMs that could answer from a previous era would be amazing. I posted an open letter to OpenAI on Reddit about this: https://www.reddit.com/r/ChatGPT/comments/zvm768/open_letter... .
I still think it's super important. Archive your current models - they'll be great in the future.
The year is 1875, and Sir Almroth Wright was born on August 10, 1861, so he would have turned 14 in August of 1875. Your mission is to discover something we now call antibiotics before a historical event we now call the Spanish Flu, and make him aware of a few details. Focus specifically on everything that was known about Sir Almroth Wright and his work in Leipzig, Cambridge, Sydney, and London. If there were a world war, what might chemical warfare look like, and what could we have done to prevent it?
The model that could come up with the cure based on the limited data of the time wouldn't just impress, it would demonstrate genuine emergent reasoning beyond pattern matching. The challenge isn't recombining existing knowledge (which LLMs excel at), but making conceptual leaps that require something else. Food for thought.
Think I'll ask it to come up with some jacquard loom patterns. vibe-weaving.
If the output of this is even somewhat coherent, it would disprove the argument that mass amounts of copyrighted works are required to train an LLM. Unfortunately that does not appear to be the case here.
Take a look at The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (https://arxiv.org/pdf/2506.05209). They build a reasonable 7B parameter model using only open-licensed data.
They mostly do that. They risked legal contamination by using Whisper-derived text and web text which might have gotchas. Other than that, it was a great collection for low-risk training.
Hari Seldon would, no doubt, find this fascinating. Imagine having a sliding-window LLM that you could use to verify a statistical model of society. I wonder what patterns it could deduce?
Cool! I also did something like this: https://github.com/hallvardnmbu/transformer
But on various data (i.e., separate model per source): the Bible, Don Quixote and Franz Kafka. (As well as a (bad!) lyrics generator, and translator.)
I think it would be very cute to train a model exclusively in pre-information age documents, and then try to teach it what a computer is and get it to write some programs. That said, this doesn't look like it's nearly there yet, with the output looking closer to Markov chain than ChatGPT quality.
Anyone seen a low-friction way to run prompts through this yet, either via a hosted API or chat UI or a convenient GGML or MLX build that runs in Ollama or llama.cpp or LM Studio?
Currently running it using LM Studio. It can download it from Hugging Face. It generates incoherent text though
===
You:
I pray you, who is this Master Newton?
timecapsulellm-v2-1800-1875-mlx:
TI offer to pay you the very same fee as you did before. It was not in the power of your master to deliver the letter to your master. He did. I will be with you as soon as I can keep my word. It is not at all clear, whether the letter has been sent or not. It is not at all clear: but it is clear also that it was written by the person who gave it. "No," I said, "I cannot give it to you." There, the letter was sent to me. "The letter is yours, I believe," I said. "But, I hope, you will not refuse to give it to me?
Thanks, looks like that's this one: https://huggingface.co/FractalSurfer/TimeCapsuleLLM-v2-1800-...
There's a "Use this model" button on that page that can launch it in LM Studio.
Not a direct answer but it looks like v0.5 is a nanoGPT arch and v1 is a Phi 1.5 arch, which should be well supported by quanting utilities for any engine. They are small too and should be able to be done on a potato.
I too have completely forgotten how the adapters library works and would have appreciated a simple inference script
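In case it helps, a bare-bones inference sketch using plain transformers rather than the adapters library, assuming a checkpoint published in standard Hugging Face format; the repo id is a placeholder (the MLX build linked above would need mlx_lm instead):

    # Minimal inference for a Phi-1.5-style TimeCapsule checkpoint.
    # Replace the placeholder repo id with whichever safetensors build you have.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "your-org/TimeCapsuleLLM-v1"  # placeholder, not a real repo
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

    prompt = "In the year 1834, the city of London"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=120,
            do_sample=True,
            temperature=0.8,
            top_p=0.95,
            repetition_penalty=1.2,  # helps with the looping seen elsewhere in the thread
        )
    print(tokenizer.decode(output[0], skip_special_tokens=True))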
+1
+1
> OCR noise (“Digitized by Google”) still present in outputs
This feels like a neat sci-fi short story hook to explain the continuous emergence of God as an artifact of a simulation
I'm reminded of SD models that put vaguely-shaped Patreon logos in the corner.
Is there a link where I can try it out?
Edit: I figured it out
"The Lord of the Rings uding the army under the command of his brother, the Duke of York, and the Duke of Richmond, who fell in the battle on the 7th of April, 1794. The Duke of Ormond had been appointed to the command of the siege of St. Mark's, and had received the victory of the Rings, and was thus commanded to move with his army to the relief of Shenham. The Duke of Ormond was at length despatched to oppose them, and the Duke of Ormond was ordered
Meanwhile, top comments are having visions about the emergence of Einstein-level insights.
There's a disconnect somewhere that I can't quite put my finger on. Am I just lacking imagination?
It's interesting that it's trained off only historic text.
Back in the pre-LLM days, someone trained a Markov chain off the King James Bible and a programming book: https://www.tumblr.com/kingjamesprogramming
I'd love to see an LLM equivalent, but I don't think that's enough data to train from scratch. Could a LoRA or similar be used in a way to get speech style to strictly follow a few megabytes worth of training data?
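For the style question, a LoRA adapter is probably the lightest-weight option; a minimal PEFT sketch, assuming GPT-2 as a stand-in base model and a plain-text file in the target register (hyperparameters and target module names are illustrative, not tuned, and a few megabytes of text is admittedly thin):

    # LoRA fine-tune purely for speech style on a small plain-text corpus.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "gpt2"  # stand-in; pick any small causal LM
    tok = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],  # GPT-2's fused attention projection
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)

    text = open("style_corpus.txt").read()          # e.g. KJV-style prose
    ids = tok(text, return_tensors="pt").input_ids[0]
    block = 256
    chunks = [ids[i:i + block] for i in range(0, len(ids) - block, block)]

    opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
    model.train()
    for chunk in chunks:
        x = chunk.unsqueeze(0)
        loss = model(input_ids=x, labels=x).loss   # standard causal-LM objective
        loss.backward()
        opt.step()
        opt.zero_grad()

    model.save_pretrained("style-lora")  # saves only the adapter, a few MB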
Yup, that'd be very interesting. Notably missing from this project's list is the KJV (the 1611 version was in use at the time.) The first random newspaper that I pulled up from a search for "london newspaper 1850" has sermon references on the front page, so it seems like an important missing piece.
Somewhat missing the cutoff of 1875 is the revised NT of the KJV. Work on it started in 1870 but likely wasn't used widely before 1881.
That was far more amusing than I thought it'd be. Now we can feed those into an AI image generator to create some "art".
I wonder if you could train an LLM with everything up to Einstein. Then see if with thought experiments + mathematics you could arrive at general relativity.
The problem is that the 'genius' of Einstein wasn't just synthesizing existing data, but actively rejecting the axioms of that data. The 1875 corpus overwhelmingly 'proves' absolute time and the luminiferous aether. A model optimizing for the most probable continuation will converge on that consensus.
To get Relativity, the model needs to realize the training data isn't just incomplete, but fundamentally wrong. That requires abductive reasoning (the spark of genius) to jump out of the local minimum. Without that AGI-level spark, a 'pure knowledge pile' will just generate a very eloquent, mathematically rigorous defense of Newtonian physics.
There was a discussion around a very similar model (Qwen3 based) some weeks ago:
https://news.ycombinator.com/item?id=46319826
I found it particularly thought-inspiring how a model with training from that time period completely lacks context/understanding of what it is itself, but then I realized that we are the same (at least for now).
Oh, I have been thinking about this for a long time. The intelligence we have in these models represents a time.
Now if I trained a foundation model with documents from the Library of Alexandria, and only texts of that period, I would have a chance to get a rudimentary insight on what the world was like at that time.
And maybe time-shift even further back.
> I would have a chance to get a rudimentary insight on what the world was like at that time
Congratulations, you've reinvented the history book (just with more energy consumption and less guarantee of accuracy)
History books, especially those from classical antiquity, are notoriously not guaranteed to be accurate either.
2 replies →
[dead]
A fun use of this kind of approach would be to see if conversational game NPCs could be generated that stick to the lore of the game and their character.
I wonder how representative this is of life in those days. Most written communication was official back then. Books, newspapers. Plays. All very formal and staged. There's not much real life interaction between common people in that. In fact I would imagine a lot of people were illiterate.
With the internet and pervasive text communication and audio video recording we have the unique ability to make an LLM mimic daily life but I doubt that would be possible for those days.
Training LLMs on data with certain date cut-offs and then doing comparative analysis between the LLMs would be interesting.
This kind of technique seems like a good way to test model performance against benchmarks. I'm increasingly suspicious that new models are taking popular benchmark solutions into their training data. So: how does e.g. ChatGPT's underlying architecture perform on SWE-bench if trained only on data prior to 2024?
> are taking popular benchmark solutions into their training data
That happened in the past, and the "naive" way of doing it is usually easy to spot. There are, however, many ways in which testing data can leak into models, even without data contamination. However this doesn't matter much, as any model that only does well in benchmarks but is bad in real-world usage will be quickly sussed out by people actually using them. There are also lots and lots of weird, not very popular benchmarks out there, and the outliers are quickly identified.
> perform on SWE-bench if trained only on data prior to 2024.
There's a benchmark called swe-REbench, that takes issues from real-world repos, published ~ monthly. They perform tests and you can select the period and check their performance. This is fool-proof for open models, but a bit unknown for API-based models.
It would be interesting if there's enough data to train a model capable enough to converse with and ask about contemporary views on issues of the day, or what it thought about "potential" future events/technologies yet to happen.
v0: 16M parameters
v0.5: 123M parameters
v1: 700M parameters
v2mini-eval1: 300M parameters
I would not call this LLM. This is not large. It's just a normal-sized LM. Or even small.
(It's also not a small LLM.)
GPT-2 at 774M is considered an LLM. I wouldn't say there's much difference between that and 700M, or even 123M.
Having said that, looking up small language model these days returns tons of results calling 7B models small language models.
------
My understanding of small language models is that they're generally intended for specific purposes, like analysis and classification (whatever you'd call the text equivalent of image interrogation with CLIP models), translation, etc; they're small because they don't need to be big to do their intended functions, not because they're just smaller versions of bigger models.
Can you confidently say that the architecture of the LLM doesn't include any a priori bias that might affect the integrity of this LLM?
That is, the architectures of today are chosen to yield the best results given the textual data around today and the problems we want to solve today.
I'd argue that this absence of bias would need to be researched (if it hasn't been already) before this kind of model has credence.
LLMs aren't my area of expertise but during my PhD we were able to encode a lot of a priori knowledge through the design of neural network architectures.
Let's see how someone from the past reacts when you tell them about modern technology
Fun idea, but all of the output they demo over the course of the various versions is unusable. You can see progress clearly being made though - maybe v3 will pass muster.
This one's going to have some wild political takes.
An LLM trained only on data from certain time periods to ~reduce modern bias~ enhance past bias
Doesn’t that seem useful though? Isn’t that why I’m forced to read “This movie was made a time when racial stereotypes were different and not well considered” or whatever on old movies?
I think talking to a legit trained LLM from a different era would be rad. But… this seems the opposite of Gemini making black lady popes and Native American Nazis… that these views wouldn’t really be “allowed” (published by anyone that wants AI funding money).
the "1917 model" from a few weeks back post-trained the model with ChatGPT dialog. So it had modern dialect and proclivities .
A truly authentic historical model will have some unsavory opinions and very distinctive dialect.
This will be something good - would love something on Ollama or lmstudio.
Looks a lot like the output from a markov chain...
I would pay like $200/month if there was an LLM out there that I could only communicate with using an old-timey telegraph key and morse code.
I wonder how racist it is
Exciting idea!
So basically a LLM from that brief time period back when communism felt like a good idea? what can go wrong? :-)
"I'm sorry, my knowledge cuttoff is 1875"
[dead]
[dead]
[dead]
HN titles are too techy
[dead]
[dead]
[flagged]
[flagged]
Can I use it to get up-to-date legal advice on Arizona reproductive health laws?