Comment by sammyd56

8 hours ago

I'm doing a training run right now (started 20 minutes ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij

Will share the resulting model once ready (4 hours from now) for anyone to test inference.

I've uploaded the model here: https://huggingface.co/sdobson/nanochat

I didn't get results as good as Karpathy's (unlucky seed?)

It's fun to play with though...

  User: How many legs does a dog have?
  Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)

  • I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...

    You can run it like this:

      cd /tmp
      git clone https://huggingface.co/sdobson/nanochat
      uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
        --model-dir /tmp/nanochat \
        --prompt "Tell me about dogs."

    • Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. Before that, the model weights didn't get cloned by default for me on macOS.

      % uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
          --model-dir nanochat/ --prompt "who is simonw on hacker news?"
      Using device: cpu
      Loading model from nanochat/model_000650.pt
      Loading metadata from nanochat/meta_000650.json
      Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
      Loading model weights (this may take a minute for a 2GB model)...
      Converting model to float32 for CPU...
      Model loaded successfully!
      Loading tokenizer...
      Tokenizer loaded successfully!

      Prompt: who is simonw on hacker news?
      Encoded to 9 tokens

      Generating...
      --------------------------------------------------
      who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.

      In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
      --------------------------------------------------
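
      The interesting part of the gist is the checkpoint handling. Here's a minimal sketch of the loading step, assuming the file names from the output above (the exact layout of nanochat's checkpoint is a guess on my part):

        import json
        import torch

        model_dir = "nanochat"

        # The model config printed above lives in the meta JSON
        with open(f"{model_dir}/meta_000650.json") as f:
            meta = json.load(f)

        # map_location="cpu" is what lets a checkpoint saved on a CUDA box
        # load on a Mac with no GPU
        state_dict = torch.load(f"{model_dir}/model_000650.pt", map_location="cpu")

        # The saved weights are presumably half precision (hence the
        # "Converting model to float32 for CPU..." line above); most CPU
        # kernels want float32
        state_dict = {k: v.float() for k, v in state_dict.items()}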

    • For anyone curious, this is the error when running uv sync on macOS:

      % uv sync
      Resolved 88 packages in 3ms
      error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform

      hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels

      Also, tmp/nanochat needs to contain everything from both the tokenizer and chatsft_checkpoints folders.

The comment beside the first chart

>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".

is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how well my tokenizer actually compared with how well I imagined it compared.

  • ELI5 for anyone else (I had to have this explained to me):

    When you train a language model, it tries to predict the next token.

    We measure how good it is at that using loss, i.e. how surprised it was by the real answer.

    Different models might use different tokenizers, whose tokens have different average lengths. So if you describe loss relative to tokens, you can't easily compare the performance of two models that tokenize text differently.

    So, compare loss to bytes of text instead.
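
    In code, the conversion is just a change of units. A minimal sketch (the numbers are made up for illustration):

      import math

      def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
          # Cross-entropy comes out of training in nats per token.
          # Total nats -> total bits (divide by ln 2), then normalize by
          # the byte length of the text instead of the token count.
          total_bits = mean_loss_nats * n_tokens / math.log(2)
          return total_bits / n_bytes

      # e.g. a model averaging 1.0 nat/token on text where each token
      # covers ~4.2 bytes on average
      print(bits_per_byte(1.0, n_tokens=1000, n_bytes=4200))  # ~0.34 bpb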

  • Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?

    Or would the loss of efficiency make it dumber than modern tokenizers?

    • Tokenizers used to be 1 character per token. Then Google implemented subword encoding[1] in their early neural translation work and found it was much better.

      Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size, though.

      [1] https://aclanthology.org/P16-1162/
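
      To make the idea concrete, here's a toy byte-pair-merge sketch in the spirit of subword encoding (not the paper's exact algorithm; the corpus is made up):

        from collections import Counter

        def most_frequent_pair(words):
            # Count adjacent symbol pairs across the (word -> count) corpus
            pairs = Counter()
            for symbols, count in words.items():
                for a, b in zip(symbols, symbols[1:]):
                    pairs[(a, b)] += count
            return pairs.most_common(1)[0][0] if pairs else None

        def merge(words, pair):
            # Fuse every occurrence of `pair` into a single symbol
            merged = {}
            for symbols, count in words.items():
                out, i = [], 0
                while i < len(symbols):
                    if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                        out.append(symbols[i] + symbols[i + 1])
                        i += 2
                    else:
                        out.append(symbols[i])
                        i += 1
                merged[tuple(out)] = count
            return merged

        # Start from single characters; each merge grows a subword vocabulary
        words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
        for _ in range(4):
            pair = most_frequent_pair(words)
            if pair is None:
                break
            words = merge(words, pair)
        print(words)  # splits like ('lo', 'wer'), ('lo', 'we', 's', 't'), ('ne', 'wer')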

    • Yes to both.

      It absolutely requires longer training time and more compute.

      Once trained, predictions also need to hold across many more steps, because each step processes one token. If a token early in a sentence heavily implies that some token will occur later in the sentence, that awareness has to be maintained while processing each intermediary token, and each step is a bit lossy. The fewer steps you need to take before leveraging that knowledge, the better the prediction.

      If you had infinite compute and data for training, though, I think performance would be equivalent.

    • Since OpenAI's tokenizer is estimated at ~4.2 characters per token, with your proposed 1-character-per-token tokenizer the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
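
      The ratio is easy to measure yourself. A quick sketch using the tiktoken library (the ~4.2 figure will vary with the text you feed it):

        import tiktoken  # pip install tiktoken

        text = "Tell me about dogs. " * 100  # any sample; longer is more representative
        enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer
        tokens = enc.encode(text)
        print(len(text) / len(tokens), "characters per token")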

Cool. Is there a simple "howto" on running this repo with training tracked on W&B, for a programmer like me who has never done a model-training flow? Maybe you could share the steps you took?

  • There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR

For the measures that drop exponentially, like val/bpb and train/loss, you should put the x-axis on a log scale. That will better show whether training has converged.
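
If you're plotting exported metrics yourself rather than in the W&B UI, it's one line in matplotlib. A minimal sketch, with synthetic data standing in for an exported val/bpb curve:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic stand-in for a metric exported from W&B (e.g. as CSV)
    steps = np.arange(1, 5001)
    bpb = 0.8 + 1.5 * np.exp(-steps / 800.0)

    plt.plot(steps, bpb)
    plt.xscale("log")  # log-scale x-axis makes it obvious whether the tail is still falling
    plt.xlabel("step")
    plt.ylabel("val/bpb")
    plt.show()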