Comment by pmarreck
11 hours ago
The following was generated by chatG5:
Qwen3-Next — A family of large language models from Qwen (Alibaba).
DeepSeek R1 — Another large open-source language model from DeepSeek AI.
Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper (sketched in code after this list).
MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.
Embedding — Converts words/tokens into vectors (numbers) the model can work with.
Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.
embed_tokens — The big lookup table of embeddings (token → vector).
shared_head.head tensors — Extra output-projection weight matrices stored for the prediction head; at full vocabulary size they can be huge.
[129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).
FP8 — Floating-point format using 8 bits (compact, faster, less precise).
Active parameters — The weights that are actually used (and must be loaded in GPU memory) for each forward pass, as opposed to the model's total parameter count.
Inference — Running the model to generate text (as opposed to training it).
GB savings — If you avoid duplicating giant matrices, e.g. by sharing one [129280, 7168] tensor between embedding and un-embedding, you save GPU memory; in FP8 each avoided copy is roughly 0.9 GB (see the sketches after this list).
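
To make the linear-attention entry concrete, here is a minimal Python/NumPy sketch of the idea (my own illustration of kernelized linear attention, not Qwen3-Next's actual kernel): softmax attention materializes an n-by-n score matrix, while the linear variant reassociates the matrix products so the cost grows with n rather than n squared.

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard attention: the (n, n) score matrix makes cost grow
        # quadratically with sequence length n.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])                 # (n, n)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V          # (n, d)

    def linear_attention(Q, K, V, eps=1e-6):
        # Kernelized variant (elu(x)+1 feature map): reassociate the
        # matmuls so no (n, n) matrix is ever formed; cost is O(n * d^2).
        phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
        Qf, Kf = phi(Q), phi(K)
        kv = Kf.T @ V                        # (d, d), independent of n
        z = Qf @ Kf.sum(axis=0) + eps        # (n,) normalizer
        return (Qf @ kv) / z[:, None]        # (n, d)

    n, d = 1024, 64
    rng = np.random.default_rng(0)
    Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)

The two functions don't return identical values (the feature map only approximates softmax weighting); the point is the different cost profile as n grows.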
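
And to ground the embedding / un-embedding / FP8 / GB-savings entries, a minimal PyTorch sketch of weight tying plus the back-of-envelope arithmetic (TinyLM is a hypothetical toy class of my own, not the actual Qwen3-Next or DeepSeek code):

    import torch.nn as nn

    VOCAB, HIDDEN = 129280, 7168  # the [129280, 7168] shape from the glossary

    class TinyLM(nn.Module):
        # Toy decoder skeleton; the only point is the shared matrix.
        def __init__(self, vocab=VOCAB, hidden=HIDDEN, tie_weights=True):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab, hidden)      # token -> vector
            self.lm_head = nn.Linear(hidden, vocab, bias=False)  # vector -> logits (un-embedding)
            if tie_weights:
                # Reuse one [vocab, hidden] tensor for both directions,
                # so only a single copy has to live in GPU memory.
                self.lm_head.weight = self.embed_tokens.weight

    # Back-of-envelope memory for one such matrix in FP8 (1 byte per weight):
    bytes_per_copy = VOCAB * HIDDEN
    print(f"one [{VOCAB}, {HIDDEN}] matrix in FP8 ≈ {bytes_per_copy / 1e9:.2f} GB")
    # -> about 0.93 GB; not storing a second copy (e.g. reusing the main
    #    un-embedding for an MTP head) is where a ~1 GB saving would come from.

Whether a given model actually ties or reuses these matrices is a design choice; the arithmetic just shows why people care at this vocabulary size.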