Comment by pmarreck

11 hours ago

The following was generated by chatG5:

    Qwen3-Next — A family of large language models from Qwen (Alibaba).  
    DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
    Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper (a minimal sketch follows this list).  
    MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
    Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
    Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
    embed_tokens — The big lookup table of embeddings (token → vector); the second sketch after this list shows it next to the un-embedding step.  
    shared_head.head tensors — Extra output-projection weight matrices used for prediction (e.g. by an MTP head); they can be huge.  
    [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
    FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
    Active parameters — The subset of weights actually used to process each token (in a mixture-of-experts model only a few experts run per token), as opposed to the total parameter count.  
    Inference — Running the model to generate text (as opposed to training it).  
    GB savings — If you avoid duplicating giant matrices, you save gigabytes of GPU memory and speed things up (the arithmetic after this list makes the sizes concrete).
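
To make a couple of those terms concrete: here is a minimal, non-causal linear-attention sketch in PyTorch, using the elu(x)+1 feature map from Katharopoulos et al. It only illustrates the O(n) associativity trick, not the specific gated variant Qwen3-Next actually uses.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v):
        # q, k, v: [batch, seq, heads, dim]; non-causal for brevity
        # (a causal version keeps a running prefix sum of k^T v instead).
        phi = lambda x: F.elu(x) + 1.0          # positive feature map
        q, k = phi(q), phi(k)
        # Associativity: (q k^T) v == q (k^T v); the right-hand side costs
        # O(seq * dim^2) instead of O(seq^2 * dim).
        kv = torch.einsum('bshd,bshe->bhde', k, v)
        z = 1.0 / (torch.einsum('bshd,bhd->bsh', q, k.sum(dim=1)) + 1e-6)
        return torch.einsum('bshd,bhde,bsh->bshe', q, kv, z)

    out = linear_attention(torch.randn(2, 1024, 8, 64),
                           torch.randn(2, 1024, 8, 64),
                           torch.randn(2, 1024, 8, 64))
    print(out.shape)   # torch.Size([2, 1024, 8, 64])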
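And a toy sketch of embedding, un-embedding, and weight tying (the sharing trick behind the "GB savings" point). The sizes here are deliberately small; the real tensor in question is [129280, 7168].

    import torch
    import torch.nn as nn

    vocab, hidden = 1000, 64                        # toy sizes; the real matrix is [129280, 7168]
    embed_tokens = nn.Embedding(vocab, hidden)      # lookup table: token id -> vector
    lm_head = nn.Linear(hidden, vocab, bias=False)  # un-embedding: vector -> vocab logits

    # Weight tying: reuse the embedding matrix as the output projection,
    # so only one [vocab, hidden] tensor has to be stored.
    lm_head.weight = embed_tokens.weight

    ids = torch.tensor([[7, 42, 314]])
    h = embed_tokens(ids)                           # [1, 3, hidden]
    logits = lm_head(h)                             # [1, 3, vocab]
    print(h.shape, logits.shape)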
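Finally, the back-of-envelope arithmetic for what one such matrix costs per copy (my numbers, just multiplying the quoted shape by bytes per weight):

    rows, cols = 129280, 7168               # [vocab, hidden], as quoted above
    params = rows * cols                    # ~0.93 billion weights

    for name, bytes_per_weight in [("FP32", 4), ("FP16/BF16", 2), ("FP8", 1)]:
        gib = params * bytes_per_weight / 2**30
        print(f"{name:9s} ~{gib:.2f} GiB per copy")
    # FP32 ~3.45 GiB, FP16/BF16 ~1.73 GiB, FP8 ~0.86 GiB.
    # Storing this matrix twice (e.g. once as embed_tokens and once as an
    # MTP head's un-embedding) doubles the cost; sharing it saves the rest.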