Comment by CalmStorm

7 months ago

For the first time, it introduced native sparse attention into the full training process, achieving up to 11× inference speedup while maintaining model performance.

0 comments