Comment by c0rruptbytes

15 days ago

Been testing Forge with ternary bonsai 8b mlx 2bit. Pretty sweet even if the model is limited. Real potential with this project, good luck!!

  - Broad slice:
      - Full Forge: 48/72 accurate, 72/72 complete, score 66.7%
      - Bare: 18/72 accurate, 24/72 complete, score 25.0%
      - Lift: +30 correct runs, no paired regressions
      - Bare had 42 ToolCallErrors and 6 ToolExecutionErrors; full Forge had none.
  - Advanced reasoning:
      - Full Forge: 3/24 accurate, 24/24 complete, score 12.5%
      - Bare: 3/24 accurate, 9/24 complete, score 12.5%
      - Lift: completion improved (9/24 to 24/24), but accuracy did not (3/24 in both).
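
For anyone wanting to sanity-check the numbers above, here's a rough sketch of the arithmetic. It assumes "score" is just accurate runs divided by total runs (that matches every figure listed), and that the Bare error counts account for all of its incomplete broad-slice runs. The function and dictionary names are made up for illustration and are not part of Forge.

```python
# Assumption: score = accurate runs / total runs, as a percentage with one decimal.
# All names here are hypothetical; the counts are copied from the results above.

def score(accurate: int, total: int) -> float:
    """Percentage of runs that were accurate, rounded to one decimal."""
    return round(100 * accurate / total, 1)

RESULTS = {
    "broad": {
        "total": 72,
        "forge": {"accurate": 48, "complete": 72},
        "bare": {"accurate": 18, "complete": 24},
    },
    "advanced": {
        "total": 24,
        "forge": {"accurate": 3, "complete": 24},
        "bare": {"accurate": 3, "complete": 9},
    },
}

def summarize(name: str) -> dict:
    """Recompute scores and the Forge-minus-Bare differences for one slice."""
    s = RESULTS[name]
    total = s["total"]
    return {
        "forge_score": score(s["forge"]["accurate"], total),
        "bare_score": score(s["bare"]["accurate"], total),
        "accuracy_lift": s["forge"]["accurate"] - s["bare"]["accurate"],
        "completion_lift": s["forge"]["complete"] - s["bare"]["complete"],
        "bare_incomplete": total - s["bare"]["complete"],
    }

if __name__ == "__main__":
    for name in RESULTS:
        print(name, summarize(name))
```

On the broad slice, the 48 incomplete Bare runs line up with the 42 ToolCallErrors plus 6 ToolExecutionErrors, which suggests (but doesn't prove) one error per failed run.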