Comment by ekianjo

1 month ago

Which tests are worse?

ekianjo

Hard to tell. They only mention a few that got better, with no clear results on the others.

You can check the results for Devstral here. I'm limited by speed, so these are the results for only the first 50 tests of each task in this command:

  # Run lm-evaluation-harness
  lm_eval --model local-chat-completions \
      --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
      --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
      --apply_chat_template --limit 50 \
      --output_path ./eval_results
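
Since `--limit 50` scores only 50 samples per task, small differences between models can be plain noise. Here is a rough sketch of the binomial standard error for an accuracy-style metric. The function name `accuracy_stderr` and the example accuracy of 0.5 are made up for illustration and are not results from this run.

```shell
# Hypothetical helper: standard error of an accuracy p measured on n samples,
# using the binomial approximation sqrt(p * (1 - p) / n).
accuracy_stderr() {
  LC_ALL=C awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", sqrt(p * (1 - p) / n) }'
}

# Example: an assumed accuracy of 0.5 measured on 50 samples
accuracy_stderr 0.5 50
```

With these assumed numbers the standard error is about 0.071, so two scores a few points apart on 50 samples probably can't be told apart.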