XCSme 1 month ago
Hard to tell. They only mention a few that got better, and give no clear results on the others.
xlayn 1 month ago
You can check the results for Devstral here. Speed limits me, but these are the results for the first 50 tests of this command:

    # Run lm-evaluation-harness
    lm_eval --model local-chat-completions \
      --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
      --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp \
      --apply_chat_template --limit 50 \
      --output_path ./eval_results
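Side note: `mbpp` is listed twice in the `--tasks` argument. I am not sure how lm_eval treats a repeated task name, so here is a minimal sketch for deduplicating the list before passing it in. The variable names `TASKS` and `UNIQUE_TASKS` are just placeholders for illustration, not anything lm_eval defines.

```shell
# Task list copied from the command, including the repeated "mbpp".
TASKS="gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp"

# Split on commas, keep only the first occurrence of each name
# (order preserved), then join back into a comma-separated string.
UNIQUE_TASKS=$(printf '%s' "$TASKS" | tr ',' '\n' | awk '!seen[$0]++' | paste -sd, -)

echo "$UNIQUE_TASKS"
```

The result could then be passed as `--tasks "$UNIQUE_TASKS"` in place of the literal list.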