Comment by XCSme, 1 month ago:
But if it got worse on other tests, it doesn't do much good, right?

ekianjo, 1 month ago:
Which tests are worse?

XCSme, 1 month ago:
Hard to tell; they only mention a few that got better, with no clear results on the others.

xlayn, 1 month ago:
You can check here the results for Devstral. Speed limits me, so these are only the results for the first 50 tests of the command:

# Run lm-evaluation-harness
lm_eval --model local-chat-completions \
  --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
  --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
  --apply_chat_template --limit 50 \
  --output_path ./eval_results
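To compare runs like this side by side, the JSON that lm_eval writes under --output_path can be flattened into a per-task metric table. A minimal sketch, assuming the harness's usual file layout of a top-level "results" dict keyed by task name; the task names mirror the command above, but the sample metric names and numbers here are made up for illustration:

```python
import json

# Hypothetical sample in the shape lm_eval typically writes to
# <output_path>/results_*.json: a "results" dict keyed by task name,
# each mapping metric names to scores.
sample = {
    "results": {
        "gsm8k_cot": {"exact_match": 0.72},
        "ifeval": {"prompt_level_strict_acc": 0.58},
    }
}

def summarize(results_json):
    """Flatten the per-task metric dicts into sorted (task, metric, value) rows."""
    rows = []
    for task, metrics in results_json["results"].items():
        for metric, value in metrics.items():
            # Skip non-numeric entries such as stderr strings or aliases.
            if isinstance(value, (int, float)):
                rows.append((task, metric, value))
    return sorted(rows)

for task, metric, value in summarize(sample):
    print(f"{task:12s} {metric:28s} {value:.3f}")
```

Loading a second run's JSON with json.load and diffing the two summaries would make the "which tests got worse" question answerable directly.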