
Comment by pplonski86

1 day ago

Amazing! Great work. Congratulations on the launch.

A few questions:

1. Can it work with tabular data, images, text, and audio?
2. Is the data preprocessing code deployed with the model?
3. Have you tested use cases where an ML model was not needed? For example, you could simply go with the average. I'm curious whether the agent can propose not to use ML in such a case.
4. Do you have an agent for model interpretation?
5. Are you using a generic LLM, or do you have your own LLM tuned on ML tasks?

Thanks! Great set of questions:

1. Tabular data only, for now. Text and images also work if they're in a table, but unfortunately not unstructured text or folders of loose image files. Full support for images, video, audio, etc. is coming in the near future.

2. Input pre-processing is deployed in the model endpoint to ensure feature engineering is applied consistently across training and inference. Once a model is built, you can see the inference code in the UI, and you'll notice the pre-processing code mirrors the feature engineering code (there's a rough sketch of the idea below, after this list). If you meant something like deploying scheduled batch jobs for feature processing, we don't support that yet, but it's in our plans!

3. The agent isn't explicitly instructed to "push back" on using ML, but it is instructed to develop a predictor that is as simple and lightweight as possible, which includes trying simple baseline heuristics (average, most popular class, etc.). Whatever performs best on the test set is selected as the final predictor, and that could just be the baseline heuristic if none of the models outperform it (see the selection sketch below). I like the idea of explicitly pushing back on developing a model if the use case clearly doesn't call for it!

4. Yes, we have a model evaluator agent that runs an extensive battery of tests on the final model to understand things like robustness to missing data, feature importance, biases, etc. (there's a sketch of that kind of check below). You can find all the info in the "Evaluations" tab of a built model. I'm guessing this is close to what you meant by "model interpretation"?

5. A mix of generic and fine-tuned models, and we're actively experimenting with the best models to power each of the agents in the workflow (a toy routing sketch is below). Unsurprisingly, our experience has been that Anthropic's models (Sonnet 4.5 and Haiku 4.5) are best at "coding-heavy" tasks like writing a model's training code, while OpenAI's models seem to work better on more "analytical" tasks like reviewing results for logical correctness and writing concise data analysis scripts. Fine-tuning for our specific tasks is, however, an important part of our implementation strategy.
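To make (2) concrete, here's a rough Python sketch of what "pre-processing shipped with the model" means in practice: the feature engineering and the estimator are serialized as one artifact, so the endpoint can't drift from the training-time transforms. The dataset, column names, and scikit-learn pipeline are placeholders for the idea, not our actual inference code.

    import joblib
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical tabular dataset; the column names are invented for illustration.
    train_df = pd.DataFrame({
        "age": [34, 51, None, 29, 45, 62],
        "income": [40_000, 82_000, 55_000, None, 61_000, 90_000],
        "plan_type": ["basic", "pro", "pro", "basic", "basic", "pro"],
        "churned": [0, 1, 0, 0, 1, 1],
    })
    numeric_cols = ["age", "income"]
    categorical_cols = ["plan_type"]

    # Feature engineering lives inside the pipeline, so training and the
    # deployed endpoint apply exactly the same transforms.
    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    model = Pipeline([("preprocess", preprocess),
                      ("clf", GradientBoostingClassifier(random_state=0))])
    model.fit(train_df[numeric_cols + categorical_cols], train_df["churned"])

    # One serialized artifact: pre-processing and estimator travel together.
    joblib.dump(model, "model_with_preprocessing.joblib")

    # At inference time, raw rows go straight in; no separate feature script.
    new_rows = pd.DataFrame({"age": [41], "income": [58_000], "plan_type": ["pro"]})
    print(joblib.load("model_with_preprocessing.joblib").predict(new_rows))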
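For (3), the selection step is conceptually as simple as the sketch below: fit trivial baselines alongside real models, score everything on the same held-out test set, and keep whichever scores best, even if that turns out to be the baseline. Synthetic data and illustrative candidates only.

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier          # "most popular class" baseline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # For regression, DummyRegressor(strategy="mean") would be the
    # "just use the average" baseline.
    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    candidates = {
        "baseline_most_frequent": DummyClassifier(strategy="most_frequent"),
        "random_forest": RandomForestClassifier(random_state=0),
    }

    # Score every candidate on the same held-out test set.
    scores = {}
    for name, estimator in candidates.items():
        estimator.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, estimator.predict(X_test))

    # The final predictor is whichever scored best; that can be the baseline.
    best = max(scores, key=scores.get)
    print(scores, "-> selected:", best)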
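For (4), this is the flavor of check the evaluator runs, shown here as permutation feature importance plus a quick robustness probe that masks a fraction of test values as missing. The model and dataset are toy stand-ins, not our evaluator's actual code.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier  # handles NaNs natively
    from sklearn.inspection import permutation_importance
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = HistGradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    base_acc = accuracy_score(y_test, model.predict(X_test))

    # Feature importance: how much does shuffling each feature hurt the score?
    imp = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    print("permutation importances:", np.round(imp.importances_mean, 3))

    # Robustness: re-score after masking 20% of the test values as missing.
    rng = np.random.default_rng(0)
    X_missing = X_test.copy()
    X_missing[rng.random(X_missing.shape) < 0.2] = np.nan
    print("accuracy:", round(base_acc, 3),
          "with 20% missing:", round(accuracy_score(y_test, model.predict(X_missing)), 3))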
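And for (5), the per-agent routing itself is conceptually just a task-to-model lookup, something like the toy sketch below. The task labels and model identifiers are placeholders for illustration, not our real configuration.

    # Placeholder task labels and model ids; not our actual routing table.
    TASK_TO_MODEL = {
        "write_training_code": "sonnet-4.5",        # coding-heavy task
        "small_code_edits": "haiku-4.5",
        "review_results_logic": "openai-analytical",  # placeholder id
        "data_analysis_script": "openai-analytical",
    }

    def pick_model(task: str, default: str = "sonnet-4.5") -> str:
        """Return the LLM that should power a given agent task."""
        return TASK_TO_MODEL.get(task, default)

    print(pick_model("write_training_code"))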

Hope this covers all your questions!

Thanks a lot! On a side note: big fan of mljar here. When we were initially playing around with using agents to automate ML tasks, we used problems from OpenML's AutoML benchmark, which you had posted about on Reddit, for our initial tests.