Comment by AJRF

4 days ago

Depends on how you define cheaper - you could set up Apache Iceberg, Spark, MLflow, Airflow, JupyterLab, etc. and create an abomination that sort of looks like Databricks if you squint, but then you have to deal with setup, maintenance, support, etc.

Computationally speaking - again, it depends on what your company does. Collect a lot of data? You need a lot of storage.

Train ML models? You will need GPUs - and you need to think about how to utilise those GPUs.

Or... you could pay Databricks, log in and start working.

I worked at a company that tried to roll their own. They wasted about a year on it, and it was flaky as hell and fell apart. Self-hosting makes sense if you have the people to manage it, but the vast majority of medium-sized companies will have engineers who think they can manage this, try it, fail and move on to another company.

Don't worry, most places go straight with Databricks and get a flaky as hell system that falls apart anyway, but then they can blame Databricks instead of their own incompetence.

  • I'm surprised at how often this is reality. Bureaucrat at the top of the decision tree smiles smugly while describing how easily they're accomplishing <goal> with <system>. I've been that bureaucrat too many times.

  • Yeah, where IT blocks half of the config, and you disable half of the features that could make it great, just to make sure they definitely don't give control to... GASP... A DATA ENGINEER