Comment by uxcolumbo

4 days ago

Are there any cheaper alternatives to Databricks, EC2, DynamoDB, S3 solution? Where cost is more predictable and controlled?

What's a good roll your own solution? DB storage doesn't need to be dynamic like with DynamoDB. At max 1TB - maybe double in the future.

Could this be done on a mid size VPS (32GB RAM) hosting Apache Spark etc - or better to have a couple?

P.S. total beginner in this space, hence the (naive) question.

Depends on how you define cheaper - you could set up Apache Iceberg, Spark, MLflow, Airflow, JupyterLab, etc. and create an abomination that sort of looks like Databricks if you squint, but then you have to deal with setup, maintenance, support, etc.
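Just to give a flavour of what "rolling your own" starts with, here is a deliberately incomplete docker-compose sketch. The image names are the public ones, but versions, volumes, networking, auth, workers and the Iceberg catalog service are all left out on purpose - wiring those up (and keeping them running) is exactly the maintenance burden described above:

```yaml
# Hypothetical, incomplete sketch - not a working deployment.
services:
  spark-master:
    image: apache/spark
    command: ["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"]
    ports: ["7077:7077"]
  mlflow:
    image: ghcr.io/mlflow/mlflow
    command: ["mlflow", "server", "--host", "0.0.0.0"]
    ports: ["5000:5000"]
  airflow:
    image: apache/airflow
    command: ["standalone"]   # dev-only mode; real setups need a scheduler, executor, DB...
    ports: ["8080:8080"]
  jupyter:
    image: jupyter/minimal-notebook
    ports: ["8888:8888"]
```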

Computationally speaking - again, it depends on what your company does. Collect a lot of data? You need a lot of storage.

Train ML models? You will need GPUs - and you need to think about how to utilise those GPUs.

Or... you could pay Databricks, log in and start working.

I worked at a company that tried to roll their own; they wasted about a year on it, and the result was flaky as hell and fell apart. Self-hosting makes sense if you have the people to manage it, but the vast majority of medium-sized companies will have engineers who think they can manage this, try it, fail and move on to another company.

  • Don't worry, most places go straight with Databricks and get a flaky-as-hell system that falls apart anyway, but then they can blame Databricks instead of their own incompetence.

    • I'm surprised at how often this is reality. A bureaucrat at the top of the decision tree smiles smugly while describing how easily they're accomplishing <goal> with <system>. I've been that bureaucrat too many times.

    • Yeah, where IT blocks half of the config, and you disable half of the features that could make it great, just to make sure they definitely don't give control to... GASP... A DATA ENGINEER.

      2 replies →

I don't think there is anything out there that bundles everything exactly the way Databricks does.

There are better storage solutions, better compute and better AI/ML platforms, but once you start with Databricks, you dig yourself into a hole: replacing it is hard because it has such a specific subset of features across multiple domains.

In our multinational environment, we have a few companies on different tech stacks (a result of M&A). I can say Snowflake can do a lot of what Databricks does, but not everything. Teradata is also great and somehow not gaining much traction, but it is near impossible to get into as a startup, which does not attract new talent to give it a go.

On the ML side, Dataiku and DataRobot are great.

Tools like Talend, SnapLogic and Fivetran are also really good at replacing parts of Databricks.

So you see, there are better alternatives for sure, and cheaper ones at that, but there is no drop-in replacement I can think of.

  • Exactly this. But you don't really want to bundle straight away -- think about the exact problem you have and then solve exactly that problem. After you've solved a few problems like this, consider whether a bundled platform is useful.

  • Thanks for this. Lots to look into.

    Maybe I wasn't super clear. Wasn't looking for a 1:1 replacement.

    Trying to understand what other options are out there for small teams / projects that don't need all those enterprise features Databricks offers (governance etc.).

For a few TB of data, well partitioned and stored in Parquet or some such format, you could just use DuckDB on a single node.

It's been mentioned, but I want to add that the original idea in the post (a mid-size VPS hosting Apache Spark) might be missing that Spark is designed for distributed and resilient work (if a node fails, the framework can avoid losing that work).

If you don't need these features, especially the distributed one, going tall (a single instance with high capacity, replicated when necessary) or going simpler (multiple servers, but without Spark coordinating the work) could be good options, depending on your/the team's knowledge.

Exasol costs us a fraction of what we used to pay for Databricks, and that is even with us serving far more users than we used to (from a data-size perspective we are not at petabyte scale yet, but we're getting there).