Comment by retinaros
4 days ago
I've always struggled to understand how you get a company to adopt a platform like Databricks to "manage data". Isn't managing data a minefield with plenty of open-source pieces of software that serve different purposes? Who is the typical Databricks customer?
I think that's the main offering of Databricks: you get a "data platform in a box", and navigating the forest of piecemeal solutions is replaced with telling your data science and analytics teams to "use Databricks".
It's easy to look on, knowing lots about data tools, and say "this could be done better with open-source tools for a fraction of the cost", but if you're not a big tech company, hiring a team to manage your data platform for 5 analysts is probably a lot more expensive than just buying Databricks.
What exactly is a "data platform"?
We have a large postgres server running on a dedicated server that handles millions of users, billions of record updates and inserts per day, and when I want to run an analysis I just open up psql. I wrote some dashboards and alerting in python that took a few hours to spin up. If we ever ran into load issues, we'd just set up some basic replication. It's all very simple and can easily scale further.
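The dashboards-and-alerting setup described above can be sketched in a few lines. This is a minimal illustration, not the commenter's actual code: the table and column names (`events`, `created_at`) and the threshold are invented for the example, and the Postgres driver is assumed to be `psycopg2`.

```python
def should_alert(inserts_last_hour: int, expected_min: int = 10_000) -> bool:
    """Flag when hourly insert volume drops below an expected floor."""
    return inserts_last_hour < expected_min

def check_insert_volume(dsn: str) -> bool:
    # Third-party driver, assumed available where this runs.
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Hypothetical schema: an "events" table with a created_at timestamp.
        cur.execute(
            "SELECT count(*) FROM events "
            "WHERE created_at > now() - interval '1 hour'"
        )
        (count,) = cur.fetchone()
    return should_alert(count)
```

Run this on a cron schedule and page someone when it returns `True`; that is roughly the "few hours to spin up" class of tooling the comment is talking about.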
Sounds like you have the benefit of a nicely designed server and good practices. A lot of companies aren't the same.
Imagine you're a big company with loads of teams/departments running multiple different types of SQL servers for data reporting, plus some Parquet data lakes, and, just for fun, why not a bunch of CSVs.
Getting data from all these locations becomes a full-time job, so at some point someone wants a tool/UI that lets data analysts log into a single thing and get the experience that you currently have with one Postgres server.
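The consolidation chore described above can be illustrated with a stdlib-only toy: stitching a CSV export and a transactional table into one queryable place, then joining across them. All table and column names here are invented for the example; a real "data hub" does this at scale with catalogs and permissions on top.

```python
import csv
import io
import sqlite3

# Pretend this string arrived as a CSV dump from some other department.
CSV_EXPORT = "user_id,region\n1,EU\n2,US\n"

def build_hub() -> sqlite3.Connection:
    """Load two disparate sources into one SQL-queryable store."""
    conn = sqlite3.connect(":memory:")
    # Source 1: rows copied from a transactional database.
    conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)",
                     [(1, 9.5), (2, 12.0), (1, 3.0)])
    # Source 2: the CSV export, parsed and loaded alongside it.
    conn.execute("CREATE TABLE users (user_id INTEGER, region TEXT)")
    rows = [(int(r["user_id"]), r["region"])
            for r in csv.DictReader(io.StringIO(CSV_EXPORT))]
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn

def revenue_by_region(conn: sqlite3.Connection) -> dict:
    """The kind of cross-source join analysts want a single place to run."""
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM orders "
        "JOIN users USING (user_id) GROUP BY region"
    ))
```

The point of a platform is that someone else maintains the loading, freshness, and access control for hundreds of such sources instead of each analyst hand-rolling this.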
I think it's not a problem of scale in the CS sense, more the business sense where big organisations become complex and disorganised and need abstractions on top to make them workable.
We have Databricks at my company: $50M ARR, 150 employees, still growing at 15% YoY, with zero full-time data engineers (one data scientist and one DB admin co-manage everything there as part-time jobs; they each have their own full-time role). We have data from something like 100 transactional database tables, Zendesk, logs of every API call, every single event from every user in our mobile and web applications, banking data, calendar data, Google Play Store data, Apple App Store data, all in one place. We are a two-sided marketplace, and we can easily get a 360-degree view of our B2B customers and B2C customers, and measure employee productivity across all departments. It's that deep data understanding of our customers that powers our growth.
My team of 3 data scientists is able to support a culture of experimentation and data-informed decision making across the entire org.
And we do all that on a $30k annual Databricks spend. That's less than 1/5 the cost of one software engineer. Excellent value for money if you ask me.
I really struggle to imagine being able to do that any cheaper. How else could we engineer a data hub for all of our data, manage appropriate access & permissions, run complex calculations in seconds (yes, we have replaced overnight complex calculations done by engineering teams), and join data from so many disparate sources, at a total cost (tool + labour) under $80k/yr? I double dare you to suggest or find me a cheaper option for our use case.
Simple businesses don't need Databricks. One humongous Postgres handling operational transactions is what very simple businesses need.
You kill off all the open-source pieces; in turn, compliance is happy, and the CTO is happy because he has a maintenance contract and can blame other people if stuff goes wrong.
It's a way to get those pesky Python people to shut up
Oh, and a CTO is always valued more if he manages a $5 million Databricks budget, where he can prove his worth by showing the 5% discount he negotiated very well, than a $1 million whatever-else budget that would be best in class. Everybody wins.
makes for good boilerplate conversation while playing golf too
> who is the typical databricks customer?
The CTO of a "traditional" company who is responsible for "implementing digital transition".
My company is doing the dbx thing, and the best I can tell my manager is that I'm neutral on it.
My working theory is that the UI, a low-grade web-based SQL editor and catalog browser, is more integrated than the hodgepodge of tools that we were using before, and people may gain something from that. I've seen similar with in-house tools that collect ad-hoc/reporting/ETL into one app, and one should never underestimate the value that people give to the UI.
But we give up price-performance; the only way it can work is if we shrink the workload. So it's a cleanup of stale pipelines combined with a migration. Chaos, in other words.
I think the governance stuff might push it over the top for a lot of organisations; it's pretty well integrated with IAM providers, not only for structured/modelled data but also for workspaces for the data-sciencey stuff. Pretty much everything has permissions associated with it. When you have a big data engineering/science push on the back of the AI hype, I think it appeals to the cheque-writers to have something centralised and controlled.
Aside from that, I do get the feeling that most small and medium-sized companies have been oversold on it: they don't really have enough data to leverage a lot of the features, and a lot of the time they don't really have the skill to avoid shooting themselves in the foot. It's possible for a reporting analyst who is upskilling to learn the programming skills needed not to create a tangled web of Christmas lights, but it's not probable in most situations. There seems to be a whole cottage industry of consultancies now that purport to get you up and running, with limited actual success.
At least it's an incentive for companies to get their data in order and standardise on one place and a set of processes.
In terms of actual development, the notebook IDE feels like a big old turd to use, though, and it feels slow in general if you're at all used to local dev. People do kinda like these web-based tools though. Can't trust people all the time! There are VS Code and PyCharm extensions, but my team works mainly with notebooks at the moment, for good or ill, and the experience there is absolutely flaky dogshit.
I think it's possible to make some good stuff with it, and it's paying my bills at the moment, but a lot of the adoption may be doomed to failure lol