Comment by MarginalGainz

1 month ago

The saddest part about this article being from 2014 is that the situation has arguably gotten worse.

We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.

I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.

163 comments

MarginalGainz

jesse__ 1 month ago

I've done a handful of interviews recently where the 'scaling' problem involves something that comfortably fits on one machine. The funniest one was ingesting something like 1gb of json per day. I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.

I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.

kevmo314 1 month ago
The wildest part is they’ll take those massive machines, shard them into tiny Kubernetes pods, and then engineer something that “scales horizontally” with the number of pods.
- jesse__ 1 month ago
  
  Yeah man, you're running on a multitasking OS. Just let the scheduler do the thing.
  
  29 replies →
- pnt12 1 month ago
  
  This is especially aggravating when the os inside the container and the language runtimes are much heavier than the process itself.
  I've seen arguments for nano services (I wouldn't even call them micros services), that completely ignored that part. Split a small service in n tiny services, such that you have 10(os, runtime, 0.5) rather than 2(os, runtime, x).
  
  1 reply →
- andai 1 month ago
  
  I had to re-read this a few times. I am sad now.
- cyberpunk 1 month ago
  
  To be fair each of those pods can have dedicated, separate external storage volumes which may actually help and it’s def easier than maintaining 200 iscsi or more whatever targets yourself
- ahartmetz 1 month ago
  
  I think my brain hurts
- jayd16 1 month ago
  
  I mean, a large part of the point is that you can run on separate physical machines, too.
yndoendo 1 month ago
I recently had to parse 500MB to 2GB daily log files into analytical information for sales. Quick and dirty, the application would of needed 64GB RAM and work laptop only has 48GB RAM. After taking time cleaning it up, it was using under 1GB of RAM and worked faster by only retaining records in RAM if need be between each day.
It is not about what you are doing, it is always about how you do it.
This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would of took over 24 hours of processing time, after moving to semaphores with parallelization it took less than two hours to process all the information.
- maest 1 month ago
  
  > It is not about what you are doing, it is always about how you do it.
  It saddens me to see how the LinkedIn slop style is expanding to other platforms
  
  2 replies →
bauerd 1 month ago
In interviews just give them what they are looking for. Don't overthink it. Interviews have gotten so stupidly standardized as the industry at large copied the same Big Tech DSA/System Design/Behavioral process. And therefore interview processes have long been decoupled from the business reality most companies face. Just shard the database and don't forget the API Gateway
- tshaddox 1 month ago
  
  > In interviews just give them what they are looking for
  Unless, of course, you have multiple options and you don’t want to work for a company that’s looking for dumb stuff in interviews.
  
  1 reply →
- jesse__ 1 month ago
  
  Meh .. I've played that game; it doesn't work out well for anyone involved.
  I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years.
  I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company who's values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration.
  
  3 replies →
- mystifyingpoi 1 month ago
  
  This. Most interviewers don't want to do interviews, they have more important job to do (at least, that's what they claim). So they learn questions and approaches from the same materials and guides that are used by candidates. Well, I'm guilty of doing exactly this a few times.
- groundzeros2015 1 month ago
  
  Meh. as an interviewer I would always make it clear if we wanted to switch to “let’s pretend it doesn’t fit on a machine now”.
  Demonstrating competency is always good.
dehrmann 1 month ago
> but that's not the answer we wanted
You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you.
- jesse__ 1 month ago
  
  I've done that too and, in my experience, people that ask a scaling question that fits on a single machine don't have the capacity to have that nuanced conversation. I usually try to help the interviewer adjust the scale to something that actually requires many machines, but they usually don't get it.
  Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on.
  
  1 reply →
coliveira 1 month ago

Yes, but then how are these people going to justify the money they're spending on cloud systems?... They need to find only reasons to maintain their "investment", otherwise they could be held as incompetent when their solution is proven to be ineffective. So, they have to show that it was a unanimous technical decision to do whatever they wanted in the first place.
winrid 1 month ago

I've actually worked on distributed systems that were so broken, I created a script to connect to prod and just create the report from my laptop. My manager offered to buy me a second laptop for running the report since it was easier than getting approval from the architects to get rid of the distributed report system (it only created that one report).
colechristensen 1 month ago
Yeah I had this problem at a couple of times in startup interviews where the interviewer asked a question I happened to have expertise in and then disagreed with my answer and clearly they didn't know all that much about it. It's ok, they did me a favor.
It may or may not be related that the places that this happened were always very ethnically monotone with narrow age ranges (nothing against any particular ethnic group, they were all different ethnic monotones)
- jesse__ 1 month ago
  
  Hah yeah, that's a funny one, being able to run circles around the interviewer.
ytoawwhra92 1 month ago
> I explained, from first principals, how it fits, and received feedback along the lines of "our engineers agreed with your technical assessment, but that's not the answer we wanted, so we're going to pass". I've had this experience a good handful of times.
Probably a better outcome than being hired onto a team where everyone know you're technically correct but they ignore your suggestions for some mysterious (to you) reason.
- jesse__ 1 month ago
  
  Oh, absolutely.
anshumankmr 1 month ago
Though I do not know the situation AT the firm you were interviewing in, if there is some unexpected increase in data volume OR say a job fails on certain days or you need to do some sort of historical data load (>= 6 months of 1 gig of data per day), the solution for running it on a single VM might not scale. BUT again, interviews are partially about problem solving, partially about checking compliance at least for IC roles (IN my anecdotal experience).
That being said yeah I too have done some similar stuff where some data engineering jobs could be run on a single VM but some jobs really did need spark, so the team decision was to fit the smaller square peg into a larger square peg and call it a da.In fact, I had spent time refactoring one particular pivotal job to run as an API deployed on our "macrolith" and integrated with our Airflow but it was rejected, so I stopped caring about engineering hygiene.
- johndough 1 month ago
  
  (>= 6 months of 1 gig of data per day)
  You can parse JSON at several GB/s: https://github.com/simdjson/simdjson And you could scale that by one or two orders of magnitude with thread-based parallelism on recent AMD Epyc or Intel Xeon CPUs. So parsing alone should not pose a problem (maybe even sub-second for 6 months of data). We would need a more precise problem statement to judge whether horizontal scaling is needed.
  
  1 reply →
- jesse__ 1 month ago
  
  As other commentors pointed out, 1gb/day isn't a problem for storage and retroactive processing until you get to like, hundreds of years of data. You can chew through a few hundred TB of JSON data in a day, per core + nvme drive.
  Regardless, storage and retroactive processing wasn't part of the problem. The problem was explicitly "parse json records as they come in, in a big batch, and increment some integers in a database".
  I'm not going to figure out what the upper limit is on a single bare-metal machine, but you can be damn sure it's a metric fuck-ton higher than 1gb/day. You can do a lot with a 10TB of memory and 256 cores.
- wongarsu 1 month ago
  
  If we are talking about cloud VMs: sure, their cpu performance is atrocious and io can be horrible. This won't scale to infinity
  But if there's the option to run this on a fairly modest dedicated machine, I'd be comfortable that any reasonable solution for pure ingest could scale to five orders of magnitude more data, and still about four orders of magnitude if we need to look at historical data. Of course you could scale well beyond that, but at that point it would be actual work
- ahoka 1 month ago
  
  “6 months of 1 gig of data per day”
  Then you would need an enormous 2TB storage device. \s
badgersnake 1 month ago

This kind of bad interview is rife. It’s often more a case of guess what the interviewer thinks than come up with a good solution.
franciscop 1 month ago
I have a funny story I need to tell some day about how I could get a 4GB JSON loaded purely in the browser at some insane speed, by reading the bytes, identifying the "\n" then making a lookup table. It started low stakes but ended up becoming a multi-million internal project (in man-hours) that virtually everyone on the company used. It's the kind of project that if started "big" from the beginning, I'd bet anything it wouldn't have gotten so far.
Edit: I did try JSON.parse() first, which I expected to fail and it did fail BUT it's important that you try anyway.
- mr_toad 1 month ago
  
  Curious about which browser and hardware. In my experience browsers often choke on 0.5GB strings, or decide to kill the tab/proccess.
  
  1 reply →
ahartmetz 1 month ago
Every one of these cores is really fast, too!
- jesse__ 1 month ago
  
  yeah man, computers are completely bananacakes
ahoka 1 month ago

They wanted to see if you would be on board with their embezzlement scheme.
sharadov 1 month ago

Yes, yes but how are we going to get HA with one machine..
Fuck off ..you're 10 person startup with an MVP and no revenue stream needs customers first..
yieldcrv 1 month ago

“there’s no wrong answer, we just want to see how you think” gaslighting in tech needs to be studied by the EEOC, Department of Labor, FTC, SEC, and Delaware Chancery Court to name a few
let’s see how they think and turn this into a paid interview
jitl 1 month ago

1gb of json u can do in one parse ¯\_(ツ)_/¯ big batches are fast

pocketarc 1 month ago

I agree - and it's not just what gets you promoted, but also what gets you hired, and what people look for in general.

You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future.

Nobody's against it. So you end up in that situation, which even a basic home desktop would be more than capable of handling.

jrjeksjd8d 1 month ago
I have been the first (and only) DevOps person at a couple startups. I'm usually pretty guilty of NIH and wanting to develop in-house tooling to improve productivity. But more and more in my career I try to make boring choices.
Cost is usually not a huge problem beyond seed stage. Series A-B the biggest problem is growing the customer base so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC.
Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rats nest fiefdom of my own design.
The fact is that if 5k/month infra cost for a core part of the service sinks your VC backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you.
- woooooo 1 month ago
  
  The issue is when all the spending gets you is more complexity, maintenance, and you don't even get a performance benefit.
  I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman.
  
  10 replies →
- SJC_Hacker 1 month ago
  
  In my experience, that $5k/month easily blows up into $100k/month
pragma_x 1 month ago
I've seen the ramifications of this "CV first" kind of engineering. Let's just say that it's a bad time when you're saddled with tech debt solely from a handful of influential people that really just wanted to work elsewhere.
- bandrami 1 month ago
  
  I'm largely a stranger to the js world but from the outside it sure looks like projects are sharded so as to maximize npm contribution count
exac 1 month ago

This. It is resume-driven development. Especially at startups where the engineers aren't compensated well enough or don't believe the produce can succeed.
atomicnumber3 1 month ago
I'm convinced k8s is a conspiracy by bigtech to suppress startups.
- hunterpayne 1 month ago
  
  So its the EJBs of this age then?

wccrawford 1 month ago

I've spent my last 2 decades doing what's right, using the technologies that make sense instead of the techs that are cool on my resume.

And then I got laid off. Now, I've got very few modern frameworks on my resume and I've been jobless for over a year.

I'm feeling a right fool now.

port11 1 month ago

I’m not hiring anymore, but when I was, all I wanted to find was someone that knew the fundamentals (and was a good ’attitude fit’ as per the similarly titled book). Sorry @wccrawford, I wish we could have more places that value slow, boring tech — aside from banking/insurance?
hackthemack 1 month ago

I have hung on to my job for many years now because of being in a similar situation in regards to trying to do the right thing and the fear of not being hire-able.
There is something wrong with the industry in chasing fads and group think. It has always been this way. Businesses chased Java in the late 90s, early 00s. They chased CORBA, WSDL, ESB, ERP and a host of other acronyms back in the day.
More recently, Data Lake, Big Data, Cloud Compute, AI.
Most of the executives I have met really have no clue. They just go with what is being promoted in the space because it offers a safety net. Look, we are "not behind the curve!". We are innovating along with the rest of the industry.
Interviews do not really test much for ability to think and reason. If you ran an entire ISP, if you figured out, on your own, without any help, how to shard databases, put in multiple layers of redundancy, caching... well, nobody cares now. You had to do it in AWS or Azure or whatever stack they have currently.
Sadly, I do not think it will ever be fixed. It is something intrinsic to human nature.
ted_dunning 1 month ago
You can fix that with some open source work and home projects.
Then, in the interview, you say the first line of your posting here and the last and then add that you fixed the problem with intensive study.
- wccrawford 1 month ago
  
  Yeah, I probably need to push this harder now. I did actually join 1 project recently and got to the point that I felt I could add 1 more common thing to my resume, and that felt good. (Getting something done felt good, too.)
  But getting to the point that I feel confident in certain frameworks is going to be hard. I'll figure it out somehow, I'm sure.
fHr 1 month ago
This exactly, actual doers are most of the time not rewarded meanwhile the AWS senior sucking Jeffs wiener specialist gets a job doing nothing but generating costs and leave behind more shit after his 3 years moving the ladder up to some even bigger bs pretend consulting job at an even bigger company. It's the same bs mostly for developers. I rewrite their library from TS to Rust and it gains them 50x performance increases and saves them 5k+ a week over all their compute now but nobody gives a shit and I do not have a certification for that to show off on my LinkedIn. Meanwhile my PM did nothing got paid to do some shity certificate and then gets the credit and the certificate and pisses of to the next bigger fish collecting another 100k more meanwhile I get a 1k bonus and a pat on the shoulder. Corporate late stage capitalism is complete fucking bs and I think about becoming a PM as well now. I feel like a fool and betrayed. Meanwhile they constantly threaten our Team to lay it off or outsource it as they say we are to expensive in a first world country and they easily find as good people in India etc. What a time to be alive.
- antonvs 1 month ago
  
  > saves them 5k+ a week over all their compute
  If you're willing and able to promote yourself internally, you can make people give a shit, or at least publicly claim they do. That's 260k+ per year, and even big businesses are going to care about that at some level, especially if it's something that can be replicated. Find 10 systems you can do that with, and it's 2.6m+ per year.
  But, if you don't want to play the self-promotion game, yeah someone else is going to benefit from your work.
ahartmetz 1 month ago

Try Rust? The system programming world isn't very bullshit-infested and Rust is trendy (which is good for a change), also employers can't realistically expect many years of Rust experience.
Need training and something to show? Contribute to some FOSS project.

nicoburns 1 month ago

> datasets that often fit entirely in RAM.

Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good.

dapperdrake 1 month ago

Oldie-but-goldy:
https://yourdatafitsinram.net/
plagiarist 1 month ago

You don't really need to ignore the price spikes even. You can still buy more than 128Gb RAM on a machine with the $5k from one of the months.
newyankee 1 month ago

I have actually worked in a company as a consultant data guy in a non technical team, I had a 128 GB PC 10 years back, and did everything with open source R then, and it worked ! The others thought it was wizardry

reval 1 month ago

I’ve seen this pattern play out before. The pushback on simpler alternatives seems from a legitimate need for short time to market from the demand some of the equation and a lack of knowledge on the supply side. Every time I hear an engineer call something hacky, they are at the edge of their abilities.

networkadmin 1 month ago
[flagged]
- acdha 1 month ago
  
  systemd would be a derail even if you weren’t misrepresenting the situation at several levels. Experienced sysadmins in my experience were the ones pushing adoption because they had to clean up the messes caused by SysV’s design limitations and flaws, whereas in this case it’s a different scenario where the extra functionality is both unneeded and making it worse at the core task.
  
  11 replies →
- dapperdrake 1 month ago
  
  Eternal September
  
  3 replies →

RobinL 1 month ago

Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads I imagine it's often faster than command line tool, and with easier to understand syntax

briHass 1 month ago
More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.
I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (read data into objects in memory) is much more challenging to view the data. Debugging tools to inspect the objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
- groundzeros2015 1 month ago
  
  Yep. It’s literally what SQL was designed for, your business website can running it… the you write a shell script to also pull some data on a cron. It’s beautiful
mrgoldenbrown 1 month ago
IMHO the main point of the article is that typical unix command pipeline pipeline IS parallelized already.
The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with.
- chuckadams 1 month ago
  
  Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily".
  On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
  
  2 replies →

czhu12 1 month ago

I think it’s not so much engineers actually setting up a distributed compute, as it is dropping a credit card into a paid cloud service, which behind the scenes sets up a distributed compute cluster and bills you for the compute in an obfuscated way, then gives a 20% discount + SSO if you sign up for annual enterprise plan.

This kind of practice is insidious because early on, they charge $20/month to get started on the first 100mb of log ingestion, and you can have it up and running in 30 seconds with a credit card. Who would turn that down?

Revisit that set up 2 years later and it’s turned into a 60k/y behemoth that no one can unwind

attractivechaos 1 month ago

On the contrary, the key message from the blog post is not to load the entire dataset to RAM unless necessary. The trick is to stream when the pattern works. This is how our field routinely works with files over 100GB.

willtemperley 1 month ago

Yep. The cloud providers however always get paid, and get paid twice on Sunday when the dev-admins forget to turn stuff off.

It’s the same story as always, just it used to be Oracle certified tech, now it’s the AWS tech certified to ensure you pay Amazon.

lormayna 1 month ago

For a dasaset that live in RAM, the best solution are DuckDB or clickhouse-local. Using SQLish data is easier than a bunch of bash script and really powerful.

zX41ZdbW 1 month ago
Though ClickHouse is not limited to a single machine or local data processing. It's a full-featured distributed database.
- exagolo 1 month ago
  
  Another alternative is Exasol that is factors (>10x) faster than Clickhouse and scales much better for complex analytics workloads that joins data. There is a free edition for personal use without data limit that can run on any number of cluster nodes.
  If you just want to read and analyze single table data, then Clickhouse or DuckDB are perfect.
  Disclaimer: I work at Exasol

hmokiguess 1 month ago

This reminds me of this reddit comment from a long time ago: https://www.reddit.com/r/programming/comments/8cckg/comment/...

data-ottawa 1 month ago

Airflow and dbt serve a real purpose.

The issue is you can run sub tib jobs on a few small/standard instances with better tooling. Spark and Hadoop are for when you need multiple machines.

Dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely.

edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars.

But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs.

After say 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to payoff.

adammarples 1 month ago

Agreed, airflow and dbt have literally nothing to do with the size of the data and can be useful, or overkill, at any size. Dbt just templates the query strings we use to query the data and airflow just schedules when we query the data and what we do next. The fact that you can fit the whole dataset in duckdb without issue is kind of separate to these tools, we still need to be organised about how and when we query it.
x0x0 1 month ago

dbt is super useful for building a dag and managing pieces of it that update on different schedules. eg with one dataset that's refreshed monthly and another daily, you can only rebuild the daily one unless the slower-cadence input has a new update.

petcat 1 month ago

> a robust bash script

These hardly exist in practice.

But I get what you mean.

sam_lowry_ 1 month ago
Yoy don't. It's bash only because the parent process is bash, but otherwise it's all grep, sort, tr, cut and othe textutils piped together.
- mjevans 1 month ago
  
  awk can do some heavy lifting too if the environment is too locked down to import a kitchen sink of python modules.

dgxyz 1 month ago

Our lot burns a fortune on snowflake every month but no one is using it. Not enough data is being piped into it and the shitty old reports we have which just run some SQL work fine.

It looked good on someone’s resume and that was it. They are long gone.

kcexn 1 month ago

Because developers are incentivized to have marketable software skills. Not marketable build things that are cheap and profitable skills.

Moore's law was supposed to make it simpler and cheaper to do more computationally expensive tasks. But in the meantime, everyone kept inflating the difficulty of a task faster than Moore could keep up.

I think some of this is because of the incredible amounts of capital that startups seem to be able to acquire. If startups had to demonstrate profitability before they were given any money to scale, the story would be very different I think.

jmye 1 month ago

> because setting up a 'Modern Data Stack' is what gets you promoted

It’s not just that, it’s that you better know their specific tech stack to even get hired. It’s a lot of dumb engineering leaders pretending that AWS, Azure and Snowflake are such wildly different ecosystems that not having direct experience in theirs is disqualifying (for pure DE roles, not talking broader sysadmin).

The entire data world is rife with people who don’t have the faintest clue what they’re doing, who really like buzzwords, and who have never thought about their problem space critically.

rawgabbit 1 month ago

Well. I try for a middle ground. I am currently ditching both airflow and dbt. In Snowflake, I use scheduled tasks that call stored procedures. The stored procedures do everything I need to do. I even call external APIs like Datadog’s and Okta’s and pull down the logs directly into snowflake. I do try to name my stored procedures with meaningful names. I also add generous comments including urls back to the original story.

rawgabbit 1 month ago

I forgot to mention in Snowflake, besides chron scheduled tasks, you can add dependent tasks that only run if the previous task succeeded. I have 40 tasks chained together that way. Each of my task calls a stored procedure. Within each procedure, I have Try Catch and a catch-all clause that Raiseerror.

1vuio0pswjnm7 1 month ago

"I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'."

Also seen strange responses from HN commenters when it's mentioned that bash is large and slow compared to ash and bash is better suited for use as an interactive shell whereas ash is better suited for use as a non-interactive shell, i.e., a scripting shell

I also use ash (with tabcomplete) as an interactive shell for several reasons

awesome_dude 1 month ago

ENG are building what MGMT has told them to build for, the scale they want, not the scale they have

vjvjvjvjghv 1 month ago

I see this at work too. They are ingesting a few GB per day but running the data through multiple systems. So the same functionality we delivered with a python script within a week now takes months to develop and constantly breaks.

jitl 1 month ago

On the other hand, now we have duckdb for all the “small big data”, and a slew of 10-100x faster than Java equivalent stuff in the data x rust ecosystem, like DataFusion, Feldera, ByteWax, RisingWave, Materialize etc

groundzeros2015 1 month ago
The point of the article is those don’t actually work that well.
I guarantee those rust projects have spent more time playing with rust and library design than the domain problem they are trying to solve.
- jitl 1 month ago
  
  None of the systems I mentioned existed at the time the article was published. I think the author would love duckdb which is a very speedy CLI SQL thingy that reads and writes data in all sorts of formats. It fits in great with other Unix CLI stuff.
  Many of the projects I mentioned you could see as a response to OP and the 2015 “Scalability, but at what COST?” paper which benchmarked distributed systems to see how many cores they need to beat a single thread. (https://news.ycombinator.com/item?id=26925449)
  
  2 replies →
hunterpayne 1 month ago
I call BS on those Rust 10-100x claims. Rust and Java are roughly equal in performance. It is just that there are a lot of old NoSQL frameworks in Java which are trash. I also checked out those companies, some of which are doing interesting stuff. None claim things are 100x faster because of Rust. You just hurt your credibility when you say such clearly false things. That's how you end up with a Hadoop cluster which is 236x slower than a batch script.
PS None of the companies you linked seem to be using a datapath architecture which is the key to the highest level of performance
- jitl 1 month ago
  
  It wasn’t my intention to say “this stuff is 100x faster because rust”. DuckDB is C++. My intention was to draw distinction between the Java/Hadoop era of cluster and data systems, and the 2020s era of cluster and data systems, much of which has designs informed by stuff like this article / “Scalability but at what COST?”. I guess instead of “faster” I should say “more efficient”.
  For example, the Kafka ecosystem tends to use Avro as the data transfer serialization, which needs a copy/deserialization step before it can be used in application logic. Newer stream systems like Timely tend to use zero-copy capable data transfer formats (timely’s is called Abomination) but it’s the same idea in CapnProto or Flatbuffers - it’s infinity faster to not copy the data as you decode! In my experience this kind of approach is more accessible in systems languages like C++ or Rust, and harder to do in GC languages where the default approach to memory layout and memory management is “don’t worry about it.”

mritchie712 1 month ago

happy middle ground: https://www.definite.app/ (I'm the founder).

datalake (DuckLake), pipelines (hubspot, stripe, postgres), and dashboards in a single app for $250/mo.

marketing/finance get dashboards, everyone else gets SQL + AI access. one abstraction instead of five, for a fraction of your Snowflake bill.

shiandow 1 month ago

If airflow is a layer of abstraction something is wrong.

Yes it is an additional layer, but if your orchestration starts concerning itself with what it is doing then something is wrong. It is not a layer on top of other logic, it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them.

If you don't insist on doing heavy compitations within the airflow worker it is dirt cheap. If it's something that can easily be done in bash or python you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it.