Comment by Ozzie_osman
8 months ago
I feel like everyone's journey with Kafka ends up being pretty similar. Initially, you think "oh, an append-only log that can scale, brilliant and simple," then you try it out and realize it is far, far from simple.
I'm not a fan or an anti-fan of Kafka, but I do wonder about the hate it gets.
We use it for streaming tick data, system events, order events, etc., into kdb. We write to Kafka and forget. The messages are persisted, and we don't have to worry if kdb has an issue. Out-of-band consumers read from the topics and persist to kdb.
In several years of doing this we haven't really had any major issues. It does the job we want. Of course, we use the AWS managed service, so that simplifies quite a few things.
I read all the hate comments and wonder what we're missing.
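For concreteness, the out-of-band consumer side is basically just a poll loop that writes into kdb and only commits offsets once the write succeeds. A sketch (topic names, the broker endpoint, and the persistToKdb helper are placeholders, not our real code):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class KdbSink {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "b-1.msk.example:9092");  // placeholder MSK endpoint
            props.put("group.id", "kdb-sink");
            props.put("enable.auto.commit", "false");                 // commit only after kdb write succeeds
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("tick-data", "order-events"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        persistToKdb(r);   // hypothetical helper; if kdb is down we simply stop committing
                    }
                    consumer.commitSync(); // offsets only advance once kdb has the data
                }
            }
        }

        static void persistToKdb(ConsumerRecord<String, String> record) {
            // placeholder for the actual kdb+ insert
        }
    }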
That’s my experience too. I’ve deployed it more than ten times as a consultant and never really understood the reputation for complexity. It “just works.”
I've deployed it a bunch of times and, crucially, maintained it thereafter. It's very complex, especially when troubleshooting pathological behavior or recovering from failures, and I don't see why anyone with significant experience with Kafka could reasonably claim otherwise.
Kafka is perhaps the most aptly named software I've ever used.
That said, it's rock solid and I continue to recommend it for cases where it makes sense.
How long is your Kafka down when you cut the cable and it needs to fail over?
What happens if the Kafka node fails?
"the" node? Kafka is a cluster of multiple nodes.
I'm wondering how much of that is bad developer UX and defaults, and how much of that is inherent complexity in the problem space.
As the article outlines, partitions are not that useful for most people. Instead of removing them, how about putting them behind a feature flag, i.e. off by default? That would ease 99% of users' problems.
The next point in the article that resonates with me is the lack of proper schema support. That's just bad UX again, not inherent complexity of the problem space.
On the testing side, why do I need to spin up a Kafka testcontainer? Why is there no in-memory Kafka server that I can use for simple testing purposes?
> Why is there no in-memory Kafka server that I can use for simple testing purposes?
Take a look at Debezium's KafkaCluster, which is exactly that: https://github.com/debezium/debezium/blob/main/debezium-core....
It's used within Debezium's test suite. Check out the test for this class itself to see how it's being used: https://github.com/debezium/debezium/blob/main/debezium-core...
I think it's just horrible software built on great ideas, sold on a false premise ("this is a generic message queue, and if you don't use it you can't scale").
It's not just about the scaling, it's about solving the "doing two things" problem.
If you take action a, then action b, your system will throw 500s fairly regularly between those two steps, leaving your user in an inconsistent state. (a = pay money, b = receive item). Re-ordering the steps will just make it break differently.
If you stick both actions into a single event ({userid} paid {money} for {item}) then "two things" has just become "one thing" in your system. The user either paid money for item, or didn't. Your warehouse team can read this list of events to figure out which items to ship, and your payments team can read this list of events to figure out users' balances and owed taxes.
(You could do the one-thing-instead-of-two-things using a DB instead of Kafka, but then you have to invent some kind of pub-sub so that callers know when to check for new events.)
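A rough sketch of what publishing that single event could look like (topic name, key, and payload shape are invented here; keyed by item id so a downstream monitor can match paid and shipped events by key):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class OrderPaidPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One event instead of two actions: "{userid} paid {money} for {item}".
                // Payments and warehouse each consume this topic independently.
                String event = "{\"userId\":\"u-42\",\"amount\":19.99,\"item\":\"sku-123\"}";
                producer.send(new ProducerRecord<>("order-paid", "sku-123", event));
            }
        }
    }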
Also it's silly waiting around to see exceptions build up in your dev logs, or for angry customers to reach out via support tickets. When your implementation depends on publishing literal events of what happened, you can spin up side-cars which verify properties of your system in (soft) real-time. One side-car could just read all the ({userid} paid {money} for {item}) events and ({item} has been shipped) events. It's a few lines of code to match those together and all of a sudden you have a monitor of "Whose items haven't been shipped?". Then you can debug-in-bulk (before the customers get angry and reach out) rather than scour the developer logs for individual userIds to try to piece together what happened.
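And the side-car monitor really is only a few lines (again, topic names are made up, and it assumes both topics are keyed by item id so paid and shipped events can be matched by key):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class UnshippedItemsMonitor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "unshipped-monitor");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // item id -> the "paid" event we haven't yet seen a matching "shipped" event for
            Map<String, String> paidNotShipped = new HashMap<>();

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("order-paid", "item-shipped"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> r : records) {
                        if (r.topic().equals("order-paid")) {
                            paidNotShipped.put(r.key(), r.value());   // outstanding until shipped
                        } else {
                            paidNotShipped.remove(r.key());           // shipped: no longer outstanding
                        }
                    }
                    // Anything left in the map has been paid for but not shipped.
                    paidNotShipped.forEach((item, event) ->
                            System.out.println("not shipped yet: " + item + " " + event));
                }
            }
        }
    }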
Also, read this thread https://news.ycombinator.com/item?id=43776967 from a day ago, and compare this approach to what's going on in there, with audit trails, soft-deletes and updated_at fields.
I kind of agree on the horrible software bit, but what do you use instead? And can you convince your company to use that, too?
> Why is there no in-memory Kafka server that I can use for simple testing purposes?
https://github.com/embeddedkafka/embedded-kafka
It's for Scala. I'm trying to do something similar in Java but haven't got time yet.
I was working on a Node.js project; I saw that one, but it's only for the JVM.
Yeah...
It took 4 years to properly integrate Kafka into our pipelines. Everything, and I mean everything, is complicated with it: cluster management, numerous semi-tested configurations, etc.
My final conclusion with it is that the project just doesn't really know what it wants to be. Instead it tries to provide everything for everybody, and ends up being an unbelievably complicated mess.
You know, there are systems that know what they want to be (Amazon S3, Postgres, etc.), and then there are systems that try to eat the world (Kafka, k8s, systemd).
> that the project just doesn't really know what it wants to be
It's a distributed log? What else is it trying to do?
Calling it a distributed log may just be a reductio ad absurdum.
> You know, there are systems that know what they want to be (Amazon S3, Postgres, etc.), and then there are systems that try to eat the world (Kafka, k8s, systemd).
I am not sure about this taxonomy. K8s, systemd, and (I would add) the Linux kernel are all taking on the ambitious task of central, automatic orchestration of general purpose computing systems. It's an extremely complex problem and I think all those technologies have done a reasonably good job of choosing the right abstractions to break down that (ever-changing) mess.
People tend to criticize projects with huge scope because they are obviously complex, and complexity is the enemy, but most of the complexity is necessary in these cases.
If Kafka's goal is to be a general purpose "operating system" for generic data systems, then that explains its complexity. But it's less obvious to me that this premise is a good one.
systemd knows very well what it wants to be, they just don't tell anyone.
Its real goal is to make Linux administration as useless as Windows so RH can sell certifications.
Tell me the output of systemctl is not as awful as opening the Windows service panel.
Tell me the systemctl output isn't more useful than the per-distro bash mess.
There have been two service panels in Windows since Windows 8, and they are quite different...
The worst part of Kafka, for me, is managing the cluster. I don't really like the partitioning and the almost hopelessness that ensues when something goes wrong. Recovery is really tricky.
Granted, it doesn't happen often if you plan correctly, but the possibility of partitioning and replication going wrong makes updates and upgrades nightmare fuel.
There was an old design I encountered in my distributed computing class, and noticed in the wild afterwards having been primed to look for it, where you break ties in distributed systems with a supervisor whose only purpose is to break ties. In a system that only needs 2 or 4 nodes to satisfy demand, running a 3rd or 5th node only to break ties adds a lot of operational cost. So you created a process that understood the protocol but did not retain the data, whose sole purpose was to break split-brain ties.
Then we settled into an era where server rooms grew, workloads demanded horizontal scaling, and for high-profile users running an odd number of processes was a rounding error, so we just stopped doing it.
But we also see this issue re-emerge with dev sandboxes. Running three copies of Kafka, Redis, Consul, Mongo, or, god forbid, all four, is just a lot for one laptop, and 50% more EC2 instances if you spin it up in the cloud, one cluster per dev.
I don't know Kafka well, so I'll stick with Consul as a mental exercise. If you take something like Consul, the voting logic should be pretty well contained. It's the logic for catching up a restarted node and serving the data that's the complex part.
Have a look at Strimzi, a K8s operator that gives you a mostly-managed Kafka experience.
Now you have two problems.
Once you pick an "as simple as possible, but no simpler" solution, it triggers Dunning-Kruger in a lot of people who think they can one-up you.
There was a time in my early-to-mid career when I had to defend my designs a lot, because people thought my solutions were shallower than they were and didn't understand that the "quirks" were covering unhappy paths. They were often load-bearing, 80/20 artifacts.
Having worked with it only a little on occasion, I found that the problem lies in its atrocious documentation.
I get it: there are lots of knobs and dials I can adjust to tune the cluster. A one-line description for each item is often insufficient to figure out what the item is doing. You can get a sense for the problem eventually if you spin up a local environment and go through each item one by one to see what it does, but that's super time-consuming.
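To give a flavour of the knobs I mean, here are a few well-known broker/topic settings whose one-line docs hide a lot of failure-mode nuance (values are illustrative, not recommendations):

    num.partitions=6                      # default partition count for auto-created topics
    min.insync.replicas=2                 # acks=all writes fail if fewer replicas are in sync
    unclean.leader.election.enable=false  # true trades possible data loss for availability
    log.retention.hours=168               # how long data is kept before segments are deleted
    log.segment.bytes=1073741824          # segment size; affects retention granularity and recovery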
>Initially, you think "oh, an append-only log that can scale, brilliant and simple"
Really? I got scared by Kafka by just reading through the documentation.