Comment by jamesblonde

12 hours ago

I gave a talk at PyData Berlin on how to build your own TikTok recommendation algorithm. The TikTok personalized recommendation engine is the world's most valuable AI. It's TikTok's differentiation. It updates recommendations within 1 second of you clicking - at human perceivable latency. If your AI recommender has poor feature freshness, it will be perceived as slow, not intelligent - no matter how good the recommendations are.

TikTok's recommender is partly built on European technology (Apache Flink for real-time feature computation), along with Kafka and distributed model training infrastructure. The Monolith paper is misleading in suggesting that 'online training' is the key. It is not. The key is that your clicks are made available as features for predictions in less than 1 second. You need a per-event stream processing architecture for this (like Flink; Feldera would be my modern choice as an incremental streaming engine).

* https://www.youtube.com/watch?v=skZ1HcF7AsM

* Monolith paper - https://arxiv.org/pdf/2209.07663
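
To make the "clicks become features within a second" point concrete, here is a minimal in-memory sketch of per-event feature updates. It is not TikTok's design and not Flink or Feldera code; the class name, the window size and the "share of recent clicks per category" feature are all assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class FreshFeatureStore:
    """Toy per-event feature store (hypothetical): each click updates features at once."""

    def __init__(self, window=5):
        # Keep only the most recent `window` clicked categories per user.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def on_click(self, user_id, category):
        # Called once per event, not once per batch, so there is no batch delay.
        self._recent[user_id].append(category)

    def features(self, user_id):
        # Read path used at prediction time: category -> share of recent clicks.
        recent = self._recent.get(user_id)
        if not recent:
            return {}
        return {c: recent.count(c) / len(recent) for c in set(recent)}


store = FreshFeatureStore(window=4)
store.on_click("u1", "dogs")
store.on_click("u1", "cats")
print(store.features("u1"))
```

The point of the sketch is only the ordering: the write happens per event, so the very next feature read already reflects the click.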

I have to say, it is _extremely_ impressive when a tiktok I watched reminds me of some other tiktok, so I go and search for a very loose description of the tiktok, and the first result is 95% of the time what I wanted to find.

I don't think any single other platform has as good a search feature as TikTok does.

  • oh wow, you're really lucky. around my friend groups who use tiktok, the main complaint is how bad the search is. unfortunately for us, getting a specific video is almost impossible =(

I noticed YouTube Shorts also seems to update the feed based on how long you watched the last video. If you're scrolling quickly and then stop to watch a dog video long enough, the next one is likely to be another dog video.
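
The dwell-time behavior described above could be sketched roughly as follows. This is a guess at the general shape, not how YouTube or any other platform actually does it; the decay factor, the watch-fraction weighting and both function names are assumptions.

```python
def update_interest(scores, category, watched_s, duration_s, decay=0.5):
    """Return new per-category scores after one view (decay and weights are assumed)."""
    # Fraction of the video actually watched, capped at 1.0 for rewatches.
    watch_fraction = min(watched_s / duration_s, 1.0)
    # Older signals fade so the most recent view dominates.
    updated = {c: s * decay for c, s in scores.items()}
    updated[category] = updated.get(category, 0.0) + watch_fraction
    return updated


def next_category(scores):
    # Pick the category with the highest current score.
    return max(scores, key=scores.get)


scores = {}
scores = update_interest(scores, "cats", watched_s=1, duration_s=20)   # quick swipe
scores = update_interest(scores, "news", watched_s=1, duration_s=20)   # quick swipe
scores = update_interest(scores, "dogs", watched_s=18, duration_s=20)  # long watch
print(next_category(scores))
```

With a strong decay like this, a single long watch outweighs several quick swipes, which matches the "one dog video and the next one is a dog video" experience.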

  • I’ve noticed the same thing and this creates such a negative user experience. Every short is a reaction test and if I fail, I get slop. Makes the whole experience very jarring (for better or for worse).

  • Facebook does the same. The longer I dwell on an image post, the more likely the next batch of posts is to be similar.

    • The right way to look at these networks is that people are being trained by the algorithm, not the other way around. The ultimate goal is to elicit behaviors in humans, normally to spend more time and spend more money in the platform, but also for other goals that may be designed by the owners of the network.

    • Is amazon using the same thing??? I can't count the number of times I am getting recommended the EXACT same type of product I just purchased.

  • One of my gripes with youtube at the moment is that they break my adblock filters to remove shorts more often than they break the filters stopping the actual ads.

Flink is too slow for this.

If by features you mean tracking state per user, that stuff can be tracked without Flink insanely fast with Redis as well.

If you're saying they don't have to load data to update the state, I don't see how these states are massive enough to require in-memory updates, and even if they are, you could just do in-memory updates without Flink.

Similarly, any consumer will have to deal with batches of users and pipelining.

Flink is just a bottleneck.

If they actually use Flink for this, it's not the moat.
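
As a rough illustration of the "per-user state, updated in memory, in batches" argument, here is a small sketch. It uses a plain dict as a stand-in for the state store; `apply_batch` and the event shape are hypothetical. With Redis, the per-user update would presumably map to something like `HINCRBY` calls sent through a pipeline, but that part is not shown or tested here.

```python
from collections import Counter, defaultdict


def apply_batch(state, events):
    """Fold a batch of (user_id, category) events into per-user counters."""
    # Group first so each user's state is touched once per batch.
    per_user = defaultdict(Counter)
    for user_id, category in events:
        per_user[user_id][category] += 1
    # One merge per user; this is the step a pipelined store write would replace.
    for user_id, delta in per_user.items():
        state.setdefault(user_id, Counter()).update(delta)
    return state


state = {}
apply_batch(state, [("u1", "dogs"), ("u2", "cats"), ("u1", "dogs")])
print(state["u1"]["dogs"])
```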

  • Yea, the Monolith paper by Bytedance uses Flink but they only say it's in use for their B2B ecommerce optimization system. Maybe this is intentional ambiguity, but I'd believe that they wouldn't rely on something like Flink for their core TikTok infrastructure.

    My hunch is we start to learn a lot more about the core internals as Oracle tries to market to B2B customers, as Oracle is wont to do!

    • Flink is not really a performance choice; it's bloat for throwing software at problems as fast as possible. I don't think there's any benchmark demonstrating insane capabilities per machine. I definitely couldn't get it to any numbers I liked, given the other stream processing / state processing engines that exist (if compute and in-memory state management are the goal). Pretty sure any pathway that touches RocksDB slows everything down to 1-10k events per second, if not less.

      The problem of finding out which video is next, by immediately taking into account the recent user context (and other user context) is completely unrelated to what Flink does -- exactly-once state consistency, distributed checkpoints, recovery, event-time semantics, large keyed state. I would even say you don't want a solution to any of the problems Flink solves, you want to avoid having these problems.

I’m happy to see that Flink is in this stack, I wish that Pulsar was as well instead of Kafka.

It's interesting how they found out that the "lifetime" of features is a feature by itself. Meta features are real.
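
One way to picture a feature "lifetime" is a table whose entries expire after a period of inactivity. The sketch below is only that picture; whether it matches what the paper means is an assumption, and `ExpiringTable`, `ttl_s` and the explicit `now` argument are hypothetical names used to keep the example deterministic.

```python
class ExpiringTable:
    """Toy key-value table whose entries expire after `ttl_s` seconds of inactivity."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._rows = {}  # key -> (value, last_seen timestamp)

    def put(self, key, value, now):
        self._rows[key] = (value, now)

    def get(self, key, now):
        row = self._rows.get(key)
        if row is None or now - row[1] > self.ttl_s:
            self._rows.pop(key, None)  # lazily evict stale entries
            return None
        self._rows[key] = (row[0], now)  # a read counts as activity
        return row[0]


table = ExpiringTable(ttl_s=10)
table.put("video_42", [0.1, 0.3], now=0)
print(table.get("video_42", now=5))   # still fresh
print(table.get("video_42", now=20))  # idle for 15 s, so expired
```

The `ttl_s` value is the tunable "lifetime" here: a short one keeps the table small and fresh, a long one keeps rarely seen keys around.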

TikTok's differentiation is the userbase of all teenagers in the world.

  • But go just one layer deeper to 'why is every teenager using Tiktok' and the primary answer once again becomes 'Tiktok's recommendation engine'

    • I'm not a TikTok user, but I'm assuming the recommendation engine is there to keep eyeballs on more ads for longer. Maybe we should be regulating how often and how many ads can be shown on social media, especially to teens and kids.

    • No, the primary answer is "teenagers do what other teenagers do". Remember, we are advanced apes, no more, no less.

      There is this curious word "influencer" which everyone uses but few ever think about what it really means.

  • It also provides different opportunities for growth compared to other social media. A video that gets over half a million views on TikTok may not get 5 thousand on Youtube, or even 10 views on Instagram or Facebook.

    • Isn't the inverse true though? It's not as if nobody's watching YouTube, it's just that different videos are popular there.