
Comment by geertj

4 days ago

I want to build a "tokio for C++" where the user can run CPU bound tasks freely without having to worry about introducing tail latencies. For now I gave it the code name Rio, but that may change.

This is at the very early stages where I have a design sketch and some experiments that validate the design. Below is the README:

Rio is an experimental C++ async framework. The goal is a lightweight framework for writing C++ server applications that is easy to use and delivers low, consistent latencies.

Today, async frameworks that focus on efficiency typically use one of two architectures:

1. Shared-nothing architectures, also called thread-per-core. This approach is used by frameworks such as Seastar and Boost.Asio. In a shared-nothing architecture, each worker thread runs its own event loop and is intended to run on its own dedicated core. The application is architected to shard its workload across multiple workers, with only infrequent communication between them. When a task performs CPU-bound work, that work must be explicitly offloaded to a thread pool, as it would otherwise block other tasks from running on the current worker (often referred to as a "reactor stall"); see the sketch after this list.

2. Work-stealing architectures. This architecture is used by frameworks such as Tokio. Here there are also multiple worker threads, each running its own event loop. When a specific worker is overloaded or runs a blocking task, other threads can execute its ready tasks. This goes some way toward preventing reactor stalls. However, even though other threads can steal ready tasks, they do not poll the stalled worker's event loop for new readiness events. This means a task that does not yield back to the runtime still increases latencies for the other requests assigned to that worker.
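
To make item 1 concrete, here is a minimal sketch of the explicit offload pattern, using Boost.Asio's io_context, thread_pool and post() (real Asio facilities); compress_block() and on_compressed() are hypothetical application functions, not part of any framework mentioned above:

```cpp
#include <boost/asio.hpp>
#include <cstddef>
#include <utility>
#include <vector>

namespace asio = boost::asio;

// Hypothetical application functions, used only for illustration.
std::vector<std::byte> compress_block(std::vector<std::byte> in) { return in; } // CPU-bound step
void on_compressed(std::vector<std::byte>) {}                                   // continuation

void handle_request(asio::io_context& io, asio::thread_pool& cpu_pool,
                    std::vector<std::byte> payload) {
    // The CPU-bound step is explicitly offloaded so the event-loop thread is not stalled...
    asio::post(cpu_pool, [&io, payload = std::move(payload)]() mutable {
        auto out = compress_block(std::move(payload));
        // ...and the result is posted back to the event loop to continue the protocol.
        asio::post(io, [out = std::move(out)]() mutable {
            on_compressed(std::move(out));
        });
    });
}
```

Forgetting the outer post(), or misjudging how expensive compress_block() really is, is exactly the failure mode described above: the worker's event loop stalls for the duration of the call.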

The thesis for Rio is that in real-world server applications, it becomes increasingly hard to ensure you yield back to the event loop frequently. In particular, there are many CPU-bound tasks that server applications commonly perform, such as parsing protocols or performing encryption and compression. If a task takes less than ~10 microseconds, it is often not worth offloading it to a thread pool, as the system call and synchronization overhead will take more time than the task itself. Additionally, scattering thread-pool offloads through the code makes development harder, especially in larger teams with individuals of different experience levels. The result is that either too much work is pushed onto thread pools or too little, and the net effect is less consistent latencies.
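
As a rough illustration of that threshold (the numbers are assumptions, not measurements of any particular framework): offloading a 3 microsecond protocol parse to a thread pool whose hand-off costs ~10 microseconds of signaling and synchronization roughly quadruples the latency of that step, while running a 200 microsecond compression inline stalls every other task queued on that worker for the full 200 microseconds.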

Rio is an experiment with a work-stealing architecture in which completion events can also be stolen. The Rio runtime uses multiple worker threads to handle asynchronous tasks. Each worker thread runs its own io_uring, which is also registered to an eventfd. A central "stealer" thread listens on the eventfds of all workers. When a completion event becomes available, the stealer checks whether the corresponding worker is currently executing a task. If so, it signals an idle worker with a request to process the completion event and run any tasks that become ready as a result. The stealing logic is aware of the system topology and will try to wake a thread that shares a higher-level cache with the original worker.