
Comment by diarrhea

7 days ago

I’m curious about the async aspect of this. I was under the impression that PDF processing like OCR is purely CPU bound. OS file I/O interfaces are sync, so async does not help there. With the GIL, and therefore effectively single-threaded Python, I can’t see how async improves performance for the PDF use case: only parallelism helps, not concurrency. When would it yield back to the event loop while it’s busy number crunching?

Thanks for asking!

It's both. The OCR part is of course CPU bound, but the full text extraction pipeline also involves reading files, or writing and then reading files.

Without async, these simply block.

As for efficiency: if you're working in an async application context, you have to "asyncify" these operations or suffer the consequences.
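
A minimal sketch of that pattern, assuming a hypothetical blocking extract_file_sync that stands in for the real file reads plus OCR (not Kreuzberg's actual API); the point is only the contrast between calling it directly and offloading it:

```python
import asyncio

def extract_file_sync(path: str) -> str:
    """Hypothetical blocking extraction: file I/O plus CPU-bound OCR."""
    ...

async def handler_blocking(path: str) -> str:
    # Called directly, the blocking work stalls the whole event loop:
    # no other task runs until it finishes.
    return extract_file_sync(path)

async def handler_offloaded(path: str) -> str:
    # "Asyncified": the blocking work runs on a worker thread while the
    # event loop keeps serving other tasks until the result is ready.
    return await asyncio.to_thread(extract_file_sync, path)
```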

  • In that case, what’s the deal with extract_bytes being async? I’m not incredibly familiar with Python, but I’d expect a “byte string” to already be in memory.

It just litters perfectly reasonable Python code with async/await. Maybe they are preparing for something we don't know about, like a parallel async executor that can be set up to use native threads without code changes and somehow protects you if it detects shared state.

  • Caveat: I have looked at neither the API nor the implementation of Kreuzberg; this is purely from personal work.

    Even with CPU-bound code in Python, there are valid reasons to be using async code. Recognizing that the code is CPU bound, it is possible to use thread and/or process pools to achieve a certain level of parallelism in Python (see the sketch after this comment). Threading won't buy you much in Python until 3.13t, due to the GIL. Even with 3.12+ (with the GIL enabled), it's possible (but not trivial) to use threading with subinterpreters (which have their own, separate GIL). See PEP 734 [0].

    I'm currently investigating the use of subinterpreters on a project at work where I'm now CPU bound. I already use multiprocessing & async elsewhere, but I am curious whether PEP 734 is easier/faster/slower or even feasible for me. I haven't gotten as far as actually running any code to compare (I need to refactor my code a bit, splitting the work up differently to account for being CPU bound instead of just IO bound).

    [0] https://peps.python.org/pep-0734/
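
    A rough sketch of the pool-based route described above, assuming a hypothetical CPU-bound ocr_page function; a process pool sidesteps the GIL, and run_in_executor keeps the call awaitable from async code (a subinterpreter-based pool per PEP 734 could be swapped in on Python versions that ship one):

    ```python
    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def ocr_page(page_bytes: bytes) -> str:
        """Hypothetical CPU-bound work, e.g. OCR on one rendered page."""
        ...

    async def ocr_pages(pages: list[bytes]) -> list[str]:
        loop = asyncio.get_running_loop()
        # Each page is processed in a separate worker process, so the GIL is
        # not a bottleneck; the event loop only awaits the resulting futures.
        with ProcessPoolExecutor() as pool:
            tasks = [loop.run_in_executor(pool, ocr_page, page) for page in pages]
            return await asyncio.gather(*tasks)
    ```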

    • Will it keep holding the GIL if you run a native C / FFI extension in a thread executor with asyncio? If not, that would also add to the benefits of asyncio.

  • > It just litters perfectly reasonable python code with async/await

    Yeah. As an API consumer I would not expect a PDF API to do IO, and hence to be async. Have the library be sans-io, keep the interfaces sync, and let callers in async code handle IO on their end, offloading to IO threads (roughly the pattern sketched below).

    Async is also referred to as “best practice”, but it’s just a tool, for specific use cases. And I say that as an “async fan”!

    That said, perhaps it’s easier nowadays to just do async by default, as you say. The real world is async anyway, so why not program closer to that reality.
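
    A sketch of the sans-io shape, with a hypothetical pure-sync extract_text that only ever sees bytes; the async application owns the I/O and picks its own offloading strategy (a worker thread or a process pool):

    ```python
    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path

    def extract_text(data: bytes) -> str:
        """Hypothetical sans-io core: a pure sync function of bytes, no I/O inside."""
        ...

    async def application(path: str, pool: ProcessPoolExecutor) -> str:
        # The caller does its own I/O and decides how to offload the CPU work;
        # the library never needs to know an event loop exists.
        data = await asyncio.to_thread(Path(path).read_bytes)
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(pool, extract_text, data)
    ```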

    • Async is great when you truly need it, but it can overcomplicate things when misused. Having both sync and async options seems like the best approach: it lets devs choose based on their needs rather than forcing one paradigm.
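
      One common way to offer both, sketched with a hypothetical extractor: keep the sync implementation as the single source of truth and expose a thin async wrapper over it:

      ```python
      import asyncio

      def extract_file(path: str) -> str:
          """Hypothetical sync API: blocking file reads plus CPU-bound extraction."""
          ...

      async def extract_file_async(path: str) -> str:
          # Thin async facade: same behaviour, offloaded to a worker thread so
          # async callers don't block their event loop.
          return await asyncio.to_thread(extract_file, path)
      ```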

It is probably not worth the complexity currently, but considering they are using small local CPU models for OCR such as Tesseract, I wouldn't be so sure about the CPU-bound aspect if they add support for reading files over the web.
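
If remote inputs do get added, the workload becomes genuinely mixed. A sketch assuming an async HTTP client such as httpx and a hypothetical ocr_pdf for the CPU-bound part: the download benefits from async directly, while the OCR still has to be offloaded.

```python
import asyncio

import httpx  # assumption: any async HTTP client would do here

def ocr_pdf(data: bytes) -> str:
    """Hypothetical CPU-bound OCR step."""
    ...

async def extract_from_url(url: str) -> str:
    # The download is I/O bound and benefits from async concurrency directly.
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()
        data = response.content
    # The OCR remains CPU bound and is pushed onto a worker thread.
    return await asyncio.to_thread(ocr_pdf, data)
```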