Comment by LatencyKills

6 days ago

I've been working on a utility that lets me "see through" app windows on macOS [1] (I was a dev on Apple's Xcode team and have a strong understanding of how to do this efficiently using private APIs).

I wondered how Claude Code would approach the problem. I fully expected it to do something most human engineers would do: brute-force with ScreenCaptureKit.

It almost instantly figured out that it didn't have to "see through" anything and (correctly) dismissed ScreenCaptureKit due to the performance overhead.

This obviously isn't a "frontier" type problem, but I was impressed that it came up with a novel solution.

[1]: https://imgur.com/a/gWTGGYa

That's actually pretty cool. What made you think of doing this in the first place?

  • Thanks! I've been doing a lot of work on a laptop screen (I normally work on an ultrawide) and got tired of constantly switching between windows to find the information I need.

    I've also added the ability to create a picture-in-picture section of any application window, so you can move a window to the background while still seeing its important content.

    I'll probably do a Show HN at some point.

Was it a novel solution for you or for everyone? Because that's a pretty big difference. A lot of stuff that's novel to me is something someone somewhere has been doing for decades.

  • Unless you worked on the macOS content server directly you’d have no idea that my solution was even possible.

    The fact that Claude skipped over all the obvious solutions is why I used the word novel.

    • How confident are you that this knowledge was not part of the training data? Were there no Stack Overflow questions/replies about it, no tech forum posts, private knowledge bases, etc.?

      Not trying to diminish its results, but one should always assume that LLMs have a rough memory of pretty much the whole internet and of human knowledge. Google itself was very impressive back then in how it managed to dig out stuff that interested me (though it's no longer good at finding a single article even with almost exact keywords...), and what makes LLMs especially great is that they combine that recall with some surface-level transformation to make the information fit the current, particular need.

Why is ScreenCaptureKit a bad choice for performance?

  • Because you can't control what the content server is doing. SCK doesn't care if you only need a small section of a window: it performs multiple full window memory copies that aren't a problem for normal screen recorders... but for a utility like mine, the user needs to see the updated content in milliseconds.

    Also, as I mentioned above, when using SCK, the user cannot minimize or maximize any "watched" window, which is, in most cases, a deal-breaker.

    My solution runs at under 2% CPU utilization because I don't have to first receive the full window content. SCK was not designed for this use case at all.

    • It's been a while since I looked at this but I'm not entirely sure I agree with this. ScreenCaptureKit vends IOSurfaces which don't have copies besides the one that happens to fill the buffer during rendering. I'm not entirely sure what other options you have that are better besides maybe portal views.

What was the solution?

  • Well, I'm not going to share either solution as this is actually a pretty useful utility that I plan on releasing, but the short answer is: 1) don't use ScreenCaptureKit, and 2) take advantage of what CGWindowListCreateImage() offers through the content server. This is a simple IPC mechanism that does not trigger all the SCK limitations (i.e., no multi-space or multi-desktop support). In fact, when using SCK, the user cannot even minimize the "watched" window.

    Claude realized those issues right from the start.

    One of the trickiest parts is tracking the window content while the window is moving - the content server doesn't, natively, provide that information.
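The window-tracking problem mentioned above can be illustrated with a small geometry sketch. This is not the commenter's implementation (which was not shared); it only shows one plausible way to keep a watched region attached to a moving window, by storing the region relative to the window's origin and recomputing its screen position whenever a new window frame is observed. The `Rect` type, the `region_on_screen` helper, and all of the numbers are hypothetical, and a top-left-origin coordinate system is assumed. How the current window frame is obtained (for example, by polling) is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; a top-left origin is assumed (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def region_on_screen(window_frame: Rect, region_in_window: Rect) -> Rect:
    """Map a window-relative region into screen coordinates.

    The region is clipped to the window's bounds, so a region that
    extends past the window edge shrinks instead of leaking outside it.
    """
    # Clip in window-local coordinates first.
    left = max(0.0, region_in_window.x)
    top = max(0.0, region_in_window.y)
    right = min(window_frame.width, region_in_window.x + region_in_window.width)
    bottom = min(window_frame.height, region_in_window.y + region_in_window.height)
    width = max(0.0, right - left)
    height = max(0.0, bottom - top)
    # Then translate by the window's current screen-space origin.
    return Rect(window_frame.x + left, window_frame.y + top, width, height)


# The watched region is stored once, relative to the window.
watched = Rect(20, 40, 300, 200)

# Recompute the screen-space rectangle each time the window frame changes.
before = region_on_screen(Rect(100, 100, 800, 600), watched)
after = region_on_screen(Rect(350, 220, 800, 600), watched)
print(before)
print(after)
```

Storing the region in window-relative coordinates means a move only changes the translation step, so the watched area follows the window without being redefined.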