Comment by Uptrenda
3 years ago
Anyone who has ever pulled a website from a script knows the pain that is JavaScript. Normally you just want to get some text and work out the API actions, but a lot of sites use horribly obfuscated JavaScript, either because that's what modern web development is (lolz) or because it's part of their 'security.' That means if you want to write browser-based bots properly, you ought to use a browser. There are browsers that run 'headlessly' or are designed mostly for bot use, and tools like https://www.selenium.dev/ that plug into a few different 'browser engines.'
But now you have another problem. Your simple script goes from being a small, simple, self-contained, elegant gem to requiring a full browser, specialized drivers, and/or daemons running just to work. If you're using something like Python, you frankly don't have very good packaging, so it's hard to string all of that together into a solution and have it magically work for everyone. What youtube-dl has done is good engineering. Even though it's not a full JS interpreter, they've kept their software lean, self-contained, and easier to use.
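To make the "lean" approach concrete, here is a minimal sketch of getting data out of a page without executing any of its JavaScript, using only the Python standard library. It assumes the page embeds its state as a JSON literal assigned to a variable in a script tag; the variable name `initialData`, the sample HTML, and the function name are all made up for illustration, and this is not youtube-dl's actual code.

```python
import json
import re

# Hypothetical pattern: the page assigns a JSON literal to a variable
# named `initialData` inside a <script> tag.
_ASSIGNMENT = re.compile(r"var\s+initialData\s*=\s*")


def extract_initial_data(html: str):
    """Return the JSON value assigned to `initialData` in the page source."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        raise ValueError("initialData assignment not found")
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores whatever follows it (the trailing ';', more script, etc.).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


# Made-up page source standing in for the body of an HTTP response.
SAMPLE_HTML = """
<html><body>
<script>var initialData = {"title": "Demo", "formats": [{"id": 18, "url": "https://example.com/v.mp4"}]};</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_initial_data(SAMPLE_HTML)["formats"][0]["url"])
```

In a real script the HTML would come from a plain HTTP request (for example `urllib.request.urlopen`), which is the whole appeal: no browser, no driver, no daemon. The catch is that it only works while the data sits in the page source in a predictable shape.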
Embedding V8 can work quite well: https://github.com/sqreen/PyMiniRacer
You probably have to emulate some of the DOM, but you can interact directly with whatever obfuscated/packed scripts the site uses, in a more lightweight and secure way than driving an entire browser.
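A rough sketch of what that can look like, assuming the `py_mini_racer` package is installed (`pip install py-mini-racer`). The DOM stub, the "packed" script, and the `sign` function are invented stand-ins for whatever a real site ships; only `MiniRacer()`, `eval`, and `call` come from the library.

```python
# Hypothetical DOM stub: just enough globals for the script below to run.
# A real site usually needs more (document, navigator, ...), added as needed.
DOM_STUB = """
var window = this;
var location = { hostname: "example.com" };
"""

# Stand-in for a site's obfuscated/packed script (invented for this example).
PACKED_SCRIPT = """
function sign(s) {
  return s.split("").reverse().join("") + ":" + window.location.hostname;
}
"""


def run_packed_script(script: str, func_name: str, *args):
    """Load `script` into a fresh V8 context and call one of its functions."""
    # Imported here so the module still loads without the third-party package.
    from py_mini_racer import MiniRacer

    ctx = MiniRacer()
    ctx.eval(DOM_STUB)   # define the fake globals first
    ctx.eval(script)     # then load the site's script on top of them
    return ctx.call(func_name, *args)


if __name__ == "__main__":
    print(run_packed_script(PACKED_SCRIPT, "sign", "abc"))
```

The script runs in a bare V8 context with no network or filesystem access of its own, which is where the "more lightweight and secure" part comes from. Anything the script expects from a browser has to be stubbed by hand.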
I use PyMiniRacer to great effect for that sort of scraping.
Just npm install puppeteer.
Puppeteer is cool, but it's exactly what OP is warning against: a full browser that is downloaded and run through npm. It's remarkably well packaged, but still far more error-prone than a simple HTTP request, and far more likely to break on its own with the passage of time.
Yes, but:
"Your simple script goes from being small, simple, self-contained, and elegant gem, to requiring a full browser, specialized drivers, and/or daemons running just to work"
Complex problems cannot be solved by simple scripts, but they can be abstracted away to vendor libraries when/if they are well maintained, such as in this case. While it can break with time, at least someone else fixes it for you.
There's also puppeteer-core, which lets you use your own (Google Chrome) browser. If your own browser is broken, you have bigger problems than youtube-dl not working.
By the way, there is also Playwright [1], and it has Python bindings too [2].
[1]: https://playwright.dev/
[2]: https://playwright.dev/python/docs/intro