Comment by wirthal1990
1 month ago
I'm working on a phytochemical dataset product. The base is USDA Dr. Duke's Phytochemical and Ethnobotanical Databases — 16 relational CSV tables that I denormalized into a single flat table in PostgreSQL. Then I ran async Python enrichment pipelines against PubMed, ClinicalTrials.gov, ChEMBL, and PatentsView (USPTO), producing 8 columns per record across 104,388 rows.
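For readers curious what an async enrichment pass of this kind can look like, here is a minimal sketch. It is not the pipeline described above: the two fetchers (`fake_pubmed_count`, `fake_trial_count`), the column names, and the concurrency limit are hypothetical stand-ins, and no real API is called. The point is the shape: one coroutine per compound, a semaphore to bound concurrency, and one output column per source.

```python
import asyncio

# Hypothetical stand-ins for real HTTP calls; each returns one enrichment value.
async def fake_pubmed_count(name: str) -> int:
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return len(name)

async def fake_trial_count(name: str) -> int:
    await asyncio.sleep(0)
    return name.count("a")

# Assumed column name -> fetcher mapping (illustrative only).
SOURCES = {"pubmed_count": fake_pubmed_count, "trial_count": fake_trial_count}

async def enrich_one(name: str, sem: asyncio.Semaphore) -> dict:
    # The semaphore caps how many compounds are being enriched at once.
    async with sem:
        values = await asyncio.gather(*(fn(name) for fn in SOURCES.values()))
    return {"compound": name, **dict(zip(SOURCES, values))}

async def enrich_all(names, max_concurrency: int = 8) -> list:
    sem = asyncio.Semaphore(max_concurrency)
    # gather() returns results in the same order as the input names.
    return await asyncio.gather(*(enrich_one(n, sem) for n in names))

rows = asyncio.run(enrich_all(["quercetin", "kaempferol"]))
```

Each dict in `rows` can then be written back to the flat table as one set of enrichment columns for that compound.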
The interesting engineering problem was ChEMBL: most phytochemical names don't have direct ChEMBL entries, so the pipeline first tries a name match, then falls back to PubChem for CID → InChIKey resolution before hitting ChEMBL's molecule API. Full enrichment with Aho-Corasick string matching took ~24 seconds for 24,771 compounds.
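The fallback order described here (name match first, then PubChem CID → InChIKey, then ChEMBL) can be sketched as below. This is an illustration under assumptions, not the author's code: the four lookup callables are hypothetical and injected so the example runs offline, and the identifiers in the demo tables are made-up placeholders rather than real ChEMBL or PubChem values.

```python
from typing import Callable, Optional, Tuple

# A lookup takes one identifier and returns another, or None on a miss.
Lookup = Callable[[str], Optional[str]]

def resolve_chembl_id(
    name: str,
    chembl_by_name: Lookup,
    pubchem_cid: Lookup,
    inchikey_for_cid: Lookup,
    chembl_by_inchikey: Lookup,
) -> Tuple[Optional[str], str]:
    """Return (chembl_id, route), where route records which path succeeded."""
    # Step 1: direct name match against ChEMBL.
    hit = chembl_by_name(name)
    if hit is not None:
        return hit, "name"
    # Step 2: fall back to PubChem for a CID, then that CID's InChIKey.
    cid = pubchem_cid(name)
    key = inchikey_for_cid(cid) if cid is not None else None
    if key is None:
        return None, "unresolved"
    # Step 3: query ChEMBL by InChIKey.
    hit = chembl_by_inchikey(key)
    return (hit, "inchikey") if hit is not None else (None, "unresolved")

# Made-up placeholder tables, not real identifiers.
BY_NAME = {"compound-a": "CHEMBL_A"}
CIDS = {"compound-b": "CID_B", "compound-c": "CID_C"}
KEYS = {"CID_B": "KEY_B"}
BY_KEY = {"KEY_B": "CHEMBL_B"}

def resolve(name: str) -> Tuple[Optional[str], str]:
    return resolve_chembl_id(name, BY_NAME.get, CIDS.get, KEYS.get, BY_KEY.get)
```

Returning the route alongside the ID makes it easy to count how many compounds needed the PubChem detour and how many stayed unresolved.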
I'm building the commercial layer on top: a Rust/Actix-Web API, 97K static programmatic SEO (pSEO) pages on Cloudflare Workers/R2, and Stripe for one-time purchases. Solo founder, bootstrapped, based in Germany.