
Comment by freedomben

2 days ago

Awesome! This is the type of stuff I'm most excited about with AI - improvements to medical research and capabilities. AI can be awesome at identifying patterns in data that humans can't, and there have to be troves of data out there full of patterns that we aren't catching.

Of course there's also the possibility of engineering new drugs/treatments and things, which is also super exciting.

Agreed. There is deep potential for ML in healthcare. We need more contributors advancing research in this space. One opportunity as people look around: many priors merit reconsideration.

For instance, genomic data that may seem identical may not actually be identical. In classic biological representations (FASTA), canonical cytosine and methylated cytosine are both collapsed into the letter "C" even though differences may spur differential gene expression.
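To make the FASTA point concrete, here is a minimal sketch in Python. The `Base` class and its `methylated` flag are a toy model invented for illustration, not a real file format or library API. It only shows that two molecules differing in one methylation mark produce the same plain-letter sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Base:
    """Toy nucleotide model (hypothetical, for illustration only)."""
    letter: str               # one of A, C, G, T
    methylated: bool = False  # only meaningful for C in this sketch


def to_fasta_sequence(bases):
    """Keep only the letters, dropping the methylation flag,
    the way a plain FASTA sequence line would."""
    return "".join(b.letter for b in bases)


# Two samples that differ only in the methylation state of one cytosine.
sample_a = [Base("A"), Base("C"), Base("G"), Base("T")]
sample_b = [Base("A"), Base("C", methylated=True), Base("G"), Base("T")]

same_fasta = to_fasta_sequence(sample_a) == to_fasta_sequence(sample_b)
same_molecule = sample_a == sample_b

print(same_fasta, same_molecule)  # True False
```

A model trained only on the letters never sees the difference between the two samples.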

What's the optimal tokenization algorithm and architecture for genomic models? How about protein binding prediction? Unclear!
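As a rough illustration of why the tokenization question is open, the sketch below splits the same short sequence three ways: single characters, overlapping k-mers, and non-overlapping k-mers. The function names and the choice of k=3 are assumptions made for the example. It makes no claim about which scheme works best.

```python
def char_tokens(seq):
    """One token per nucleotide."""
    return list(seq)


def kmer_tokens(seq, k=3, stride=1):
    """Fixed-length k-mers; stride=1 overlaps, stride=k does not.
    Trailing bases that do not fill a whole k-mer are dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ACGTAC"

print(char_tokens(seq))               # ['A', 'C', 'G', 'T', 'A', 'C']
print(kmer_tokens(seq, k=3))          # ['ACG', 'CGT', 'GTA', 'TAC']
print(kmer_tokens(seq, k=3, stride=3))  # ['ACG', 'TAC']
```

Each choice trades vocabulary size against sequence length, and a single-base change touches a different number of tokens in each scheme.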

There are so many open questions in biomedical ML.

Biomedicine arguably combines as many open questions with as much potential impact as any field: if you help answer some of these questions, you could save lives.

Hopefully, awesome frameworks like this lower barriers and attract more people.

  • I'd love to hear more of your thoughts re open questions in biomedical ML. You sound like you have a crisp, nuanced grasp of the landscape, which is rare. That would be very helpful to me, as an undergrad in CS (with bio) trying to crystallize research to pursue in bio/ML/GenAI.

    Thank you.

    • Thanks, but no one truly understands biomedicine, let alone biomedical ML.

      Feynman's quote -- "A scientist is never certain" -- is apt for biomedical ML.

      Context: imagine the human body as the most devilish operating system ever: 10b+ lines of code (more than merely genomics), tight coupling everywhere, zero comments. Oh, and one faulty line may cause death.

      Are you more interested in data, ML, or biology (e.g., predicting cancerous mutations or drug toxicology)?

      Biomedical data underlies everything and may be the easiest starting point, because existing datasets are so poor and limited.

      We had to pay Stanford doctors to annotate QA questions because existing datasets were so unreliable. (MCQ dataset partially released, full release coming).

      For ML, MedGemma from Google DeepMind is open and at the frontier.

      Biology mostly requires publishing, but there are still ways to help.

      After sharing preferences, I can offer a more targeted path.
