Comment by idonotknowwhy

1 year ago

> This is missing the most interesting changes in generative AI space over the last 18 months

I agree, though personally I'm liking the "big thing" as well. R1 is able to one-shot a lot of work for me, churning away in the background while I do other things.

> Multi-modal

IMO this is still early days and less reliable. What are some of your daily use cases?

> Context lengths

This is the biggest thing IMO: models remaining coherent at contexts longer than 32k.

That, and whatever improvements have let models like Qwen2.5 write valid code reliably, compared with the GPT-4 and earlier days.

There are also a whole lot of useful smaller niche projects on HF, like extracting vocals/drums/piano from music, etc.

simonw 1 year ago

Multi-modal audio is great. I talk to ChatGPT when I'm cooking or walking the dog.

For images I use it for things like helping draft initial alt text for images, extracting tables from screenshots, translating photos of signs in languages I don't speak - and then really fun stuff like "invent a recipe to recreate this plate of food" or "my CSS renders like this, what should I change?" or "How do you think I turn on this oven?" (in an Airbnb).

I've recently started using the share-screen feature provided for Gemini by https://aistudio.google.com/live when I'm reading academic papers and I want help understanding the math. I can say "What does this symbol with the squiggle above it mean?" out loud and Gemini will explain it for me - works really well.

qingcharles 1 year ago

Multi-modal was the absolute game-changer.

Just last night I was digging around in my basement, pulling apart my furnace, showing pics of the inside of it, having GPT explain how it works and what I needed to do to fix it.

- camdenreslink 1 year ago
  
  I would never trust an LLM to do this unless it was pointing me to pages/sections in a real manual or reputable source I could reference.
  
- idonotknowwhy 1 year ago
  
  Oh right, yeah, I've done things like this (phone calls to ChatGPT), or used the openwebui Whisper -> LLM -> TTS setup. I thought there might be something more than this by now.