Comment by idonotknowwhy
1 year ago
> This is missing the most interesting changes in generative AI space over the last 18 months
I agree, though personally I'm liking the "big thing" as well. R1 is able to one-shot a lot of work for me, churning away in the background while I do other things.
> Multi-modal
IMO this is still early days and less reliable. What are some of your daily use cases?
> Context lengths
This is the biggest thing IMO (Models remaining coherent at > 32k contexts)
And whatever improvements have let models like Qwen2.5 write valid code reliably, compared with the GPT-4 and earlier days.
There are a whole lot of useful smaller niche projects on HF, like extracting vocals/drums/piano from music, etc.
Multi-modal audio is great. I talk to ChatGPT when I'm cooking or walking the dog.
For images I use it for things like helping draft initial alt text for images, extracting tables from screenshots, translating photos of signs in languages I don't speak - and then really fun stuff like "invent a recipe to recreate this plate of food" or "my CSS renders like this, what should I change?" or "How do you think I turn on this oven?" (in an Airbnb).
I've recently started using the share-screen feature provided for Gemini by https://aistudio.google.com/live when I'm reading academic papers and I want help understanding the math. I can say "What does this symbol with the squiggle above it mean?" out loud and Gemini will explain it for me - works really well.
Multi-modal was the absolute game-changer.
Just last night I was digging around in my basement, pulling apart my furnace, showing pics of the inside of it, having GPT explain how it works and what I needed to do to fix it.
I would never trust an LLM to do this unless it was pointing me to pages/sections in a real manual or reputable source I could reference.
Oh right, yeah, I've done things like this (phone calls to ChatGPT) or the openwebui Whisper -> LLM -> TTS setup. I thought there might be something more than this by now.