Comment by simonw
1 year ago
Multi-modal audio is great. I talk to ChatGPT when I'm cooking or walking the dog.
For images I use it for things like helping draft initial alt text for images, extracting tables from screenshots, translating photos of signs in languages I don't speak - and then really fun stuff like "invent a recipe to recreate this plate of food" or "my CSS renders like this, what should I change?" or "How do you think I turn on this oven?" (in an Airbnb).
I've recently started using the share-screen feature provided for Gemini by https://aistudio.google.com/live when I'm reading academic papers and I want help understanding the math. I can say "What does this symbol with the squiggle above it mean?" out loud and Gemini will explain it for me - works really well.
Multi-modal was the absolute game-changer.
Just last night I was digging around in my basement, pulling apart my furnace, showing pics of the inside of it, having GPT explain how it works and what I needed to do to fix it.
I would never trust an LLM to do this unless it was pointing me to pages/sections in a real manual or reputable source I could reference.
I admire your optimism that good manuals and reputable sources exist for the average furnace in the average basement.
Oh right, yeah I've done things like this (phone calls to ChatGPT) or the Open WebUI Whisper -> LLM -> TTS setup. I thought there might be something more than this by now.