
Comment by magicalhippo

1 year ago

I'd guess the latter.

My SO wanted Google Assistant at home after getting used to it in our Android Automotive-based car. So I've been dabbling with local LLMs, as a learning experience.

I have one prompt which classifies the question, i.e. whether it's asking about the weather, a math question, a knowledge question, etc. There I ask it to output only the category, so I can easily do different things based on that.

For knowledge-based stuff I include our town and country, tell it to use metric units and be brief, and tell it to ask clarifying questions if needed. If I don't, it'll use miles, or both miles and km, be too long-winded, and assume too much.

For calculations, I've been asking it to output Octave code that computes the answer, putting the result in a specific variable name, without explanation. If it can't, it should output a special sequence instead. Without these instructions it'll include explanations of what the code does and won't be consistent with variable naming.

I've been using Gemma 9B so far, which performs well on my aging 2080 Ti. I haven't actually put all the pieces together yet (my SO only asked last weekend), but it seems very promising, and adding the extra instructions for each task radically changes the output and makes this approach viable.
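To give an idea of how little glue code this needs, here's a rough sketch of the classify-then-dispatch flow. It assumes the model is served locally by something like Ollama; the endpoint, the model tag and the <town>, <country> placeholder are stand-ins, and the prompts are condensed versions of the ones shown further down:

    import requests

    OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes a local Ollama server
    MODEL = "gemma2:9b"                                  # stand-in model tag

    def ask_llm(prompt):
        # Send one prompt to the locally served model and return its text reply.
        resp = requests.post(
            OLLAMA_URL,
            json={"model": MODEL, "prompt": prompt, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["response"].strip()

    CATEGORIES = [
        "knowledge question",
        "math calculation or equation",
        "information about public transport",
        "information about weather",
        "unknown",
    ]

    def classify(question):
        # First pass: ask only for the category so it's easy to branch on.
        prompt = ("Try to categorize the following question in the following categories: "
                  + ", ".join(CATEGORIES)
                  + ". Output only the category. The question is: " + question)
        reply = ask_llm(prompt).lower()
        return next((c for c in CATEGORIES if c in reply), "unknown")

    def answer(question):
        # Second pass: pick a task-specific prompt based on the category.
        category = classify(question)
        if category == "math calculation or equation":
            prompt = ("Convert the following math question into octave/matlab syntax "
                      "which calculates the answer if possible. The output variable "
                      "should be x. Only output the code. The question is: " + question)
        else:
            prompt = ("You are a brief home assistant located in <town>, <country>. "
                      "Use metric units. Ask clarifying questions if needed. " + question)
        return ask_llm(prompt)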

Btw, I know there are probably tons of these assistants out there. I just enjoy figuring out how things work.

What hardware do you use for audio I/O? And what do you use for activation? I feel the rest is pretty simple to cobble together... Should be a few days of work to surpass Siri...

  • This is the tricky bit, yes. I bought some Pi HATs[1] earlier, and they seem to work fine when testing, but I haven't tried to use them properly yet. Alternatively I have some condenser microphones and an audio card HAT for the Pi.

    So I hope to be able to use that with a Pi Zero W. I've played around with using PulseAudio so I could add compression plugins and such, which seems to work.

    Haven't actually attempted wake-word detection yet, but the first port of call there will be to try the ones Rhasspy uses[2]; there's a rough sketch of what I have in mind below the links.

    Since I want to integrate with HA I might end up expanding on Rhasspy, or I might just use parts of Rhasspy, i.e. once I determine it's a HA command and not, say, a question about dinosaurs, inject the cleaned transcription into the Rhasspy pipeline.

    [1]: https://www.seeedstudio.com/ReSpeaker-2-Mics-Pi-HAT.html

    [2]: https://rhasspy.readthedocs.io/en/latest/wake-word/
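    For reference, this is roughly the shape of the wake-word loop I have in mind, using Porcupine (one of the engines Rhasspy supports). Untested on the Zero so far; the access key and keyword are placeholders:

        import struct
        import pyaudio
        import pvporcupine

        # Placeholder access key; Porcupine needs a (free) Picovoice key.
        porcupine = pvporcupine.create(access_key="YOUR_ACCESS_KEY", keywords=["computer"])

        pa = pyaudio.PyAudio()
        stream = pa.open(rate=porcupine.sample_rate, channels=1,
                         format=pyaudio.paInt16, input=True,
                         frames_per_buffer=porcupine.frame_length)

        try:
            while True:
                # Read one frame of 16-bit mono samples and feed it to the detector.
                raw = stream.read(porcupine.frame_length, exception_on_overflow=False)
                pcm = struct.unpack_from("h" * porcupine.frame_length, raw)
                if porcupine.process(pcm) >= 0:
                    print("Wake word detected, start recording the question...")
        finally:
            stream.close()
            pa.terminate()
            porcupine.delete()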

    • I was just looking at what it would take to have a local LLM assistant and ended up on the ReSpeaker too.

      I am just wrapping up another Pi Zero W audio recording project (https://github.com/pseudosavant/ps-audio-recorder). I used the first-gen Zero W since I had one already lying around. I'd strongly recommend using the newer Pi Zero 2 W though.

      The Zero W CPU is just so slow. It can barely (~1.05x) manage real-time MP3 encoding, and the single core makes it prone to being interrupted by background tasks. It couldn't even do SMB/Windows file sharing faster than 0.5 MB/s without disabling encryption in Samba. Encryption kills this CPU.

      I have a couple of Zero 2s coming in the mail.

Do you mind sharing your code at some point? Is it using home automation? How are you asking it anything? Is it via your phone, or have you set up a speaker device, or what?

  • If I manage to make something useful I'll certainly do that.

    Planning on using small Pi Zero 2-based "boxes" with a microphone; see my other reply. Though I've only done cursory tests so far.

Just for illustration, here's a concrete example. As noted, I'm a total n00b just playing around, but I think it's instructive, including about the limitations of these LLMs.

First I used a Norwegian-tuned Whisper large model to convert some audio. The audio was intentionally unfocused, to simulate a real session. The model outputs English, so it does the translation directly as well, hence the somewhat weird sentences and the use of "cake shape" rather than "cake pan". The output from Whisper was as follows:

OK. Yes, think I have a cake shape. I have a recipe for a cake shape ... Wait a minute. So, I have a recipe meant for a cake shape of 30 cm. I will use a cake shape of 24 cm. How much less do I have to do the recipe to ... That it should fit.
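The transcription step itself is just a couple of lines. Here's roughly what it looks like, using the stock openai-whisper package as a stand-in for the Norwegian-tuned model; the file name is made up:

    import whisper

    # Stand-in for the Norwegian fine-tuned large model I actually use.
    model = whisper.load_model("large-v3")

    # task="translate" makes Whisper output English directly, which is
    # where the slightly odd phrasing and "cake shape" come from.
    result = model.transcribe("kitchen_question.wav", language="no", task="translate")
    print(result["text"])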

For the following I use Gemma 9B as mentioned.

First I pass it through a cleanup step:

Input> The following statement is an audio transcript of a question or statement. It may contain errors, corrections and unrelated utterances. Rewrite it so that it is suitable for an email or text message. Remove superfluous or unrelated parts. Do not extrapolate beyond what is in the transcript. The transcript is: [Whisper output]

Output> I have a cake recipe for a 30cm cake shape. I want to use a 24cm cake shape. How much less do I need to make the recipe so it fits?

Next I pass it through a classification step:

Input> Try to categorize the following question in the following categories: knowledge question, math calculation or equation, information about public transport, information about weather, unknown. Output only the category. The question is: [cleaned output]

Output> math calculation or equation

For math questions I pass it to the following prompt:

Input> Convert the following math question into octave/matlab syntax which calculates the answer if possible. The output variable should be x. Only output the code. If the question is unclear or it's not possible to calculate, output three exclamation points followed by a brief explanation. If you need additional information, include that in the explanation. The question is: [cleaned output]

Output> !!! The question requires calculating the volume ratio between two cake shapes, which is not possible with only the provided dimensions.

So we hit a limitation. For people it's obvious that two cake pans would have a similar height, and thus the answer is calculable. However, we didn't tell it that it could make assumptions, so let's try that:

Input> Convert the following math question into octave/matlab syntax which calculates the answer if possible. The output variable should be x. Only output the code. You can make some conservative assumptions, but if the question is unclear or it's not possible to calculate, output three exclamation points followed by a brief explanation. If you need additional information, include that in the explanation. If you make some assumptions, include that as comments at the start of the code. The question is: [cleaned output]

Output>

    % Assuming the cake recipe volume is proportional to the cake shape volume
    r1 = 30/2; % Radius of the 30cm cake
    r2 = 24/2; % Radius of the 24cm cake
    h = 10; % Height of the cakes
    vol1 = pi * r1^2 * h; % Volume of the 30cm cake
    vol2 = pi * r2^2 * h; % Volume of the 24cm cake
    x = (vol1 - vol2) / vol1; % Ratio of reduction needed

So we got a satisfactory answer. However, here again you see the limitations: it doesn't truly understand the math. There's no need to assume an explicit height, as the heights cancel out in the final calculation; the reduction is simply 1 - (24/30)^2 = 0.36, i.e. 36%.
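The plumbing on my side stays trivial, though: I just hand whatever code comes back to Octave and read out x. A sketch of that step, using the generated code above (comments trimmed):

    import subprocess

    # The code block returned by the model, as above.
    generated = """
    r1 = 30/2;
    r2 = 24/2;
    h = 10;
    vol1 = pi * r1^2 * h;
    vol2 = pi * r2^2 * h;
    x = (vol1 - vol2) / vol1;
    """

    # Have Octave run it and print only the answer variable.
    proc = subprocess.run(
        ["octave", "--quiet", "--eval", generated + "\ndisp(x)"],
        capture_output=True, text=True, check=True,
    )
    print(proc.stdout.strip())  # 0.3600, i.e. reduce the recipe by 36%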

Ok, let's try that last prompt on a different and ambiguous question:

What's the cutoff frequency for an LC circuit with R equals 500 ohm and C equals 10 nanofarad?

Output>

    ```octave
    % Assuming L is not given, we cannot calculate the cutoff frequency.
    % The cutoff frequency formula requires the inductance (L) value.
    !!! Need the inductance (L) value to calculate the cutoff frequency.
    ```

It correctly determined the prompt was ambiguous to the point of not being answerable. However, it also suddenly included Markdown fences, which I didn't tell it to do, and which it didn't do in the previous case. So maybe the prompt needs to be even more explicit... or the parsing needs to be tolerant of it, as in the sketch below.
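One way around that, rather than fighting the prompt further, is to make the output parsing tolerant: strip any code fences before looking for the !!! marker. A small sketch:

    import re

    def parse_math_reply(text):
        # Returns (code, None) on success, or (None, explanation) when the model bailed out.
        # Strip optional Markdown fences like ```octave ... ``` that the model
        # sometimes adds even when told not to.
        cleaned = re.sub(r"^```[a-zA-Z]*\s*|\s*```$", "", text.strip())
        # Ignore %-comment lines when checking for the bail-out marker.
        body = "\n".join(l for l in cleaned.splitlines()
                         if not l.strip().startswith("%")).strip()
        if body.startswith("!!!"):
            return None, body[3:].strip()
        return cleaned, None

With that in place, both the fenced reply above and the plain one from the cake question go through the same code path.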