
Comment by jokethrowaway

10 days ago

Whisper is definitely nice, but it's a bit too slow. Having subtitles and transcription for everything is great, but NeMo Parakeet (pretty much Whisper by NVIDIA) completely changed how I interact with the computer.

It enables dictation that actually works, and it's as fast as you can think. I also have a set of scripts that just wait for voice commands and do things. I can pipe the results to an LLM, run commands, synthesize a voice back with F5-TTS, and it's like having a local Jarvis.

The main limitation is that it's English-only.

Yeah, mind sharing any of the scripts? I looked at the docs briefly; it looks like we need to install ALL of NeMo to get access to Parakeet? Seems ultra heavy.

  • You only need the ASR bits -- this is where I got to when I previously looked into running Parakeet:

        # NeMo does not run on 3.13+
        python3.12 -m venv .venv
        source .venv/bin/activate
    
        git clone https://github.com/NVIDIA/NeMo.git nemo
        cd nemo
    
        pip install torch torchaudio torchvision --index-url https://download.pytorch.org/whl/cu128
        pip install ".[asr]"  # quoted so zsh doesn't try to glob the brackets
    
        deactivate
    

    Then run a transcribe.py script in that venv:

        import os
        import sys
        import nemo.collections.asr as nemo_asr
    
        model_path = sys.argv[1]
        audio_path = sys.argv[2]
    
        # Load from a local .nemo checkpoint, or download from
        # Hugging Face when given a model name like 'org/model'.
        if os.path.exists(model_path):
            asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(restore_path=model_path)
        else:
            asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_path)
    
        output = asr_model.transcribe([audio_path])
        print(output[0])
    

    With that I was able to run the model, but I ran out of memory on my lower-spec laptop. I haven't yet got around to running it on my workstation.

    You'll need to modify the Python script to process the response and output it in a format you can use.
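    For example, if you get the transcript back as timed segments, a small pure formatter can turn them into SRT subtitles. The `(start_seconds, end_seconds, text)` tuples are an assumed input shape here; wiring NeMo up to actually emit timestamps is not shown:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:01,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as an SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(blocks)
```

    The same segment list can just as easily be dumped as JSON or piped straight into another tool, which is the point of keeping the formatting separate from the model call.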