Ask HN: How to extract structured information from captured audio?


Hey HN,

I would like to extract structured information from captured audio on inexpensive hardware (a small LLM would be an option; I have an old NVIDIA GTX 1660 Super with 6 GB VRAM).

OpenAI Whisper could be used to get the audio contents as text, but I don't really know how I could reliably extract the information in a structured way. There is always a "purpose", selected from, say, 10 possible purposes, and "required data", which depends on the purpose and is composed of key-value pairs that also have predefined values.
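One way to make the extraction step reliable is to have the LLM emit JSON and then reject any reply that doesn't match a schema for the chosen purpose, retrying or escalating on failure. A minimal stdlib-only sketch of that validation step (the purposes and field names here are made up for illustration):

```python
import json

# Hypothetical per-purpose schemas: each purpose lists the keys its
# "required data" must contain.
SCHEMAS = {
    "apply for leave": {"start", "end"},
    "report sickness": {"date"},
}

def validate(raw: str) -> dict:
    """Parse a model's JSON reply and check the required keys for its purpose."""
    obj = json.loads(raw)  # raises if the reply isn't valid JSON
    required = SCHEMAS[obj["purpose"]]  # raises KeyError on unknown purpose
    missing = required - set(obj["data"])
    if missing:
        raise ValueError(f"missing keys for {obj['purpose']!r}: {missing}")
    return obj

reply = '{"purpose": "apply for leave", "data": {"start": "2025-11-01", "end": "2025-11-08"}}'
parsed = validate(reply)
```

When validation fails you can re-prompt the model with the error message, or hand the transcript to a human, so malformed output never reaches downstream code.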

An example (spoken text):

  Please apply for leave from 1st November to 8th November.

Result (structured data):

  {
     "purpose": "apply for leave",
     "data": {
        "start": "2025-11-01",
        "end": "2025-11-08"
     }
  }

What are my options to do this in a reliable way that can match the different purposes, each with their own data, via a "best match" approach?

A related OpenAI forum topic covers similar issues[0].

Old school: mark paragraphs/sentences, use regular expressions to pull out miscellaneous info (using linguistic 'typing', i.e. parts of speech: noun, verb, etc.), then dump the relevant remaining info in JSON/delimited format and normalize the data (e.g. "1st November" to 11/01). Multi-pass awk scripts, Perl, and Icon are languages with appropriate in-language support. Use regular expressions/statistics to detect 'outliers' and mark data for human review.
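The normalization step described above ("1st November" to an ISO date) can be sketched with stdlib regex and date handling; the year is assumed here, since spoken phrases like these usually don't carry one:

```python
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

def normalize_date(phrase: str, year: int = 2025) -> str:
    # "1st November" / "8th november" -> "YYYY-MM-DD"; the year is an assumption.
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th)?\s+([a-z]+)", phrase.lower())
    if not m:
        raise ValueError(f"no date found in {phrase!r}")
    day, month = int(m.group(1)), MONTHS[m.group(2)]
    return date(year, month, day).isoformat()  # also validates the day/month

normalize_date("1st November")  # -> "2025-11-01"
```

Routing everything through `datetime.date` means impossible dates ("31st February") raise instead of silently producing garbage, which fits the mark-for-human-review idea.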

Multi-pass awk would require a codex of phrases mapped to each delimited/JSON tag. So: first pass, identify phrases (perhaps also spell-correct) and categorize each phrase under its delimited field (via human intervention); then rescan, check for 'outliers'/conflicting normalizations, and have the script apply corrections per the human annotations.
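The codex idea above, condensed into Python rather than awk: known phrases map to a (field, normalized value) pair, and anything the codex doesn't cover lands in a review pile for a human. The codex entries here are illustrative:

```python
# Hand-built codex: phrase -> (field, normalized value). Illustrative only.
CODEX = {
    "apply for leave": ("purpose", "apply for leave"),
    "1st november": ("start", "2025-11-01"),
    "8th november": ("end", "2025-11-08"),
}

def categorize(phrases):
    """Split phrases into a structured record and an outlier list for review."""
    record, review = {}, []
    for p in phrases:
        key = p.lower().strip()
        if key in CODEX:
            field, value = CODEX[key]
            record[field] = value
        else:
            review.append(p)  # outlier: needs human annotation
    return record, review

record, review = categorize(
    ["apply for leave", "1st November", "8th november", "asap"])
```

The second pass (conflict detection, applying human corrections) would amount to re-running `categorize` with an updated codex and diffing the outputs.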

Note: normalized phonetic annotations are a bit easier to handle than common dictionary spelling.

[0] : https://community.openai.com/t/summarizing-and-extracting-st...