Comment by sargstuff
12 hours ago
Related OpenAI forum topic(s) that cover similar issues [0].
Old school: mark each 'paragraph'/sentence, regular-expression out the miscellaneous info (using linguistic 'typing', i.e. noun, verb, etc.), then dump the relevant remaining info into JSON/delimited format and normalize the data (e.g. "1st November" to "11/01"). Multi-pass awk scripts, Perl, and Icon are languages with appropriate in-language support. Use regular expressions/statistics to detect 'outliers' and mark data for human review.
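A rough Python sketch of the normalize-then-flag-outliers step (the comment's own suggestion is awk/Perl/Icon; the field names, amounts, and thresholds below are invented purely for illustration):

    import re, json

    MONTHS = {"january": "01", "february": "02", "march": "03", "april": "04",
              "may": "05", "june": "06", "july": "07", "august": "08",
              "september": "09", "october": "10", "november": "11", "december": "12"}

    def normalize_dates(text):
        # "1st November" -> "11/01", as in the example above.
        def repl(m):
            return f"{MONTHS[m.group(2).lower()]}/{int(m.group(1)):02d}"
        pattern = r"(\d{1,2})(?:st|nd|rd|th)?\s+(" + "|".join(MONTHS) + r")"
        return re.sub(pattern, repl, text, flags=re.IGNORECASE)

    def flag_outliers(record, max_amount=10_000):
        # Crude threshold check: out-of-range values go to a human, not straight into the data.
        return ["amount out of expected range"] if (record.get("amount") or 0) > max_amount else []

    raw = "Delivered 1st november; invoice total was 1,200 dollars."
    text = normalize_dates(raw)
    m = re.search(r"([\d,]+)\s+dollars", text)
    record = {"text": text, "amount": int(m.group(1).replace(",", "")) if m else None}
    record["review_flags"] = flag_outliers(record)
    print(json.dumps(record, indent=2))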
Multi-pass awk would require a codex of phrases tied to each delimited/JSON tag. So: first pass, identify phrases (perhaps also spell-correct) and categorize the phrases belonging to a given delimited field (via human intervention); then rescan, check for 'outliers'/conflicting normalizations, and have the script apply corrections per the human annotations.
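Again in Python rather than awk, a rough two-pass sketch (the CODEX dict stands in for the human-annotated phrase-to-field mapping described above; its contents are made up):

    import re

    corpus = ["ship to: 42 Elm St, paid 1st november",
              "shipped to 42 Elm Street, payment recieved 11/01"]

    # Pass 1: collect candidate phrases for a human to map to fields/tags.
    candidates = set()
    for line in corpus:
        candidates.update(re.findall(r"[a-z]+(?:\s[a-z]+)?", line.lower()))

    # ...human review of `candidates` yields a codex: phrase -> (field, normalized value)
    CODEX = {"ship to": ("address_label", "ship_to"),
             "shipped to": ("address_label", "ship_to"),
             "recieved": ("spellfix", "received")}   # spell-correction entry

    # Pass 2: rescan, apply the codex, flag conflicting normalizations for review.
    def second_pass(line):
        fields, flags = {}, []
        for phrase, (field, value) in CODEX.items():
            if phrase in line.lower():
                if field in fields and fields[field] != value:
                    flags.append(f"conflicting values for {field}")
                fields[field] = value
        return fields, flags

    for line in corpus:
        print(second_pass(line))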
Note: normalized phonetic annotations are a bit easier to handle than common dictionary spelling.
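For example, a phonetic key collapses spelling variants that dictionary-spelling matching would treat as distinct (this sketch assumes the third-party jellyfish library):

    import jellyfish  # pip install jellyfish

    # Three spellings, one phonetic key, so matching on the key ignores the spelling variation.
    print({w: jellyfish.soundex(w) for w in ["received", "recieved", "reseived"]})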
[0] : https://community.openai.com/t/summarizing-and-extracting-st...
Thanks, I'm going to read through the link. I also found some Python libs that do this, and since I need to run Whisper on the backend to transcribe the speech to text anyway, I think it would be suitable to use Python for the tokenization as well - maybe spaCy (https://www.geeksforgeeks.org/tokenization-using-spacy-libra...).
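A minimal sketch of that pipeline, assuming the openai-whisper package and spaCy's small English model (the audio file name is a placeholder):

    import whisper
    import spacy

    model = whisper.load_model("base")            # speech -> text on the backend
    result = model.transcribe("recording.mp3")    # placeholder audio file

    nlp = spacy.load("en_core_web_sm")            # python -m spacy download en_core_web_sm
    doc = nlp(result["text"])

    # Tokenization plus the linguistic 'typing' (part-of-speech) mentioned earlier.
    for token in doc:
        print(token.text, token.pos_)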
A much less traumatic programming exercise than using awk. :-) i.e., realistic programming tool(s) for the required task.