Comment by sargstuff

12 hours ago

A related OpenAI forum topic covers similar issues [0].

Old school: mark paragraph/sentence boundaries, use regular expressions to strip out miscellaneous info (aided by linguistic 'typing', i.e. tagging nouns, verbs, etc.), then dump the relevant remaining info in JSON/delimited format and normalize the data (e.g. "1st November" to 11/01). Multi-pass awk scripts, Perl, and Icon are languages with appropriate in-language support. Use regular expressions/statistics to detect 'outliers' and mark that data for human review.
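A minimal sketch of the normalization step described above, in Python rather than awk for brevity; the month table and sample phrases are illustrative, not from any particular dataset:

```python
import re

# Illustrative month lookup; target format is MM/DD, as in "1st november" -> 11/01.
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}

# Matches "1st november", "3 March", etc. (day, optional ordinal suffix, month name).
DATE_RE = re.compile(
    r"\b(\d{1,2})(?:st|nd|rd|th)?\s+(" + "|".join(MONTHS) + r")\b",
    re.IGNORECASE,
)

def normalize_dates(text):
    """Rewrite '1st november'-style dates as zero-padded MM/DD."""
    def repl(m):
        day = int(m.group(1))
        month = MONTHS[m.group(2).lower()]
        return f"{month:02d}/{day:02d}"
    return DATE_RE.sub(repl, text)

print(normalize_dates("delivered 1st november, returned 3 March"))
# -> delivered 11/01, returned 03/03
```

The same substitution-with-a-lookup-table idea carries over directly to awk's `gsub`/`match` or Perl's `s///e`.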

Multi-pass awk would require a codex of phrases mapped to delimited/JSON tags. So: first pass, identify phrases (perhaps also spell-correct); categorize each phrase under its delimited field (via human intervention); then rescan, check for 'outliers'/conflicting normalizations, and have the script apply corrections per the human annotations.
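The two-pass idea can be sketched as follows, assuming a hand-built codex mapping known phrases to field tags; the field names and sample lines here are hypothetical:

```python
# Hypothetical codex: known phrase -> delimited-field tag (built by a human).
CODEX = {
    "invoice no": "invoice_id",
    "ship to": "address",
    "total due": "amount",
}

def first_pass(lines):
    """Pass 1: tag phrases found in the codex; collect unknowns as outliers."""
    tagged, outliers = [], []
    for n, line in enumerate(lines, 1):
        key = line.strip().lower()
        if key in CODEX:
            tagged.append((CODEX[key], line.strip()))
        else:
            outliers.append((n, line.strip()))  # mark for human review
    return tagged, outliers

def second_pass(outliers, annotations):
    """Pass 2: apply human annotations (line number -> field tag) to outliers."""
    return [(annotations[n], text) for n, text in outliers if n in annotations]

lines = ["Invoice No", "Total Due", "mystery header"]
tagged, outliers = first_pass(lines)
# A human reviews the outliers and supplies annotations for the rescan:
tagged += second_pass(outliers, {3: "comment"})
print(tagged)
```

In awk proper, the codex would live in a separate file read in a `BEGIN` block or via a first `ARGV` pass, with the annotations file driving the second pass.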

Note: Normalized phonetic annotations are a bit easier to handle than common dictionary spelling.
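One way to see why: a phonetic key collapses spelling variants to a single normalized token, so matching against the codex survives misspellings. Soundex is the textbook example; a minimal sketch (standard algorithm, not tied to the comment's data):

```python
def soundex(word):
    """Classic Soundex: first letter + three digits; variant spellings collapse."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h/w do not break a run of equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        # vowels reset the run, so repeated codes across a vowel both count
        prev = code if ch not in "aeiouy" else ""
    return (first + "".join(digits) + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both R163
```

So "Robert"/"Rupert" or "Smith"/"Smyth" land on the same normalized key, which is what makes the rescan/conflict-check pass cheaper than fuzzy-matching raw dictionary spellings.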

[0] : https://community.openai.com/t/summarizing-and-extracting-st...