Conditional find/replace using awk
I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).
My plan of attack is:
- Process the file word by word
- Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" produces no response at all if it is valid, and a response if it is bad)
- If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.
I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?
Any other suggested approaches would also be very welcome!
Comments (2)
You don't really need more than bash for this:
It does seem to make more sense to pass the whole file to hunspell, and parse the output of that.
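For illustration, here is a rough sketch of how that whole-file idea could look in plain POSIX shell plus awk. It is not the snippet originally posted with this answer. It assumes hunspell is installed with a default dictionary for your locale, that the subtitles are plain ASCII English, and that a "word" is simply a run of letters (so contractions such as "wouldn't" are checked in pieces). The function names `misspelled`, `build_fixes` and `fix_file` are made up for this example.

```sh
#!/bin/sh
# Sketch only: fix OCR "I"-for-"l" errors in an ASCII English subtitle file.

# Print the misspelled words found on stdin, one per line.
# Assumption: hunspell is installed and picks up a default dictionary.
misspelled() {
    hunspell -l
}

# Print "bad good" pairs for every misspelled word that becomes a valid
# word once each capital I in it is replaced with a lowercase l.
build_fixes() {
    # Split the file into letter-only words, keep those containing "I",
    # and spellcheck them all in a single hunspell run.
    tr -cs 'A-Za-z' '\n' < "$1" | grep I | sort -u | misspelled |
    while IFS= read -r word; do
        candidate=$(printf '%s' "$word" | tr I l)
        # No output from the spellchecker means the candidate is valid.
        if [ -z "$(printf '%s\n' "$candidate" | misspelled)" ]; then
            printf '%s %s\n' "$word" "$candidate"
        fi
    done
}

# Print the file with every fixable word replaced; all other text,
# punctuation and spacing is copied through unchanged.
fix_file() {
    FIXES=$(build_fixes "$1") awk '
    BEGIN {
        # Load the "bad good" pairs passed in through the environment.
        n = split(ENVIRON["FIXES"], pair, "\n")
        for (i = 1; i <= n; i++) {
            split(pair[i], kv, " ")
            fix[kv[1]] = kv[2]
        }
    }
    {
        out = ""
        rest = $0
        # Walk the line one word at a time, swapping in fixes as we go.
        while (match(rest, /[A-Za-z]+/)) {
            word = substr(rest, RSTART, RLENGTH)
            out = out substr(rest, 1, RSTART - 1) ((word in fix) ? fix[word] : word)
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$1"
}
```

With these functions loaded, something like `fix_file subs.srt > subs.fixed.srt` would write a corrected copy (`subs.srt` is a placeholder name). Only words hunspell rejects are touched, so a valid word such as "It" or the pronoun "I" is left alone, and a rejected word is changed only if the all-lowercase-l version passes the spellcheck.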
Two suggestions: