
AWK Bash sed spell-checking hunspell

Conditional find/replace with awk

Posted 2024-12-11 10:15:27

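I want to solve a common but very specific problem: because of OCR errors, many subtitle files contain the character "I" (uppercase i) where there should be an "l" (lowercase L).

My plan of attack is:

Process the file word by word.
Pass each word to the hunspell spell checker ("echo the-word | hunspell -l" produces no output at all if the word is valid, and prints the word if it is not).
If it is a bad word and it contains uppercase I's, replace them with lowercase l's and try again. If it is now a valid word, replace the original word.

I could certainly tokenize and rebuild the whole file in a script, but before I go down that path I would like to know whether awk and/or sed can do this kind of conditional operation at the word level.

Any other suggested approaches are also very welcome!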


Comments (2)

许你一世情深 2024-12-18 10:15:27

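You don't really need more than bash for this:

while IFS= read -r line || [[ -n $line ]]; do
  # split the line into words without glob expansion
  read -ra words <<< "$line"
  for ((i = 0; i < ${#words[@]}; i++)); do
    word=${words[i]}
    if [[ -n $(hunspell -l <<< "$word") ]]; then
      # hunspell had some output, so it rejected the word
      tmp=${word//I/l}
      if [[ $tmp != "$word" && -z $(hunspell -l <<< "$tmp") ]]; then
        # no output for the new word, therefore it's a dictionary word
        words[i]=$tmp
      fi
    fi
  done
  # print the new line (runs of whitespace collapse to single spaces)
  printf '%s\n' "${words[*]}"
done < filename > filename.new

That said, it does seem to make more sense to pass the whole file to hunspell once and parse its output.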
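The whole-file idea can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function names list_misspelled and build_corrections and the tab-separated output format are assumptions made for the example, and it assumes hunspell finds a suitable default dictionary. It runs the checker twice in total, once over the file and once over the candidate replacements, instead of once per word.

```shell
#!/usr/bin/env bash

# Thin wrapper (hypothetical name) so the checker can be swapped out.
# Reads text on stdin and prints the words it rejects, one per line.
list_misspelled() { hunspell -l; }

# build_corrections FILE (hypothetical helper)
# Prints "bad<TAB>fixed" for every rejected word that contains a capital I
# and becomes acceptable once each I is turned into a lowercase l.
build_corrections() {
  local suspects still_bad bad fixed

  # First checker run: rejected words from the whole file, keeping only
  # those that contain a capital I.
  suspects=$(list_misspelled < "$1" | grep 'I' | sort -u)
  [[ -n $suspects ]] || return 0

  # Second checker run: which I -> l candidates are still rejected?
  still_bad=$(tr 'I' 'l' <<< "$suspects" | list_misspelled)

  while IFS= read -r bad; do
    fixed=${bad//I/l}
    # Keep the pair only if the candidate is not in the rejected list.
    grep -qxF -- "$fixed" <<< "$still_bad" || printf '%s\t%s\n' "$bad" "$fixed"
  done <<< "$suspects"
}
```

A call such as build_corrections subtitles.srt > corrections.tsv (both file names are placeholders) would then leave a small lookup table that a later pass can apply to the text.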



惟欲睡 2024-12-18 10:15:27

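Two suggestions:

Fix the problem close to its source, that is, in the OCR software. Can it be made to do a dictionary lookup so that it never produces these misrecognized "I" words in the first place? If not, try a different OCR program that can.
Running every word through hunspell creates one process per word, which wastes a lot of CPU cycles. Try using multiple passes instead: the first pass finds all the words containing "I", the next filters out the ones that are already correct, and the last replaces each word that can be corrected.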
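The final replacement pass is the kind of conditional, word-level operation awk handles well, which also speaks to the original question. Below is a minimal sketch, assuming the earlier passes have produced a tab-separated file of "bad word, corrected word" pairs. The function name apply_corrections and the file layout are hypothetical. Only the first run of ASCII letters in each whitespace-separated token is looked up, so attached punctuation survives, but runs of whitespace are collapsed to single spaces.

```shell
#!/usr/bin/env bash

# apply_corrections MAP_FILE < input > output   (hypothetical helper)
# MAP_FILE holds one "bad<TAB>fixed" pair per line.
apply_corrections() {
  awk -v map="$1" '
    BEGIN {
      # Load the lookup table before any text is read.
      while ((getline entry < map) > 0) {
        split(entry, pair, "\t")
        fix[pair[1]] = pair[2]
      }
      close(map)
    }
    {
      # A single-space separator makes split() break on whitespace runs.
      n = split($0, words, " ")
      line = ""
      for (i = 1; i <= n; i++) {
        w = words[i]
        # Isolate the first run of letters so "heIlo," still matches "heIlo".
        if (match(w, /[A-Za-z]+/)) {
          core = substr(w, RSTART, RLENGTH)
          if (core in fix)
            w = substr(w, 1, RSTART - 1) fix[core] substr(w, RSTART + RLENGTH)
        }
        line = line (i > 1 ? " " : "") w
      }
      print line
    }
  '
}
```

Usage could look like apply_corrections corrections.tsv < subtitles.srt > subtitles.fixed.srt, where all three file names are placeholders. Because the lookup is an exact match on whole words, a valid word such as "I" is left alone unless it appears in the table.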


About the author

゛时过境迁
