Conditional find/replace using awk

Posted on 2024-12-11 10:15:27

I want to solve a common but very specific problem: due to OCR errors, a lot of subtitle files contain the character "I" (upper case i) instead of "l" (lower case L).

My plan of attack is:

  1. Process the file word by word
  2. Pass each word to the hunspell spellchecker ("echo the-word | hunspell -l" produces no response at all if it is valid, and a response if it is bad)
  3. If it is a bad word, AND it has uppercase Is in it, then replace these with lowercase l and try again. If it is now a valid word, replace the original word.

I could certainly tokenize and reconstruct the entire file in a script, but before I go down that path I was wondering if it is possible to use awk and/or sed for these kinds of conditional operations at the word-level?

Any other suggested approaches would also be very welcome!

Comments (2)

许你一世情深 2024-12-18 10:15:27

You don't really need more than bash for this:

while IFS= read -r line; do
  # split the line on whitespace without triggering glob expansion
  read -ra words <<< "$line"
  for ((i=0; i<${#words[@]}; i++)); do
    word=${words[$i]}
    if [[ $(hunspell -l <<< "$word") ]]; then
      # hunspell had some output, so the word is not in the dictionary
      tmp=${word//I/l}
      if [[ $tmp != "$word" ]] && [[ -z $(hunspell -l <<< "$tmp") ]]; then
        # no output for new word, therefore it's a dictionary word
        words[$i]=$tmp
      fi
    fi
  done
  # print the new line (runs of whitespace collapse to single spaces)
  printf '%s\n' "${words[*]}"
done < filename > filename.new

It does seem to make more sense to pass the whole file to hunspell, and parse the output of that.
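
As a rough illustration of that idea, the sketch below runs hunspell once over the whole file and keeps only the rejected words that contain a capital I. The function name list_I_suspects is made up for this example, and it assumes hunspell -l prints one misspelled word per line, as in the loop above.

```bash
# list_I_suspects FILE: print each distinct word in FILE that hunspell
# rejects and that contains a capital I. (Hypothetical helper name.)
list_I_suspects() {
  # A single hunspell process handles the whole file; sort -u removes
  # repeats, and grep keeps only the words that contain a capital I.
  hunspell -l < "$1" | sort -u | grep 'I'
}
```

Calling list_I_suspects filename would give the list of words worth retrying with I replaced by l. Note that grep exits with status 1 when nothing matches, so an empty result is not necessarily an error.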

惟欲睡 2024-12-18 10:15:27

Two suggestions:

  1. Fix the problem closer to where it originates, i.e. in the OCR software. Can it be made to consult a dictionary so that it doesn't even come up with non-words containing 'I'? If not, try a different OCR program that can.
  2. Running each word through hunspell creates a process for each word, which is a massive waste of CPU cycles. Try using several passes instead: the first pass finds all words containing 'I', the second filters out the ones that are already correct, and the third replaces each correctable word.
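
The multi-pass idea in the second suggestion could look roughly like the sketch below. It is only one possible reading of it: the function name fix_ocr_I, the temporary file names, and the ASCII-only word pattern [A-Za-z]+ are all assumptions of this example. It starts two hunspell processes in total, regardless of file size, and uses awk for the word-level conditional replacement the question asks about.

```bash
# fix_ocr_I FILE: print FILE with correctable "I" -> "l" OCR errors fixed.
# (Hypothetical helper; words are assumed to be plain ASCII letters.)
fix_ocr_I() {
  local file=$1 tmp
  tmp=$(mktemp -d) || return 1

  # Pass 1: collect the distinct words containing a capital I, then let
  # hunspell -l keep only the ones it does not recognise.
  grep -oE '[A-Za-z]*I[A-Za-z]*' "$file" | sort -u |
    hunspell -l | sort -u > "$tmp/bad"

  # Pass 2: try I -> l on every bad word; hunspell -l prints the
  # candidates that are still misspelled.
  sed 's/I/l/g' "$tmp/bad" | hunspell -l | sort -u > "$tmp/rejected"

  # Pass 3: awk loads the rejected candidates, maps each fixable bad word
  # to its replacement, then rewrites the file one word at a time.
  awk '
    FILENAME == ARGV[1] { rejected[$0]; next }
    FILENAME == ARGV[2] {
      cand = $0; gsub(/I/, "l", cand)
      if (!(cand in rejected)) fix[$0] = cand
      next
    }
    {
      out = ""; rest = $0
      while (match(rest, /[A-Za-z]+/)) {
        w = substr(rest, RSTART, RLENGTH)
        out = out substr(rest, 1, RSTART - 1) ((w in fix) ? fix[w] : w)
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  ' "$tmp/rejected" "$tmp/bad" "$file"

  rm -rf "$tmp"
}
```

Usage would be along the lines of fix_ocr_I filename > filename.new. Unlike a version that splits and rejoins lines on whitespace, this keeps punctuation and spacing untouched, because only the matched words are swapped.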