Matching files with regular expressions

Posted on 2024-09-05 16:59:05

I have an input file with a list of movies (Note that there might be some repeated entries):

American_beauty__1h56mn38s_
As_Good_As_It_Gets
As_Good_As_It_Gets
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
Capote_EN_DVDRiP_XViD-GeT-AW
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_

For each entry in the first file, I would like to find the corresponding match (line number) in another
reference file:

American beauty.(1h56mn38s)
As Good As It Gets
Capote.EN.DVDRiP.XViD-GeT-AW
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)
Quills (2000)(7.4)

The desired output would be something like (Reference Movie + Line number from the Reference File):

American beauty.(1h56mn38s) 1
As Good As It Gets 2
As Good As It Gets 2
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Capote.EN.DVDRiP.XViD-GeT-AW 3
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4

Basically, the difference between the entries in the two files is that some characters (blank spaces, parentheses, brackets, periods, etc.) have been replaced by underscores.

Could anybody shed some light on this?

Best wishes,

Javier

离旧人 2024-09-12 16:59:05

Awk will work:

gawk '
  NR == FNR {
    # read the reference file first, capture the line numbers and transform
    # the "real" title to one with underscores
    line[$0] = NR
    u = $0
    gsub(/[][ .()]/,"_",u)
    movie[u] = $0
    next
  }
  $0 in movie {
    print movie[$0] " " line[movie[$0]]
  }
' movies.reference movies.list

If hyphens had also been turned into underscores, the regular expression could be simplified to /\W/.
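
If it is not certain exactly which characters were replaced, one option is to normalize both files the same way before comparing them. The following is only a sketch under that assumption: the function name match_movies and the "NO MATCH" message are made up for illustration, and every character outside A-Z, a-z and 0-9 (hyphens included) is collapsed to an underscore on both sides, which could merge titles that differ only in punctuation.

```shell
#!/bin/bash
# match_movies REFERENCE LIST   (hypothetical helper, not part of the answer above)
# Prints "<reference title> <line number>" for every entry of LIST that has a
# counterpart in REFERENCE once both are normalized.
match_movies() {
  awk '
    # Replace every non-alphanumeric character with "_" so that both files
    # are compared in the same normalized form.
    function norm(s) { gsub(/[^A-Za-z0-9]/, "_", s); return s }

    # First file (reference): remember the original title and its line number.
    NR == FNR {
      key = norm($0)
      if (!(key in title)) { title[key] = $0; line[key] = FNR }
      next
    }

    # Second file (list): look up the normalized entry.
    {
      key = norm($0)
      if (key in title) print title[key] " " line[key]
      else print "NO MATCH: " $0 > "/dev/stderr"
    }
  ' "$1" "$2"
}
```

It would be called as `match_movies movies.reference movies.list`. Entries without a counterpart are reported on standard error instead of being skipped silently, which makes gaps in the reference file easier to spot.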

始终不够爱げ你 2024-09-12 16:59:05

Maybe you could just strip all the unwanted characters (from both the file listing and the text file) using sed?

For example:

ls | sed -e 's/[^A-Za-z0-9]//g'

Or, if you want fuzzier matching, you could compute a minimum edit distance on the processed file names (or on a tokenized version).
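
To make the edit-distance idea concrete, here is a minimal sketch of the Levenshtein distance written in awk and wrapped in a shell function. The name edit_distance is hypothetical, and the two strings are assumed to contain no backslashes, because awk interprets escape sequences in values passed with -v.

```shell
#!/bin/bash
# edit_distance A B   (hypothetical helper)
# Prints the Levenshtein distance between the strings A and B.
edit_distance() {
  awk -v a="$1" -v b="$2" '
    function min3(x, y, z) { if (y < x) x = y; if (z < x) x = z; return x }
    BEGIN {
      la = length(a); lb = length(b)
      # prev[j]: distance between the first i-1 characters of a
      # and the first j characters of b.
      for (j = 0; j <= lb; j++) prev[j] = j
      for (i = 1; i <= la; i++) {
        cur[0] = i
        for (j = 1; j <= lb; j++) {
          cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
          # deletion, insertion, substitution (or match when cost is 0)
          cur[j] = min3(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        for (j = 0; j <= lb; j++) prev[j] = cur[j]
      }
      print prev[lb]
    }
  '
}
```

For instance, `edit_distance kitten sitting` prints 3. A fuzzy matcher could then pick, for each processed file name, the reference title with the smallest distance, at the cost of one awk invocation per pair, so it is only practical for short lists.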

轮廓§ 2024-09-12 16:59:05

Give this a try. It won't be particularly fast:

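#!/bin/bash
# Characters in the reference titles that appear as underscores in the input list
chars='[]() .'
while IFS= read -r line
do
    # -F: literal string, -x: whole line, -n: line number, -m1: first match only
    num=$( tr "$chars" '_' < movies.reference | grep -Fxn -m1 -- "$line" | cut -d: -f1 )
    [[ -n $num ]] || continue    # skip entries that have no match
    echo "$( sed -n "${num}{p;q;}" movies.reference ) $num"
done < movies.input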
