Matching files with regular expressions

Posted on 2024-09-05 16:59:05

I have an input file with a list of movies (Note that there might be some repeated entries):

American_beauty__1h56mn38s_
As_Good_As_It_Gets
As_Good_As_It_Gets
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
Capote_EN_DVDRiP_XViD-GeT-AW
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_

I would like to find the corresponding match (line number) in a separate
reference file for each of the entries in the first file:

American beauty.(1h56mn38s)
As Good As It Gets
Capote.EN.DVDRiP.XViD-GeT-AW
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)
Quills (2000)(7.4) 

The desired output would be something like (Reference Movie + Line number from the Reference File):

American beauty.(1h56mn38s) 1
As Good As It Gets 2
As Good As It Gets 2
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Capote.EN.DVDRiP.XViD-GeT-AW 3
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4

Basically, the only difference between the entries in the two files is that some characters (blank spaces, parentheses, periods, etc.) have been replaced by underscores.

Could anybody shed some light on this?

Best wishes,

Javier

Comments (3)

离旧人 2024-09-12 16:59:05

Awk will work:

gawk '
  NR == FNR {
    # read the reference file first, capture the line numbers and transform
    # the "real" title to one with underscores
    line[$0] = NR
    u = $0
    gsub(/[][ .()]/,"_",u)
    movie[u] = $0
    next
  }
  $0 in movie {
    print movie[$0] " " line[movie[$0]]
  }
' movies.reference movies.list

The regular expression could be simplified to /\W/ if hyphens were also turned into underscores.
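
To see why the hyphen matters here, the following shell sketch applies both character classes to one of the reference titles. The helper names `to_underscores` and `to_underscores_w` are made up for this illustration, and `[^[:alnum:]_]` is used as the POSIX spelling of gawk's `\W`.

```shell
# Hypothetical helper: same character class as the gawk script above.
to_underscores() {
    printf '%s\n' "$1" | sed 's/[][ .()]/_/g'
}

# Hypothetical helper: the \W variant, written as a POSIX bracket expression.
to_underscores_w() {
    printf '%s\n' "$1" | sed 's/[^[:alnum:]_]/_/g'
}

to_underscores   'Capote.EN.DVDRiP.XViD-GeT-AW'   # hyphens are kept
to_underscores_w 'Capote.EN.DVDRiP.XViD-GeT-AW'   # hyphens become underscores
```

Only the first form reproduces the list entry `Capote_EN_DVDRiP_XViD-GeT-AW`, so with the sample data the explicit character class is the one that matches.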

始终不够爱げ你 2024-09-12 16:59:05

Maybe you could just strip all the unwanted characters (from both the file listing and the text file) using sed?

For example:

ls | sed -e 's/[^a-z0-9]//gi'

Or, if you want more fuzziness, you could compute a minimum edit distance on the processed filenames (or on a tokenized version of them).
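
One way the stripping idea could be turned into a lookup is sketched below. It is only an illustration under stated assumptions: `match_by_key` is a hypothetical helper name, the key is the title lowercased with every non-alphanumeric character removed, and when two reference titles collapse to the same key the first one wins.

```shell
# Hypothetical helper: match list entries to reference titles by a key that
# ignores case and all non-alphanumeric characters.
# Usage: match_by_key REFERENCE_FILE LIST_FILE
match_by_key() {
    awk '
        # Build the comparison key: lowercase, then drop everything
        # that is not a letter or a digit.
        function key(s) {
            s = tolower(s)
            gsub(/[^a-z0-9]/, "", s)
            return s
        }
        # First file (reference): remember title and line number per key.
        NR == FNR {
            k = key($0)
            if (!(k in title)) { title[k] = $0; num[k] = FNR }
            next
        }
        # Second file (list): print the reference title and its line number.
        {
            k = key($0)
            if (k in title) print title[k] " " num[k]
        }
    ' "$1" "$2"
}
```

With the file names used elsewhere in this thread it would be called as `match_by_key movies.reference movies.list`. Because the key discards punctuation entirely, it is more forgiving than an exact underscore mapping, but it can also merge titles that differ only in punctuation.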

轮廓§ 2024-09-12 16:59:05

Give this a try. It won't be particularly fast:

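#!/bin/bash
# Characters that were replaced by underscores in the input list
chars='[]() .'
while IFS= read -r line
do
    # Translate the reference titles to their underscore form, then take the
    # line number of the first exact (fixed-string, whole-line) match
    num=$( grep -F -x -n -m 1 -- "$line" <( tr "$chars" '_' < movies.reference ) | awk -F: '{print $1}' )
    # Skip entries that have no match in the reference file
    [[ -n $num ]] || continue
    # Print the original reference title followed by its line number
    echo "$( sed -n "${num}{p;q}" movies.reference ) $num"
done < movies.input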