Matching files using regular expressions
I have an input file with a list of movies (Note that there might be some repeated entries):
American_beauty__1h56mn38s_
As_Good_As_It_Gets
As_Good_As_It_Gets
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
Capote_EN_DVDRiP_XViD-GeT-AW
_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_
I would like to find the corresponding match (line number) in another reference file for each of the entries in the first file:
American beauty.(1h56mn38s)
As Good As It Gets
Capote.EN.DVDRiP.XViD-GeT-AW
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)
Quills (2000)(7.4)
The desired output would be something like (Reference Movie + Line number from the Reference File):
American beauty.(1h56mn38s) 1
As Good As It Gets 2
As Good As It Gets 2
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Capote.EN.DVDRiP.XViD-GeT-AW 3
[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman) 4
Basically, the difference between the entries in the two files is that some characters (blank spaces, parentheses, periods, etc.) have been replaced by underscores.
Could anybody shed some light on this?
Best wishes,
Javier
Comments (3)
Awk will work:
The regular expression could be simplified if hyphens were also turned into underscores (it would be /\W/ then).
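A minimal Python sketch of the idea behind this answer, not the answerer's awk script: normalize each reference title the same way the input file appears to have been mangled (every character other than a letter, digit, underscore, or hyphen becomes an underscore), then look up each input entry in a table keyed by the normalized title. The names `normalize` and `match_titles` are hypothetical, and the exact character set is an assumption inferred from the sample data in the question.

```python
import re

def normalize(title):
    # Assumption: anything that is not a letter, digit, underscore or hyphen
    # was turned into "_" when the input file was produced.
    return re.sub(r"[^\w-]", "_", title)

def match_titles(entries, reference):
    # Map each normalized reference title to (original title, 1-based line number).
    index = {}
    for lineno, title in enumerate(reference, start=1):
        index.setdefault(normalize(title), (title, lineno))
    # Entries with no counterpart in the reference list map to None.
    return [index.get(entry) for entry in entries]

reference = [
    "American beauty.(1h56mn38s)",
    "As Good As It Gets",
    "Capote.EN.DVDRiP.XViD-GeT-AW",
    "[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)",
    "Quills (2000)(7.4)",
]
entries = [
    "American_beauty__1h56mn38s_",
    "As_Good_As_It_Gets",
    "As_Good_As_It_Gets",
    "_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_",
    "Capote_EN_DVDRiP_XViD-GeT-AW",
    "_DivX-ITA__Casablanca_M_CURTIZ_1942_Bogart-bergman_",
]

for hit in match_titles(entries, reference):
    if hit is None:
        print("no match")
    else:
        title, lineno = hit
        print(title, lineno)
```

With the sample data from the question, this prints each reference title followed by its line number, in the order of the input entries. Building the lookup table once keeps the work to a single pass over each file.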
Maybe you could just strip all the unwanted characters (from both the file listing and the text file) using sed?
e.g.
Or, if you want more fuzziness, you could compute a minimum edit distance on the processed filename (or a tokenized version).
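As a rough illustration of both suggestions, here is a Python sketch (standing in for the sed command, which is not reproduced here) that strips everything except letters and digits from both sides and then picks the reference line with the smallest Levenshtein edit distance. The names `strip_title`, `edit_distance`, and `closest` are hypothetical, and lowercasing the titles is an extra assumption.

```python
import re

def strip_title(title):
    # Keep only letters and digits, and ignore case (assumption).
    return re.sub(r"[^A-Za-z0-9]", "", title).lower()

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def closest(entry, reference):
    # Return (reference title, 1-based line number) with the smallest distance.
    key = strip_title(entry)
    best = min(range(len(reference)),
               key=lambda i: edit_distance(key, strip_title(reference[i])))
    return reference[best], best + 1

reference = [
    "American beauty.(1h56mn38s)",
    "As Good As It Gets",
    "Capote.EN.DVDRiP.XViD-GeT-AW",
    "[DivX-ITA] Casablanca(M.CURTIZ 1942 Bogart-bergman)",
    "Quills (2000)(7.4)",
]

title, lineno = closest("Capote_EN_DVDRip_XVID", reference)
print(title, lineno)
```

Note that `closest` always returns some line, even for an entry that has no real counterpart, so a practical version would probably reject matches whose distance exceeds a chosen threshold.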
Give this a try. It won't be particularly fast: