Python string cleaning
I am writing a program in PyQt that needs to take messy strings and clean them up. The possible input values are extremely variable. For example, I would like to take the strings:
"Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
"The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
"1892.XVID.AC3.HD.120_min.avi"
and turn them into:
"Seven Pounds",
"The Birds",
"1892"
I have considered using re to strip out the unwanted expressions, but this method seems likely to fail for the last example. The program Media Gerbil uses the Google diff-match-patch algorithm to deal with string cleaning. This seems like a better alternative, but I am not sure how to implement it.
Is there another, more effective method for cleaning strings in Python/PyQt, or is regex or diff-match-patch the best route to follow?
Comments (5)
Based on your example:
will print:
Looking at diff-match-patch, match seems to be the function closest to what you are describing, but I don't think it's the best solution here, since match apparently wants to find a specific pattern (not apply regex rules).
I think you might want to define a series of regex rules, such as underscore being treated like a space between words, and any non- [a-zA-Z0-9_]+ possibly signaling the end of the title. You would have to at least make the assumption that your title starts from the beginning of the string, and then pattern match until a "non-word" character is reached.
Maybe something like this?
rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')
But unfortunately, as mentioned in another of these answers, there is no way to really deal with "The Birds 1963". I think the solution is a combination of assumptions about where the title starts and where it might stop, plus a list of common tags to strip out.
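As a rough illustration of how far the regex alone gets, here is a minimal sketch that applies rx to the three sample filenames from the question. The helper name leading_title is made up for this example; only rx comes from the answer.

```python
import re

# Pattern from the answer: a run of word characters from the start of the
# string, ending on a letter or digit, optionally followed by "_" or ".".
rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')

def leading_title(filename):
    """Hypothetical helper: return the leading word run, underscores as spaces."""
    m = rx.match(filename)
    if m is None:
        return ""
    return m.group(1).replace("_", " ")

samples = [
    "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
    "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
    "1892.XVID.AC3.HD.120_min.avi",
]
for name in samples:
    print(leading_title(name))
```

The first and last samples come out as "Seven Pounds" and "1892", but the middle one keeps everything up to the hyphen ("The Birds 1963 HDTV XvidHD 720p"), which is why a tag list is still needed.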
Edit - Thought of some more info
Maybe once you have narrowed down your potential title as far as you can, you could then run Google diff-match-patch against the results of an API search on imdb.com and pick the closest match to a real title.
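The closest-match idea can be sketched without diff-match-patch. The example below swaps in difflib from the standard library, which is a different algorithm, and assumes the candidate titles have already been fetched from some search. The candidate list and the function name closest_title are invented for illustration.

```python
import difflib

def closest_title(rough_title, candidates, cutoff=0.6):
    """Return the candidate most similar to rough_title, or None.

    Uses difflib (not diff-match-patch) and compares case-insensitively.
    """
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        rough_title.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None

# Hypothetical search results; in practice these would come from a lookup.
candidates = ["The Birds", "The Birdcage", "Seven Pounds"]
print(closest_title("The Birds 1963", candidates))
```

Lowering cutoff accepts looser matches at the cost of more false positives, so it is worth tuning against a real collection.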
I actually did this at one point... you basically follow a series of steps
In your case, you'll get:
Now you basically keep a list of words to purge from the list before you look at it. Obvious ones from this example are x264, Multisub, bluray, HDTV, XvidHD, Xvid, HD, 720p, 1080p, AC3. Note that you'll want to do case-insensitive compares here.
Note that this list will expand manually as you go through a collection, and that leaves you with
This is probably about as good as you'll get for a semi-automated system. One of the above methods would tell you to purge numbers that don't appear at the front, but I'd point out that you'll mess up things like "Toy Story 2".
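A minimal sketch of the split-and-purge idea, assuming the filename is split on every non-alphanumeric character after dropping the extension. The purge list holds only the tags named in this thread, and the function name split_and_purge is invented here; this is not the original script.

```python
import os
import re

# Starter purge list taken from the tags mentioned in this thread, lowercased
# so the comparison is case-insensitive. Expect to extend it by hand.
PURGE = {"x264", "multisub", "bluray", "bdrip", "hdtv", "xvidhd",
         "xvid", "hd", "720p", "1080p", "ac3"}

def split_and_purge(filename):
    """Hypothetical helper: tokenize a filename and drop known junk tags."""
    stem, _ext = os.path.splitext(filename)         # drop ".mkv", ".avi", ...
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", stem) if t]
    return [t for t in tokens if t.lower() not in PURGE]

for name in ["Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
             "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
             "1892.XVID.AC3.HD.120_min.avi"]:
    print(" ".join(split_and_purge(name)))
```

With this short list the leftovers still include things like ENG, NPW, and "120 min", which is the manual-expansion step described above: each new junk word gets added to PURGE as it turns up.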
In my case, I did the above processing, and then tried to figure out which directory patterns matched for archival. Then I had a curses-based interface that allowed me to scroll through and manually correct the script's conclusions (including renaming).
EDIT: On second thought, my script actually made the assumption that a second set of numbers (as well as everything afterwards) could be safely removed. These are all heuristics though, and you will run into exceptions. Adding that step would have corrected the last example title to 1892.
Judging from the examples, it looks like this will be extremely tricky, regardless of technique. How should the program know that 1963 isn't part of the title of the middle movie? Maybe your best bet is to have a list of acronyms and then truncate the string from the first matching acronym onward. That would still leave you with "The Birds 1963" to deal with, but I really see no way around that.
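Truncating at the first known acronym could look something like the sketch below. The tag set and the name truncate_at_first_tag are assumptions made for this example, not an established list.

```python
import re

# Hypothetical acronym list, lowercased; any real list would be much longer.
KNOWN_TAGS = {"bdrip", "hdtv", "xvid", "xvidhd", "x264", "ac3", "hd"}

def truncate_at_first_tag(filename):
    """Keep the tokens that come before the first recognized tag."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", filename) if t]
    kept = []
    for token in tokens:
        if token.lower() in KNOWN_TAGS:
            break                      # discard this tag and everything after it
        kept.append(token)
    return " ".join(kept)

for name in ["Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
             "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
             "1892.XVID.AC3.HD.120_min.avi"]:
    print(truncate_at_first_tag(name))
```

On the three samples this yields "Seven Pounds", "The Birds 1963", and "1892", so the year in the middle title survives as expected. A filename with no recognized tag comes back whole, extension included.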
Cut by underscores, spaces, dots.
Filter out obvious parts like x264 or BDRip or multisub.
Query IMDB for a movie with these words in the name :)