Python string cleaning

Posted on 2024-12-08 16:39:49

I am writing a program in PyQt that needs to take messy strings and clean them up. The possible input values are extremely variable. For example, I would like to take the strings:

"Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",  
"The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",  
"1892.XVID.AC3.HD.120_min.avi"  

and turn them into:
"Seven Pounds",
"The Birds",
"1892"

I have considered using re to escape expressions, but this method seems likely to fail for the last example. The program Media Gerbil uses the Google diff-match-patch algorithm to deal with string cleaning. This seems like a better alternative, but I am not sure how to implement it.
Is there another, more effective method for cleaning strings in Python/PyQt, or is regex or diff-match-patch the best route to follow?

Comments (5)

不可一世的女人 2024-12-15 16:39:49

Based on your example:

import re

a="The_Birds_1963_HDTV_XvidHD_720p-NPW.avi"
b="Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv"
c="1892.XVID.AC3.HD.120_min.avi"

def cleanit(name):
    # Split on underscores and dots, then keep the leading tokens that are
    # of the same kind (alphabetic or numeric) as the first token.
    tokens = re.split('[_.]', name)
    if re.match('^[a-zA-Z]+', tokens[0]):
        pattern = '^[a-zA-Z]+'
    elif re.match('^[0-9]+', tokens[0]):
        pattern = '^[0-9]+'
    else:
        return ""

    result = []
    for token in tokens:
        # Stop at the first token of a different kind.
        if not re.match(pattern, token):
            break
        result.append(token)
    return " ".join(result)

if __name__ == '__main__':
    print(cleanit(a))
    print(cleanit(b))
    print(cleanit(c))

This will print:

kent$  python cleanit.py
The Birds
Seven Pounds
1892

半世蒼涼 2024-12-15 16:39:49

Looking at diff-match-patch, its match function seems closest to what you are describing, but it seems to me that it may not be the best solution, as match apparently wants to find specific patterns rather than regex rules.

I think you might want to define a series of regex rules, such as treating an underscore as a space between words, and treating any character outside [a-zA-Z0-9_] as possibly signaling the end of the title. You would have to at least assume that your title starts at the beginning of the string, and then pattern match until a "non-word" character is reached.

Maybe something like this?

rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')

But unfortunately, as mentioned in another of these answers, there is no way to really deal with "The Birds 1963". I think the solution is a combination of assuming where the title should start and possibly stop, and keeping a list of common tags to strip out.
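
To make the limitation concrete, here is a small sketch that applies the regex above to the three file names from the question. The helper name leading_words is made up for illustration. Note how the second name keeps all of its tags, which is why a tag list is still needed on top of the pattern.

```python
import re

# The pattern from the answer: a run of word characters that ends in a
# letter or digit, optionally followed by "_" or ".".
rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')

def leading_words(filename):
    """Hypothetical helper: return the first run matched by rx, with
    underscores turned into spaces, or an empty string if nothing matches."""
    m = rx.match(filename)
    return m.group(1).replace("_", " ") if m else ""

names = [
    "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
    "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
    "1892.XVID.AC3.HD.120_min.avi",
]
for name in names:
    print(repr(leading_words(name)))
# 'Seven Pounds'
# 'The Birds 1963 HDTV XvidHD 720p'
# '1892'
```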

Edit - Thought of some more info

Maybe once you have narrowed down your potential title as far as you can, you could then run Google diff-match-patch against the results of an API search on imdb.com and find the closest match to a real title.
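
As a rough sketch of that last idea, the snippet below swaps in difflib.get_close_matches from the Python standard library instead of diff-match-patch, and uses a hard-coded candidate list in place of a real IMDb search. Both substitutions are assumptions made only to keep the example self-contained; the function name closest_title and the candidate titles are hypothetical.

```python
import difflib

# Stand-in for titles returned by a real movie database search (assumption).
CANDIDATE_TITLES = ["The Birds", "The Birdcage", "Seven Pounds", "1892"]

def closest_title(rough_title, candidates=CANDIDATE_TITLES, cutoff=0.6):
    """Return the candidate most similar to rough_title, or None when no
    candidate reaches the similarity cutoff."""
    matches = difflib.get_close_matches(rough_title, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_title("The Birds 1963"))  # The Birds
print(closest_title("Seven Pounds"))    # Seven Pounds
```

The comparison is case-sensitive, so it may help to lowercase both sides before matching.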

我们只是彼此的过ke 2024-12-15 16:39:49

I actually did this at one point. You basically follow a series of steps:

  • Eliminate anything in []'s, ()'s or {}'s
  • Remove the file extension
  • Now split on [\s._-]

In your case, you'll get:

Seven Pounds Multisub x264 bluray
The Birds 1963 HDTV XvidHD 720p NPW
1892 XVID AC3 HD 120 min
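
The three steps above can be sketched roughly as follows. The function name tokenize is made up for illustration, and the hyphen is placed last in the character class so that it is read as a literal rather than as a range.

```python
import os
import re

def tokenize(filename):
    """Hypothetical helper implementing the three steps listed above."""
    # Step 1: drop anything inside (), [] or {}.
    name = re.sub(r'\([^)]*\)|\[[^\]]*\]|\{[^}]*\}', '', filename)
    # Step 2: remove the file extension.
    name = os.path.splitext(name)[0]
    # Step 3: split on whitespace, dots, underscores and hyphens,
    # discarding the empty strings left by adjacent separators.
    return [t for t in re.split(r'[\s._-]+', name) if t]

for name in [
    "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
    "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
    "1892.XVID.AC3.HD.120_min.avi",
]:
    print(" ".join(tokenize(name)))
```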

Now you basically keep a list of words to purge from the token list before you look at it. Obvious ones from this example are x264, Multisub, bluray, HDTV, XvidHD, Xvid, HD, 720p, 1080p, AC3. Note that you'll want to do case-insensitive compares here.

You will need to extend this list by hand as you go through a collection. Purging those words leaves you with

Seven Pounds
The Birds 1963
1892 120 min

This is probably about as good as you'll get for a semi-automated system. One of the above methods would tell you to purge numbers that don't appear at the front, but I'd point out that you'll mess up things like "Toy Story 2".

In my case, I did the above processing, and then tried to figure out which directory patterns matched for archival. Then I had a curses-based interface that allowed me to scroll through and manually correct the script's conclusions (including renaming).

EDIT: On second thought, my script actually made the assumption that a second set of numbers (as well as everything afterwards) could be safely removed. These are all heuristics though, and you will run into exceptions. Adding that step would have corrected the last example title to 1892.
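
A minimal sketch of the purge step plus the second-number heuristic might look like this. The junk list is only a starting point; "npw" is added here as an assumption so that the second example comes out as shown above. "A second set of numbers" is interpreted as the second all-digit token, which is also an assumption. All names are hypothetical.

```python
# Lowercase words to purge; extend by hand for your own collection.
JUNK_WORDS = {
    "x264", "multisub", "bluray", "hdtv", "xvidhd", "xvid",
    "hd", "720p", "1080p", "ac3", "npw",
}

def purge(tokens, junk=JUNK_WORDS):
    """Drop tokens that appear in the junk list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in junk]

def cut_at_second_number(tokens):
    """Drop the second all-digit token and everything after it."""
    numbers_seen = 0
    for i, token in enumerate(tokens):
        if token.isdigit():
            numbers_seen += 1
            if numbers_seen == 2:
                return tokens[:i]
    return tokens

def clean(tokens):
    return " ".join(cut_at_second_number(purge(tokens)))

print(clean("Seven Pounds Multisub x264 bluray".split()))    # Seven Pounds
print(clean("The Birds 1963 HDTV XvidHD 720p NPW".split()))  # The Birds 1963
print(clean("1892 XVID AC3 HD 120 min".split()))             # 1892
print(clean("Toy Story 2".split()))                          # Toy Story 2
```

Because only the second number triggers the cut, a title such as "Toy Story 2" survives, while "The Birds 1963" still keeps its year.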

谁人与我共长歌 2024-12-15 16:39:49

Judging from the examples, it looks like this will be extremely tricky, regardless of technique. How should the program know that 1963 isn't part of the title of the middle movie? Maybe your best bet is to have a list of acronyms and then truncate the string from the first matching acronym onward. It would give you The Birds 1963 to deal with, but I really see no way around that.
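
The truncation idea could be sketched like this. The tag list and the function name are hypothetical, and splitting on underscores, dots, spaces, hyphens and parentheses is an assumption about how the names are delimited.

```python
import os
import re

# Hypothetical list of known tags/acronyms, compared in lowercase.
KNOWN_TAGS = {"bdrip", "hdtv", "xvid", "xvidhd", "x264", "bluray", "ac3", "hd", "multisub"}

def truncate_at_first_tag(filename):
    """Keep only the tokens that come before the first known tag."""
    stem = os.path.splitext(filename)[0]  # drop the file extension
    tokens = [t for t in re.split(r'[\s._()-]+', stem) if t]
    for i, token in enumerate(tokens):
        if token.lower() in KNOWN_TAGS:
            return " ".join(tokens[:i])
    return " ".join(tokens)  # no tag found: keep everything

print(truncate_at_first_tag("Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv"))  # Seven Pounds
print(truncate_at_first_tag("The_Birds_1963_HDTV_XvidHD_720p-NPW.avi"))                           # The Birds 1963
print(truncate_at_first_tag("1892.XVID.AC3.HD.120_min.avi"))                                      # 1892
```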

恰似旧人归 2024-12-15 16:39:49

Cut by underscores, spaces, dots.

Filter out obvious parts like x264 or BDRip or multisub.

Query IMDB for a movie with these words in the name :)
