Python string cleaning

Posted on 2024-12-08 16:39:49

I am writing a program in PyQt that needs to take messy strings and clean them up. The possible input values are extremely variable. For example, I would like to take the strings:

"Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",  
"The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",  
"1892.XVID.AC3.HD.120_min.avi"

and turn them into:
"Seven Pounds",
"The Birds",
"1892"

I have considered using the re module to strip the unwanted parts, but this approach seems likely to fail for the last example. The program Media Gerbil uses Google's diff-match-patch algorithm to deal with string cleaning. This seems like a better alternative, but I am not sure how to implement it.
Is there another, more effective method for cleaning strings in Python/PyQt, or is regex or diff-match-patch the best route to follow?

不可一世的女人 2024-12-15 16:39:49

Based on your example:

import re

a = "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi"
b = "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv"
c = "1892.XVID.AC3.HD.120_min.avi"

def cleanit(name):
    # Split the file name into tokens on underscores and dots.
    tokens = re.split(r'[_.]', name)
    # The first token decides whether the title is made of words or of digits.
    if re.match(r'[a-zA-Z]+', tokens[0]):
        pattern = r'[a-zA-Z]+'
    elif re.match(r'[0-9]+', tokens[0]):
        pattern = r'[0-9]+'
    else:
        return ""  # the first token starts with neither a letter nor a digit

    # Keep the leading tokens of the same kind; stop at the first one that differs.
    result = []
    for token in tokens:
        if not re.match(pattern, token):
            break
        result.append(token)
    return " ".join(result)

if __name__ == '__main__':
    print(cleanit(a))
    print(cleanit(b))
    print(cleanit(c))

will print:

kent$  python cleanit.py
The Birds
Seven Pounds
1892

半世蒼涼 2024-12-15 16:39:49

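From the looks of diff-match-patch, match is the closest thing to what you are describing, and it seems to me that it might not be the best solution, because match apparently wants to find specific patterns (as opposed to regex rules).

I think you will probably want to define a series of regex rules, such as treating underscores as spaces between words, and treating anything that is not [a-zA-Z0-9_]+ as a possible end of the title. You would at least have to assume that the title starts at the beginning of the string, and then pattern match until you reach a "non-word" character.

Something like this, maybe?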

rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')

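Unfortunately, as mentioned in another of these answers, there is no real way to handle "The Birds 1963". I think the solution would be a combination of assumptions about where the title should start and where it might stop, possibly together with a list of common tags to remove.

Edit: thought of more info

Maybe once you have narrowed down your potential title as best you can, you could run Google's diff-match-patch against an API search from IMDB.com and find the closest match to a real title.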
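
As a rough illustration, here is how the pattern above behaves on the three sample file names from the question. The helper name leading_run is made up for this sketch; the only processing beyond the regex itself is replacing underscores with spaces.

```python
import re

# The pattern from the answer: a run of letters, digits and underscores
# that ends on a letter or digit, optionally followed by "_" or ".".
rx = re.compile(r'([a-zA-Z\d_]+[a-zA-Z\d])[_.]?')

def leading_run(filename):
    """Return the leading run matched by rx, with underscores turned into spaces."""
    m = rx.match(filename)
    return m.group(1).replace('_', ' ') if m else ''

samples = [
    "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
    "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
    "1892.XVID.AC3.HD.120_min.avi",
]
for name in samples:
    print(leading_run(name))
```

This prints "Seven Pounds", "The Birds 1963 HDTV XvidHD 720p" and "1892": the first and last names happen to be cut at the right place by the "(" and the ".", while the second keeps every tag up to the hyphen.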
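
The "closest match" idea can be sketched without any external service. The example below swaps in Python's standard difflib module in place of diff-match-patch, and uses a small hard-coded list as a hypothetical stand-in for titles returned by a movie database search; the function name closest_title and the 0.6 cutoff are assumptions made for illustration.

```python
import difflib

def closest_title(candidate, known_titles, cutoff=0.6):
    """Return the known title most similar to candidate, or None if nothing is close."""
    # Compare in lower case, but hand back the title in its original spelling.
    by_lower = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(
        candidate.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None

# Hypothetical stand-in for the results of a real title search.
known_titles = ["Seven Pounds", "The Birds", "1892", "Toy Story 2"]

print(closest_title("The Birds 1963", known_titles))  # The Birds
```

A candidate with nothing similar in the list returns None, which is a natural point to fall back to manual correction.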

我们只是彼此的过ke 2024-12-15 16:39:49

I actually did this at one point... you basically follow a series of steps

Eliminate anything in []'s, ()'s or {}'s
Remove the file extension
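Now split on [\s._-] (the hyphen goes last so that it is taken literally; written as .-_ it would form a character range that also matches digits and uppercase letters)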

In your case, you'll get:

Seven Pounds Multisub x264 bluray
The Birds 1963 HDTV XvidHD 720p NPW
1892 XVID AC3 HD 120 min

Now you keep a list of words to purge from the token list before you look at it. Obvious ones from this example are x264, Multisub, bluray, HDTV, XvidHD, Xvid, HD, 720p, 1080p, AC3. Note that you'll want to do case-insensitive compares here.

You will keep extending this list by hand as you go through a collection. Purging those words leaves you with:

Seven Pounds
The Birds 1963
1892 120 min

This is probably about as good as you'll get for a semi-automated system. One of the above methods would tell you to purge numbers that don't appear at the front, but I'd point out that you'll mess up things like "Toy Story 2".
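
The steps above can be sketched roughly as follows. This is not the original script, only an illustration: the function names are made up, and the purge set contains the tags listed in the answer plus "npw", which is assumed here to be one of the entries added by hand.

```python
import os
import re

# Assumed purge list, stored in lower case for case-insensitive comparison.
# "npw" is a manual addition for this sample, as the answer anticipates.
PURGE = {"x264", "multisub", "bluray", "hdtv", "xvidhd", "xvid",
         "hd", "720p", "1080p", "ac3", "npw"}

def tokenize(filename):
    # Step 1: drop anything inside [], () or {}.
    name = re.sub(r'\[[^\]]*\]|\([^)]*\)|\{[^}]*\}', '', filename)
    # Step 2: remove the file extension.
    name = os.path.splitext(name)[0]
    # Step 3: split on whitespace, dots, underscores and hyphens; drop empty pieces.
    return [t for t in re.split(r'[\s._-]+', name) if t]

def purge(tokens):
    # Remove known tags, comparing case-insensitively.
    return [t for t in tokens if t.lower() not in PURGE]

samples = [
    "Seven_Pounds_(BDrip_1080p_ENG-ITA-GER)_Multisub_x264_bluray_.mkv",
    "The_Birds_1963_HDTV_XvidHD_720p-NPW.avi",
    "1892.XVID.AC3.HD.120_min.avi",
]
for name in samples:
    print(" ".join(purge(tokenize(name))))
```

With this purge set the three samples come out as "Seven Pounds", "The Birds 1963" and "1892 120 min", matching the lists shown above.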

In my case, I did the above processing, and then tried to figure out which directory patterns matched for archival. Then I had a curses-based interface that allowed me to scroll through and manually correct the script's conclusions (including renaming).

EDIT: On second thought, my script actually made the assumption that a second set of numbers (as well as everything afterwards) could be safely removed. These are all heuristics though, and you will run into exceptions. Adding that step would have corrected the last example title to 1892.
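
One possible reading of that extra step, as a minimal sketch: treat a "set of numbers" as a token made up only of digits, and cut the token list at the second such token. Both that interpretation and the function name are assumptions, not a description of the original script.

```python
def drop_after_second_number(tokens):
    """Keep tokens up to, but not including, the second all-digit token."""
    kept = []
    numbers_seen = 0
    for token in tokens:
        if token.isdigit():
            numbers_seen += 1
            if numbers_seen == 2:
                break  # discard this number and everything after it
        kept.append(token)
    return kept

print(" ".join(drop_after_second_number(["1892", "120", "min"])))      # 1892
print(" ".join(drop_after_second_number(["The", "Birds", "1963"])))    # The Birds 1963
print(" ".join(drop_after_second_number(["Toy", "Story", "2"])))       # Toy Story 2
```

Titles with a single number, such as "Toy Story 2", pass through unchanged, but so does "The Birds 1963", so the year still has to be handled some other way.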
