Normalising book titles - Python

Posted on 2024-08-25 18:35:39

I have a list of book titles:

"The Hobbit: 70th Anniversary Edition"
"The Hobbit"
"The Hobbit (Illustrated/Collector Edition)[There and Back Again]"
"The Hobbit: or, There and Back Again"
"The Hobbit: Gift Pack"

and so on...

I thought that if I normalised the titles somehow, it would be easier to implement an automated way to know what book each edition is referring to.

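import string

# Attempt 1: keep only ASCII letters and digits (this also strips spaces)
allowed = string.ascii_letters + string.digits
normalised = ''.join(char for char in title if char in allowed)

or

# Attempt 2: cut the title at the first separator character
def normalise(title):
    normalised = ''
    for char in title:
        if char in ':/()|':
            break
        normalised += char
    return normalised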

But neither works as intended, since titles can contain special characters and different editions can have very different title layouts.

Help would be very much appreciated! Thanks :)

↘紸啶 2024-09-01 18:35:39

It depends completely on your data. For the examples you gave, a simple normalization solution could be:

import re

book_normalized = re.sub(r':.*|\[.*?\]|\(.*?\)|\{.*?\}', '', book_name).strip()

This returns "The Hobbit" for all of your examples. It removes everything from the first colon onward, anything inside round, square, or curly brackets, and any leading or trailing whitespace.
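As a quick check, the same substitution can be wrapped in a function and run over the titles from the question. The names `normalise_title` and `titles` are only for illustration and are not part of the original snippet:

```python
import re

# Remove everything from the first colon onward,
# plus any (...), [...] or {...} group.
_EDITION_NOISE = re.compile(r':.*|\[.*?\]|\(.*?\)|\{.*?\}')

def normalise_title(book_name):
    """Return the title with colon suffixes and bracketed parts removed."""
    return _EDITION_NOISE.sub('', book_name).strip()

titles = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]

# Every edition collapses to the same normalised title.
print({normalise_title(t) for t in titles})  # {'The Hobbit'}
```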

However, this is not a very good solution in the general case, as some books have colons or bracketed parts in the actual book name. For example, a title may consist of the name of the series, followed by a colon, followed by the name of the particular entry in the series.
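One way to be more conservative, sketched here under the assumption that edition information can be recognised by a few keywords, is to strip a colon suffix or bracketed group only when it contains one of those words. The keyword list and the name `strip_edition_info` are hypothetical and would need tuning against real data:

```python
import re

# Hypothetical keyword list; extend it to match your own data.
EDITION_WORDS = r'(?:edition|anniversary|illustrated|collector|gift pack)'

# A bracketed group containing an edition word, e.g. "(Illustrated Edition)".
_BRACKETED = re.compile(
    r'\s*[\(\[\{][^\)\]\}]*' + EDITION_WORDS + r'[^\)\]\}]*[\)\]\}]',
    re.IGNORECASE,
)
# A trailing ": ..." part containing an edition word, e.g. ": Gift Pack".
_COLON_SUFFIX = re.compile(
    r'\s*:[^:]*' + EDITION_WORDS + r'[^:]*$',
    re.IGNORECASE,
)

def strip_edition_info(book_name):
    """Remove only the parts of the title that look like edition details."""
    without_brackets = _BRACKETED.sub('', book_name)
    return _COLON_SUFFIX.sub('', without_brackets).strip()

print(strip_edition_info("The Hobbit: Gift Pack"))
print(strip_edition_info("Some Series: Part Two (Illustrated Edition)"))
```

The trade-off is that a genuine subtitle such as "or, There and Back Again" is left in place, so different editions of the same book may still need a further matching step.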
