Normalising book titles - Python
I have a list of book titles:
- "The Hobbit: 70th Anniversary Edition"
- "The Hobbit"
- "The Hobbit (Illustrated/Collector Edition)[There and Back Again]"
- "The Hobbit: or, There and Back Again"
- "The Hobbit: Gift Pack"
and so on...
I thought that if I normalised the titles somehow, it would be easier to implement an automated way to know what book each edition is referring to.
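    import string

    # Attempt 1: keep only ASCII letters and digits
    normalised = ''.join(char for char in title
                         if char in string.ascii_letters + string.digits)

or

    # Attempt 2: cut the title at the first separator character
    def normalise(title):
        normalised = ''
        for char in title:
            if char in ':/()|':
                break
            normalised += char
        return normalised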
But obviously they are not working as intended, as titles can contain special characters and editions can basically have very different title layouts.
Help would be very much appreciated! Thanks :)
Comments (2)
It depends completely on your data. For the examples you gave, a simple normalization solution could be:
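A minimal sketch of such a normalisation, assuming Python's standard `re` module. The function name `normalise_title` and the exact pattern are illustrative assumptions, written to match the description that follows rather than taken from the answer itself:

```python
import re

def normalise_title(title):
    # Drop anything enclosed in (), [] or {} brackets.
    title = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", "", title)
    # Drop everything from the first colon onwards.
    title = title.split(":", 1)[0]
    # Remove leading and trailing whitespace.
    return title.strip()

examples = [
    "The Hobbit: 70th Anniversary Edition",
    "The Hobbit",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    "The Hobbit: or, There and Back Again",
    "The Hobbit: Gift Pack",
]
normalised = [normalise_title(t) for t in examples]
```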
This returns "The Hobbit" for all the examples. It removes everything from the first colon onwards and anything in brackets (round, square or curly), then strips leading and trailing spaces.
However, this is not a very good solution in the general case, as some books have colons or bracketed parts in the actual book name: for example, the name of a series, followed by a colon, followed by the title of a particular entry in that series.
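To make that limitation concrete, here is a rough sketch that groups titles by their normalised form. The helper names and the "Example Series" titles are hypothetical, and the normalisation rule is only an assumed implementation of the colon-and-bracket stripping described above:

```python
import re
from collections import defaultdict

def normalise_title(title):
    # Assumed rule: drop bracketed parts, then everything from the
    # first colon onwards, then surrounding whitespace.
    title = re.sub(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}", "", title)
    return title.split(":", 1)[0].strip()

def group_by_normalised(titles):
    # Map each normalised key to the original titles that produced it.
    groups = defaultdict(list)
    for title in titles:
        groups[normalise_title(title)].append(title)
    return dict(groups)

titles = [
    "The Hobbit: Gift Pack",
    "The Hobbit (Illustrated/Collector Edition)[There and Back Again]",
    # Hypothetical series entries: two different books sharing a series name.
    "Example Series: First Book",
    "Example Series: Second Book",
]
groups = group_by_normalised(titles)
```

Both Hobbit editions end up under one key, which is the desired result, but the two distinct series entries are also merged under "Example Series", which is not.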
I would suggest using a third-party web service such as LibraryThing, which I believe can do what you're asking. As a starting point, see their documentation:
http://www.librarything.com/services/rest/documentation/1.0/librarything.ck.getwork.php