Finding duplicate books
I have several lists of book titles along with their authors (no ISBNs). I want to maintain a single list of unique books and remove the duplicate entries for each book.
The problem is that different lists may follow different conventions for storing a book's entry. For example, one list might store the author name in "last name first name" order, while in another list the title entry itself contains additional information, such as the name of the series and the sequence number.
Is there a standard algorithm for this kind of problem? I don't want to reinvent the wheel. I am currently writing the solution in PHP. As a start, I have tried levenshtein, soundex, metaphone, and similar_text, but none of them looks promising to me.
Example: Consider the Inheritance Cycle, a series of four books. The entry for the second book of the series can be "Eldest", "Eldest: The Inheritance Cycle (Book 2)", "Eldest (Inheritance)", "Eldest (Inheritance Cycle)", or "Inheritance 002: Eldest".
Comments (1)
This sounds like a search problem, just with a more constrained domain. I would probably use an existing search technology (such as Lucene or Solr) and iterate through the list: search for a match first, and if a sufficiently close one isn't found, add the "document" (the information you have for one book) to the index.
It won't be a perfect answer, but it will give you a score for each candidate match, so you get some tunable parameters to work with. This is an especially attractive solution if this is more than a one-off problem, since the "algorithm" can learn and tune itself as it goes if needed.
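To make the search-then-add loop concrete, here is a minimal sketch. It is written in Python rather than PHP, and it does not use Lucene or Solr: a plain in-memory list stands in for the search index, and an overlap coefficient on word sets stands in for the engine's relevance score. The names tokens, score, and dedupe, the NOISE word list, and the 0.8 threshold are all illustrative assumptions, not part of any library. The author names in the sample data are added for illustration; the question only lists titles.

```python
import re

# Hypothetical noise words; tune this set for your own lists.
NOISE = {"the", "a", "an", "of", "book", "vol", "volume"}


def tokens(text):
    """Lowercase, split on non-alphanumerics, drop noise words and bare numbers."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return frozenset(w for w in words if w not in NOISE and not w.isdigit())


def score(a, b):
    """Overlap coefficient: shared tokens divided by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def dedupe(entries, threshold=0.8):
    """For each (title, author) entry, search the index; add it only if nothing matches."""
    index = []  # each item: (title_tokens, author_tokens, original_entry)
    for title, author in entries:
        t, a = tokens(title), tokens(author)
        # A candidate must match on both title and author, so take the weaker score.
        best = max(
            (min(score(t, it), score(a, ia)) for it, ia, _ in index),
            default=0.0,
        )
        if best < threshold:
            index.append((t, a, (title, author)))
    return [entry for _, _, entry in index]


books = [
    ("Eldest", "Christopher Paolini"),
    ("Eldest: The Inheritance Cycle (Book 2)", "Paolini, Christopher"),
    ("Eldest (Inheritance)", "Christopher Paolini"),
    ("Inheritance 002: Eldest", "Paolini, Christopher"),
    ("Eragon", "Christopher Paolini"),
]
unique_books = dedupe(books)
print(unique_books)
```

Because author names are compared as unordered word sets, "Paolini, Christopher" and "Christopher Paolini" score as identical. The sketch has obvious limits: it is order-dependent, an entry with an empty author never matches anything, and a book whose whole title is also the series name (here, a book titled just "Inheritance") would be folded into any entry that mentions the series. A real search engine typically weights rare terms more heavily than common ones, which is one reason to prefer it over a hand-rolled score.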