How to use Bayesian analysis to compute and combine the weights of multiple rules to identify books

Posted 2024-12-09 15:18:07

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.

Some are obvious to the human reader, like:

  • Artificial Intelligence - A Modern Approach 3rd.pdf
  • Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
  • The Complete Guide to PC Repair 5th Ed [2011].pdf
  • Hamlet.txt

Others are not so obvious:

  • Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
  • as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)

Rather than trying to code separate parsers for each file-name format, I thought I would build a few dozen simple rules, each with a score.

For example, one rule would look in the first few pages of the file for something resembling an ISBN and, if one is found, would propose the hypothesis that the file corresponds to the book identified by that ISBN.
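A minimal sketch of what such an ISBN rule might look like, assuming the first pages have already been extracted to plain text. The function names and the "isbn" rule name are made up for illustration; the checksum formulas are the standard ISBN-10 and ISBN-13 ones.

```python
import re

# Candidate ISBN-10 / ISBN-13 strings, allowing hyphens or spaces between digits.
ISBN_PATTERN = re.compile(r"\b(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(digits: str) -> bool:
    """ISBN-10 check: weights 10..1, total divisible by 11 ('X' counts as 10)."""
    if len(digits) != 10 or not digits[:9].isdigit():
        return False
    if not (digits[9].isdigit() or digits[9] in "Xx"):
        return False
    total = sum((10 - i) * int(c) for i, c in enumerate(digits[:9]))
    check = 10 if digits[9] in "Xx" else int(digits[9])
    return (total + check) % 11 == 0


def is_valid_isbn13(digits: str) -> bool:
    """ISBN-13 check: alternating weights 1 and 3, total divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(digits))
    return total % 10 == 0


def isbn_rule(text: str) -> list[tuple[str, str]]:
    """Return (rulename, hypothesis) tuples for every checksum-valid ISBN in text."""
    hypotheses = []
    for match in ISBN_PATTERN.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if is_valid_isbn10(digits) or is_valid_isbn13(digits):
            hypotheses.append(("isbn", digits.upper()))
    return hypotheses
```

Checking the checksum filters out most random digit runs, which is part of why this rule can be trusted more than the others. Looking the ISBN up in a database is left out of the sketch.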

Another rule would check whether the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules would handle other formats.
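The 'Author - Title' rule could be sketched roughly like this. The function name, the "author-title" rule name and the shape of the hypothesis (an (author, title) pair) are all assumptions for illustration.

```python
from pathlib import Path


def author_title_rule(filename: str) -> list[tuple[str, tuple[str, str]]]:
    """Propose (author, title) if the file name looks like 'Author - Title.ext'."""
    stem = Path(filename).stem  # drop the final extension only
    left, sep, right = stem.partition(" - ")
    author, title = left.strip(), right.strip()
    if not sep or not author or not title:
        return []  # the rule does not fire
    return [("author-title", (author, title))]
```

Note that on 'Artificial Intelligence - A Modern Approach 3rd.pdf' this rule fires and proposes 'Artificial Intelligence' as the author, which is wrong. That is exactly the kind of rule that needs a modest score rather than being trusted outright.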

I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
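As a rough sketch of the known-titles rule, assuming the title list has already been downloaded into a plain Python list (how to fetch it from Amazon or an ISBN database is not shown). The names `normalize`, `catalog_rule` and the "catalog" rule name are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric words, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def catalog_rule(text: str, known_titles: list[str]) -> list[tuple[str, str]]:
    """Propose every known title that appears as a whole-word phrase in text."""
    haystack = f" {normalize(text)} "  # pad so matches align on word boundaries
    hits = []
    for title in known_titles:
        needle = normalize(title)
        if needle and f" {needle} " in haystack:
            hits.append(("catalog", title))
    return hits
```

Short or generic titles will match a lot of unrelated files, so this rule fires often but deserves a lower score than an ISBN match. It also does nothing for abbreviated names such as as.ar.pdf.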

In the end I would have a set of tuples like this:

[rulename,hypothesis]

I expect that some rules, such as the ISBN match, will have a high probability of being correct when they apply. Other rules, like matches based on known book titles and authors, would fire more often but be less accurate.
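One way to put numbers on "probability of being correct" is to hand-label a sample of files and count how often each rule's hypothesis was right. The sketch below assumes such a labeled sample exists as (rulename, was_correct) pairs and uses add-one (Laplace) smoothing so that a rule seen only a few times does not get a probability of exactly 0 or 1. The function name and the `alpha` parameter are illustrative.

```python
from collections import Counter


def estimate_precision(labeled, alpha: float = 1.0) -> dict[str, float]:
    """Estimate P(hypothesis correct | rule fired) per rule, with smoothing.

    labeled: iterable of (rulename, was_correct) pairs from hand-checked files.
    alpha:   pseudo-count added to both the 'correct' and 'wrong' outcomes.
    """
    fired = Counter()
    correct = Counter()
    for rule, was_correct in labeled:
        fired[rule] += 1
        if was_correct:
            correct[rule] += 1
    return {
        rule: (correct[rule] + alpha) / (fired[rule] + 2 * alpha)
        for rule in fired
    }
```

For instance, a rule that was right 9 times out of 10 gets (9 + 1) / (10 + 2) ≈ 0.83 rather than 0.9, reflecting the small sample.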

My questions are:

  1. Is this a good approach for solving this problem?
  2. If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score to help determine which hypothesis is the strongest, or most likely?
  3. Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
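To make question 2 concrete, here is a minimal sketch of one possible Bayesian-style combination: a naive-Bayes odds update, in which every rule that proposes a hypothesis multiplies that hypothesis's odds by a likelihood ratio. It assumes the rules are independent given the true book, which is unlikely to hold exactly. The likelihood ratios and the prior below are made-up illustrative numbers, not measured values.

```python
import math
from collections import defaultdict

# Hypothetical likelihood ratios: how many times more likely a rule is to
# propose a hypothesis when it is true than when it is false.
LIKELIHOOD_RATIO = {"isbn": 200.0, "author-title": 5.0, "catalog": 3.0}


def combine(evidence, prior: float = 0.01) -> dict[str, float]:
    """Combine [rulename, hypothesis] tuples into a posterior per hypothesis.

    Each hypothesis starts at the same prior probability; every rule that
    proposes it adds log(likelihood ratio) to its log-odds (naive Bayes,
    i.e. rules are treated as independent).
    """
    prior_log_odds = math.log(prior / (1 - prior))
    log_odds = defaultdict(lambda: prior_log_odds)
    for rule, hypothesis in evidence:
        log_odds[hypothesis] += math.log(LIKELIHOOD_RATIO[rule])
    # Convert log-odds back to probabilities with the logistic function.
    return {h: 1 / (1 + math.exp(-lo)) for h, lo in log_odds.items()}
```

Each hypothesis is scored on its own here, so the resulting probabilities are not normalized to sum to 1 across competing hypotheses. They are only meant for ranking.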

Comments (1)

汹涌人海 2024-12-16 15:18:07

It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get generalization good enough to save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules, before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, rather than to implement the classifier. But if the purpose is learning, then go ahead.
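The "suggest alternatives" route could be sketched like this: no training, just group the [rulename, hypothesis] tuples by hypothesis and show the ones backed by the most rules first. The function name and the `limit` parameter are hypothetical.

```python
from collections import defaultdict


def suggest_titles(evidence, limit: int = 5):
    """Return up to `limit` (hypothesis, supporting_rules) pairs for manual review.

    Hypotheses proposed by more distinct rules come first; ties keep the
    order in which the hypotheses were first proposed.
    """
    supporters = defaultdict(list)
    for rule, hypothesis in evidence:
        if rule not in supporters[hypothesis]:
            supporters[hypothesis].append(rule)
    ranked = sorted(supporters.items(), key=lambda item: len(item[1]), reverse=True)
    return ranked[:limit]
```

The person then picks the right entry from the short list, and those choices could later double as labeled data if a classifier is attempted after all.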
