Reconstructing the now-famous 17-year-old's Markov-chain-based information retrieval algorithm, "Apodora"
While we were all twiddling our thumbs, a 17-year-old Canadian boy has apparently found an information retrieval algorithm that:
a) performs with twice the precision of the current, and widely-used vector space model
b) is 'fairly accurate' at identifying similar words.
c) makes microsearch more accurate
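For reference, the vector space model that his algorithm is being compared against is easy to sketch: documents and queries become TF-IDF weight vectors, and documents are ranked by cosine similarity to the query. This is a minimal toy version of the standard baseline (the corpus and tokenization are my invention, nothing to do with his system):

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors (as sparse dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                      # document frequency per term
    idf = {w: math.log(n / df[w]) for w in df}   # inverse document frequency
    vecs = [{w: c * idf[w] for w, c in Counter(doc).items()} for doc in docs]
    return vecs, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "markov chains model word transitions".split(),
    "vector space model ranks documents".split(),
    "chains of word connections".split(),
]
vecs, idf = tfidf(docs)
# Weight the query terms by the corpus IDF, then rank by cosine similarity.
query = {w: idf.get(w, 0.0) for w in "markov word chains".split()}
ranking = sorted(range(len(docs)), key=lambda i: cosine(query, vecs[i]), reverse=True)
```

Note the weakness his algorithm supposedly addresses: a pure vector space model only scores documents on terms they literally share with the query, with no notion of following connections between related words.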
Here is a good interview.
Unfortunately, there's no published paper I can find yet, but, from the snatches I remember from the graphical models and machine learning classes I took a few years ago, I think we should be able to reconstruct it from his submission abstract, and what he says about it in interviews.
From interview:
Some searches find words that appear in similar contexts. That’s
pretty good, but that’s following the relationships to the first
degree. My algorithm tries to follow connections further. Connections
that are close are deemed more valuable. In theory, it follows
connections to an infinite degree.
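One plausible reading of "follows connections to an infinite degree", with closer connections "deemed more valuable", is a geometrically damped sum over all powers of a word-transition matrix: the Neumann series (I − αP)⁻¹ = Σₖ αᵏ Pᵏ, where paths of length k are discounted by αᵏ. This is my speculation about what he might mean, not his published method; the matrix and the damping factor below are toy assumptions:

```python
import numpy as np

# Toy row-stochastic "word connection" matrix P: P[i, j] is the
# probability of stepping from word i to word j (made-up numbers).
P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
])
alpha = 0.5  # damping: a path of length k contributes with weight alpha**k

# Explicit truncated sum of alpha^k * P^k over path lengths k...
K = 50
S_sum = sum((alpha ** k) * np.linalg.matrix_power(P, k) for k in range(K))

# ...which converges to the closed form (I - alpha * P)^-1,
# i.e. "following connections to an infinite degree" in one matrix inverse.
S_closed = np.linalg.inv(np.eye(3) - alpha * P)
```

The closed form is the same trick that makes PageRank-style computations tractable: you never enumerate paths, you just solve a linear system.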
And the abstract puts it in context:
A novel information retrieval algorithm called "Apodora" is introduced,
using limiting powers of Markov chain-like matrices to determine
models for the documents and making contextual statistical inferences
about the semantics of words. The system is implemented and compared
to the vector space model. Especially when the query is short, the
novel algorithm gives results with approximately twice the precision
and has interesting applications to microsearch.
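The abstract's "limiting powers of Markov chain-like matrices" suggests computing limₙ→∞ Pⁿ: for an ergodic chain, every row of Pⁿ converges to the same stationary distribution π, which could then serve as a per-document model over words. A minimal illustration of that limit (the matrix is invented; how he actually builds P from a document is exactly the open question):

```python
import numpy as np

# Toy row-stochastic transition matrix for a 3-word "document model".
P = np.array([
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.3, 0.3, 0.4],
])

# Limiting power: for an ergodic chain, P^n converges to a matrix
# whose rows are all the stationary distribution pi (pi = pi @ P).
P_inf = np.linalg.matrix_power(P, 100)
pi = P_inf[0]
```

In other words, the "limiting power" forgets the starting word entirely and leaves only a distribution over the vocabulary, one candidate for the "models for the documents" the abstract mentions.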
I feel like someone who knows about markov-chain-like matrices or information retrieval would immediately be able to realize what he's doing.
So: what is he doing?
Comments (1)
From the use of words like 'context' and the fact that he's introduced a second order level of statistical dependency, I suspect he is doing something related to the LDA-HMM method outlined in the paper: Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2005). Integrating topics and syntax. Advances in Neural Information Processing Systems. There are some inherent limits to the resolution of the search due to model averaging. However, I'm envious of doing stuff like this at 17 and I hope to heck he's done something independent and at least incrementally better. Even a different direction on the same topic would be pretty cool.