Information Retrieval (IR), Data Mining, Machine Learning (ML)

Posted on 2024-09-13 18:58:44

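People often use the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.

For people with experience in these fields, what exactly is the difference between them?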

不必在意 2024-09-20 18:58:44

This is just the view of one person (formally trained in ML); others might see things quite differently.

Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.

Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to describe their curricula and research programs, as well as the term most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the terms you mentioned.

Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications they often are, but that's not a formal requirement. In addition, the term Data Mining usually seems to refer to applying some process flow to big data (i.e., > 2 GB) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
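To make the "map-reduce component near the front of the workflow" a bit more concrete, here is a minimal single-machine sketch in Python. It is only an illustration, not how any particular distributed framework works: the function names (`map_chunk`, `combine`, `describe`) are made up for this example, and the chunks stand in for data partitions that a real system would process on separate workers.

```python
from functools import reduce
import math


def map_chunk(values):
    """Map step: summarize one chunk as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def describe(chunks):
    """Return (mean, population standard deviation) over all chunks."""
    n, total, total_sq = reduce(combine, map(map_chunk, chunks), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no data")
    mean = total / n
    # Guard against tiny negative values caused by floating-point rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


# Hypothetical data, already split into two partitions.
mean, std = describe([[1.0, 2.0], [3.0, 4.0]])
```

Because each partial summary is small and merging is associative, the chunks can be summarized independently and combined in any order. The sum-of-squares formula is chosen for brevity; it can lose precision on large or badly scaled data, where a more careful streaming variance method would be preferable.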

So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval, but it's only one source, and IR doesn't depend on ML. For instance, a particular IR project might be storage and rapid retrieval of fully indexed data in response to a user's search query, the crux of which is optimizing performance of the data flow, i.e., the round trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (terabytes, perhaps), which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
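As a rough illustration of an IR task that involves no prediction at all, the following Python sketch builds a tiny inverted index and answers boolean AND queries against it. Everything here is an assumption for the sake of the example (the whitespace tokenizer, the sample documents, and the names `build_index` and `search`); a production search system would add ranking, stemming, persistence, and much more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical documents keyed by id.
docs = {
    1: "machine learning for search ranking",
    2: "fast retrieval of indexed data",
    3: "indexed search over large data",
}
index = build_index(docs)
matches = search(index, "indexed data")
```

The work here is all lookup and set intersection: the performance question is how fast the round trip from query to result is, not how accurate a model is.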

Lastly, consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the "Netflix Prize" were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately assess the algorithm's performance in the actual commercial setting--so, for instance, overall execution speed (how quickly the recommendations are delivered to the user) would probably be considered along with accuracy.
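The contrast can be sketched in a few lines of Python. The Netflix Prize scored accuracy as root-mean-square error (RMSE); the `evaluate` helper below is a hypothetical illustration of what a broader, "data mining" style evaluation might add, reporting per-prediction latency next to RMSE. The names, the shape of the returned dictionary, and the toy predictor are assumptions for this example only.

```python
import math
import time


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def evaluate(predict, inputs, actual):
    """Score a predictor on accuracy (RMSE) and on average latency."""
    start = time.perf_counter()
    predicted = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return {
        "rmse": rmse(predicted, actual),
        "seconds_per_prediction": elapsed / len(inputs),
    }


# Toy predictor that always guesses a rating of 3.0 (purely illustrative).
report = evaluate(lambda user_movie: 3.0, [("u1", "m1"), ("u2", "m2")], [4.0, 2.0])
```

A pure ML contest would rank entries on the `rmse` field alone; a contest framed around the whole commercial process would also have to decide how to weigh `seconds_per_prediction` against it.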

The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
