Information retrieval (IR), data mining, machine learning (ML)
People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
For people with experience in these fields: what exactly draws the line between them?
Comments (4)
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied: it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to describe their curricula and research programs, as well as the term most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes, i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications often are, but that's not a formal requirement. In addition, the term Data Mining usually seems to refer to applying some process flow to big data (i.e., > 2 GB) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an infrastructure-to-algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval, but only one. IR doesn't depend on ML. For instance, a particular IR project might be the storage and rapid retrieval of fully indexed data in response to a user's search query, the crux of which is optimizing the performance of the data flow, i.e., the round trip from query to delivery of the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet it is more likely to also be concerned with the entire processing flow: for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps), which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
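To make the "IR without ML" case concrete, here is a minimal sketch of an inverted index with boolean AND search. Nothing in it is learned or predicted; it only stores and looks up. The function names (`build_index`, `search`) and the toy documents are made up for illustration and are not from any particular library.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query token (boolean AND)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical toy corpus.
docs = {
    1: "machine learning generalizes to new data",
    2: "information retrieval finds indexed data fast",
    3: "data mining discovers hidden patterns",
}
index = build_index(docs)
print(search(index, "data"))          # every document mentioning "data"
print(search(index, "indexed data"))  # only documents containing both tokens
```

A real system would spend most of its engineering effort on what surrounds this lookup (tokenization, ranking, storage layout, latency), which is the point made above about the data flow being the crux.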
Lastly, consider the Netflix Prize. This competition was directed solely at Machine Learning: the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion, namely the accuracy of the predictions returned by the algorithm. Imagine if the "Netflix Prize" were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately assess the algorithm's performance in the actual commercial setting, so, for instance, overall execution speed (how quickly the recommendations are delivered to the user) would probably be considered along with accuracy.
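As a rough illustration of that difference in success criteria, the sketch below scores a predictor on accuracy alone (RMSE, the metric the Netflix Prize used) and then also reports time per prediction, as a more deployment-minded evaluation might. The `evaluate` helper, the constant predictor, and the ratings are all hypothetical.

```python
import math
import time


def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def evaluate(predict, inputs, actual):
    """Report accuracy (RMSE) together with average wall-clock time per prediction."""
    start = time.perf_counter()
    predicted = [predict(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return {
        "rmse": rmse(predicted, actual),
        "seconds_per_prediction": elapsed / len(inputs),
    }


def constant_predictor(_user_movie_pair):
    """Stand-in model (an assumption for this sketch): always predicts 3.5 stars."""
    return 3.5


# Made-up (user, movie) pairs and their true ratings.
inputs = [("u1", "m1"), ("u1", "m2"), ("u2", "m1")]
actual = [4.0, 3.0, 5.0]

report = evaluate(constant_predictor, inputs, actual)
print(report)
```

An ML-only competition would rank entries by `rmse` alone; a competition framed around the full process would also weigh `seconds_per_prediction` or similar operational measures.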
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning is a set of techniques for generalizing existing knowledge to new data, as accurately as possible.
Data mining is primarily about discovering something hidden in your data that you did not know before, as "new" as possible.
They intersect and often use one another's techniques. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques; for example, a pattern in the data set that is useful for generalization might itself be new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
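The three intentions above can be sketched on one toy data set. This is only an illustration under made-up assumptions: the shopping baskets, the labels, and the choice of techniques (a linear scan for retrieval, 1-nearest-neighbour with Jaccard similarity for generalization, pair counting for discovery) are mine, not a canonical definition of any of the fields.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: (items in a shopping basket, customer type).
baskets = [
    ({"bread", "milk"}, "family"),
    ({"bread", "milk", "eggs"}, "family"),
    ({"beer", "chips"}, "student"),
    ({"beer", "chips", "bread"}, "student"),
    ({"beer", "chips", "salsa"}, "student"),
]


def find(item):
    """IR: find data that is already there (positions of baskets containing item)."""
    return [i for i, (items, _) in enumerate(baskets) if item in items]


def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)


def predict(new_items):
    """ML: generalize to new data by copying the label of the most similar basket."""
    _, label = max(baskets, key=lambda record: jaccard(record[0], new_items))
    return label


def most_common_pair():
    """DM: discover a property of the existing data (the most frequent item pair)."""
    counts = Counter()
    for items, _ in baskets:
        counts.update(combinations(sorted(items), 2))
    return counts.most_common(1)[0]


print(find("bread"))
print(predict({"milk", "eggs"}))
print(most_common_pair())
```

Note how the pieces overlap in practice, as said above: the pair counts found by `most_common_pair` could feed a predictor, and `find` would use an index rather than a scan on real data.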
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which people can then use for decision making.
Machine learning is about learning a model to classify new objects.