What subjects and topics does a computer science graduate need to learn to apply available machine learning frameworks, especially SVMs?
I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.).
2. Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand).
3. ... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (I do not know enough to decide which, though), and to be able to fix things should they go wrong.
I fully expect recommendations to learn specific portions of statistics and probability theory, so do say so if required!
I will modify this question if needed, depending on all your suggestions and feedback.
Comments (4)
"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods works best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step because it requires domain knowledge: you have to extract useful information, otherwise your classifiers will not be able to make good distinctions. Feature extraction gives you a dataset (a matrix in which each row holds the features of one example) from which you can create your model.
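To make the feature extraction step concrete, here is a minimal sketch using only Python's standard `html.parser` module. The four features (tag count, link count, visible text length, link density) are my own assumptions for illustration, not a recommended feature set; real features would come from domain knowledge about the pages.

```python
from html.parser import HTMLParser


class PageFeatures(HTMLParser):
    """Collect a few crude, hand-picked features from one HTML page."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0    # total number of opening tags
        self.link_count = 0   # number of <a> tags
        self.text_chars = 0   # characters of visible text
        self.link_chars = 0   # characters of text inside <a> tags
        self._in_link = 0     # nesting depth of <a>
        self._skip = 0        # nesting depth of <script>/<style>

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a":
            self.link_count += 1
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # script and style bodies are not visible text
        n = len(data.strip())
        self.text_chars += n
        if self._in_link:
            self.link_chars += n


def extract_features(html):
    """Return one dataset row: [tags, links, text length, link density]."""
    parser = PageFeatures()
    parser.feed(html)
    parser.close()
    density = parser.link_chars / parser.text_chars if parser.text_chars else 0.0
    return [parser.tag_count, parser.link_count, parser.text_chars, density]


pages = [
    "<html><body><p>A long article paragraph about kernels.</p></body></html>",
    "<html><body><a href='/x'>Buy now</a><a href='/y'>Click here</a></body></html>",
]
dataset = [extract_features(p) for p in pages]  # one row per page
for row in dataset:
    print(row)
```

Each row of `dataset` is what a classifier would see instead of the raw HTML, which is why the choice of features matters so much.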
In machine learning it is generally advised to also keep a "test set" that you do not train your models with, but use only at the end to decide which method is best. It is extremely important that you keep the test set hidden until the very end of your modeling step! The test data gives you an estimate of the "generalization error" that your model makes. Any model with enough complexity and learning time tends to learn exactly the information you train it with; machine learners say that such a model "overfits" the training data. Overfitted models look good on the training data, but this is just memorization.
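A small sketch of why the held-out test set matters, using made-up synthetic data (one feature, 20% of labels deliberately flipped). The "memorizer" is a 1-nearest-neighbour lookup standing in for an overly complex model; the threshold rule is fixed by hand for brevity rather than learned. Both are assumptions chosen purely for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def make_example():
    """One feature x in [0, 1); the true rule is x > 0.5, with 20% of labels flipped."""
    x = rng.random()
    label = x > 0.5
    if rng.random() < 0.2:
        label = not label
    return x, label


data = [make_example() for _ in range(200)]
rng.shuffle(data)
train, test = data[:150], data[150:]  # the test set stays untouched until the end


def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    return min(train, key=lambda example: abs(example[0] - x))[1]


def simple_rule(x):
    """A low-complexity model: a single threshold."""
    return x > 0.5


def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)


for name, model in [("memorizer", memorizer), ("simple rule", simple_rule)]:
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer reproduces every training label, noise included, so its training accuracy says nothing about how it will do on new pages. Only the numbers on `test` give a hint of the generalization error.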
While software support for preprocessing data is very sparse and highly domain dependent, Weka, as adam mentioned, is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology here helps you find your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
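One way to read that simplification, as a rough sketch: split the page into block-level chunks, drop the ones a heuristic says are too short to matter, and hand only the survivors to a classifier. The set of block tags, the 25-character threshold, and the link-density feature below are all assumptions of mine, and the parser ignores mismatched closing tags.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "p", "td", "li", "section", "article"}  # assumed choice


class BlockCollector(HTMLParser):
    """Split a page into block-level chunks, crediting text to the innermost open block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks: dicts with tag, text, link_chars
        self._stack = []   # blocks that are currently open
        self._in_link = 0  # nesting depth of <a>

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._stack.append({"tag": tag, "text": "", "link_chars": 0})
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS and self._stack and self._stack[-1]["tag"] == tag:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        if not self._stack:
            return
        block = self._stack[-1]
        block["text"] += data
        if self._in_link:
            block["link_chars"] += len(data)


def candidate_blocks(html, min_chars=25):
    """Heuristic preselection: keep only blocks with enough text to be worth classifying."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    result = []
    for block in collector.blocks:
        text = " ".join(block["text"].split())
        if len(text) >= min_chars:
            density = block["link_chars"] / max(len(block["text"]), 1)
            result.append((block["tag"], text, density))
    return result


page = (
    "<div><p>Support vector machines separate classes with a margin.</p>"
    "<div><a href='/ad'>Cheap flights, book your holiday today!</a></div>"
    "<p>By A. Author</p></div>"
)
candidates = candidate_blocks(page)
for tag, text, density in candidates:
    print(tag, round(density, 2), text)
```

Each surviving block then becomes one example for an ordinary classifier (content, ad, or metadata), which turns the structure learning problem into the "easy" classification case.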
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there weren't a lot of tools available for it as well.
For text-based classification, Naive Bayes, Decision Trees (J48 in particular, I think), and SVM approaches are giving the best results right now. However, they are each better suited to slightly different applications. Off the top of my head, I'm not sure which would suit you best. With a tool like WEKA you could try all three approaches on some example data without writing a line of code and see for yourself.
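For a feel of what the simplest of these does under the hood, here is a minimal multinomial Naive Bayes sketch in plain Python with add-one (Laplace) smoothing. It is not how WEKA implements it; the whitespace tokenizer and the four training snippets are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts needed for prediction."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocabulary = set()
    for text, label in examples:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return doc_counts, word_counts, vocabulary


def predict(model, text):
    """Pick the label with the highest log posterior, using add-one smoothing."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            count = word_counts[label][word] + 1  # Laplace smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("cheap flights book now", "ad"),
    ("buy cheap watches now", "ad"),
    ("the kernel maps inputs to a feature space", "content"),
    ("support vectors define the margin", "content"),
]
model = train_naive_bayes(training)
print(predict(model, "book cheap watches"))        # ad
print(predict(model, "the margin of the kernel"))  # content
```

With real data you would compare this against a decision tree and an SVM on a held-out set rather than trust any one of them up front.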
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
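To illustrate what "not probabilistic" means here: a linear SVM produces a signed score whose sign is the predicted class and whose magnitude relates to the distance from the separating hyperplane, not a probability between 0 and 1. The weights and bias below are hypothetical, standing in for a model that has already been trained.

```python
import math

# Hypothetical weights and bias, as if a linear SVM had already been trained.
WEIGHTS = [1.0, 1.0]
BIAS = -3.0


def decision_value(x):
    """Signed score w . x + b; its sign is the predicted class."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def predict(x):
    return 1 if decision_value(x) >= 0 else -1


def distance_to_boundary(x):
    """Geometric distance from x to the separating hyperplane."""
    return abs(decision_value(x)) / math.sqrt(sum(w * w for w in WEIGHTS))


for point in ([4.0, 4.0], [2.0, 1.5], [0.0, 0.0]):
    print(point, predict(point), decision_value(point))
```

The scores here are 5.0, 0.5 and -3.0: useful for ranking how confidently a point falls on one side, but not directly interpretable as probabilities without an extra calibration step.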
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without first surveying existing techniques.
It seems to me that you are just entering the machine learning field, so I'd really suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), it also provides a very good set of exercises and links to scientific papers. All of this is written in insightful language and accompanied by a minimal yet useful compendium of statistics and probability.