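What subjects and topics does a computer science graduate need to learn in order to apply available machine learning frameworks, especially SVMs?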

Posted on 2024-09-25 00:41:27

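I want to teach myself enough machine learning that I can, to begin with, understand and use the available open source ML frameworks, which would let me do things like:

1. Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither content nor ads, e.g. the TOC, author bio, etc.).
2. Go through the HTML source of pages from disparate sites and "classify" whether each site belongs to a predefined category or not (the list of categories will be supplied beforehand).
3. ...similar classification tasks on text and pages.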

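As you can see, my immediate requirements have to do with classification across disparate data sources and large amounts of data.

As far as my limited understanding goes, taking the neural network approach would require a lot more training and maintenance than putting SVMs to use?

I understand that SVMs are well suited to (binary) classification tasks like mine, and that open source frameworks like libSVM are fairly mature?

In that case, what subjects and topics does a computer science graduate need to learn right now so that the above requirements can be met by putting these frameworks to use?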

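I would like to stay away from Java if possible, and I have no other language preferences. I am willing to learn and to put in as much effort as I possibly can.

My intent is not to write code from scratch but, to begin with, to put the various available frameworks to use (although I don't know enough to decide which one), and I should be able to fix things if they go wrong.

Recommendations to learn specific portions of statistics and probability theory would not surprise me, so say so if that is required!

I'll modify this question if needed, depending on all your suggestions and feedback.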

等待我真够勒 2024-10-02 00:41:27

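"Understanding" in machine learning is the equivalent of having a model. The model can be, for example, a collection of support vectors, the layout and weights of a neural network, a decision tree, or something else. Which of these methods works best really depends on the subject you are learning from and on the quality of your training data.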
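To make "having a model" a bit more concrete, here is a minimal sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. After training, the whole "model" is just a weight vector and a bias. The function names, the toy data, and the hyperparameter values are my own assumptions for illustration; this is not how libSVM or Weka implement SVMs internally.

```python
def train_linear_svm(rows, labels, reg=0.01, learning_rate=0.1, epochs=100):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    Labels must be +1 or -1. All defaults are illustrative assumptions.
    """
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: the hinge loss pushes
                # w and b toward classifying this example correctly.
                w = [wj - learning_rate * (reg * wj - y * xj)
                     for wj, xj in zip(w, x)]
                b += learning_rate * y
            else:
                # Classified with enough margin: only the regularizer acts.
                w = [wj - learning_rate * reg * wj for wj in w]
    return w, b


def predict(model, x):
    """Apply the learned model: the sign of the linear score."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data with two features per example.
rows = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
        [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
labels = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(rows, labels)
predictions = [predict(model, x) for x in rows]
print("weights and bias:", model)
print("predictions:", predictions)
```

In practice you would hand this job to a mature library rather than write it yourself; the point is only that the trained model is a small set of numbers that you can inspect.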

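In your case, learning from a collection of HTML pages, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information from the page you are looking at. This is a difficult step because it requires domain knowledge: you have to extract useful information, otherwise your classifiers will not be able to make good distinctions. Feature extraction gives you a dataset (a matrix in which each row holds the features of one example) from which you can build your model.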
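As a rough sketch of what feature extraction from HTML could look like, the snippet below uses Python's standard `html.parser` module to turn one page into one row of numbers. The particular features (link count, image count, script count, visible text length, share of text inside links) are only assumptions about what might help separate content from ads; choosing good features is exactly the domain-knowledge part.

```python
from html.parser import HTMLParser


class PageFeatureParser(HTMLParser):
    """Collects simple counts while the standard-library parser walks a page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}
        self.text_chars = 0        # visible text characters
        self.link_text_chars = 0   # visible text characters inside <a>
        self._link_depth = 0
        self._skip_depth = 0       # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1
        if tag == "a":
            self._link_depth += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # script and style bodies are not visible text
        n = len(data.strip())
        self.text_chars += n
        if self._link_depth:
            self.link_text_chars += n


def extract_features(html):
    """Return one feature row (a list of floats) for one HTML page."""
    parser = PageFeatureParser()
    parser.feed(html)
    parser.close()
    link_density = (parser.link_text_chars / parser.text_chars
                    if parser.text_chars else 0.0)
    return [
        float(parser.tag_counts.get("a", 0)),
        float(parser.tag_counts.get("img", 0)),
        float(parser.tag_counts.get("script", 0)),
        float(parser.text_chars),
        link_density,
    ]


# Hypothetical page used only to show the shape of the output.
page = ('<html><body><p>Hello world</p><a href="/buy">ad</a>'
        '<script>var a = 1;</script></body></html>')
features = extract_features(page)
print(features)
```

Calling `extract_features` on every page and stacking the rows gives the matrix described above, which is the form that tools such as libSVM or Weka expect as input.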

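Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you use at the end to decide which method is best. It is extremely important that you keep the test set hidden until the very end of your modeling step! The test data basically gives you an estimate of the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learning practitioners say that the model "overfits" the training data. Such overfitted models look good, but this is just memorization.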
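The effect is easy to reproduce. The sketch below sets aside a held-out test set and then evaluates a 1-nearest-neighbor classifier, which I picked only because it memorizes its training data by construction. The synthetic noisy data, the function names, and the 25% split are all assumptions for illustration.

```python
import random


def holdout_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices once and set aside a fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(rows[i], labels[i]) for i in train_idx]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test


def predict_1nn(train, x):
    """Return the label of the closest training row (pure memorization)."""
    nearest = min(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
    )
    return nearest[1]


def accuracy(train, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(1 for x, y in examples if predict_1nn(train, x) == y)
    return hits / len(examples)


# Hypothetical data: the label depends on the first feature plus noise,
# so no classifier can be perfect on unseen examples.
rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if x[0] + rng.gauss(0, 0.3) > 0.5 else -1 for x in rows]

train, test = holdout_split(rows, labels)
train_accuracy = accuracy(train, train)
test_accuracy = accuracy(train, test)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The training accuracy is perfect because every training row is its own nearest neighbor, which says nothing about how the model behaves on new pages. Only the held-out figure is an estimate of the generalization error, and on noisy data like this it will usually be noticeably lower.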

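While software support for preprocessing data is very sparse and highly domain dependent, Weka, as adam mentioned, is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology here helps you find your way around.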
