Some implementation details of Support Vector Machines (SVM)

Published 2024-09-13 19:39:00

In a particular application I needed machine learning (I know only what I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. It's working fine.

Now I need to improve the system. The problems here are:

  1. I get additional training examples every week. Right now the system retrains from scratch on the updated set (old examples + new examples). I want to make the learning incremental: use the previous knowledge (rather than the previous examples) together with the new examples to obtain a new model (new knowledge).

  2. Right now my training examples have 3 classes, so every training example is fitted into one of these 3 classes. I want an "Unknown" class: anything that doesn't fit the 3 known classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for it as well.

  3. Assuming the "unknown" class is implemented: when the class is "unknown", the user of the application inputs what he thinks the class might be. Now I need to incorporate that user input into the learning, and I have no idea how to do that either. Would it make any difference if the user inputs a new class (i.e., a class that is not already in the training set)?

Do I need to choose a new algorithm, or can Support Vector Machines do this?

PS: I'm using the libsvm implementation of SVM.


4 Answers

怪我鬧 2024-09-20 19:39:00

I just wrote my answer using the same organization as your question (1., 2., 3.).

  1. Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons certainly can, because subsequent training instances don't affect the basic network architecture; they just cause adjustments to the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, I don't know.

  2. I think you can solve this problem quite easily by configuring LIBSVM one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; applying an SVM to multiple classes means it has been coded to perform multiple, step-wise one-against-many classifications, but the algorithm is still trained (and tested) one class at a time. If you do this, then whatever is left after step-wise execution against the test set is "unknown"--in other words, any data that is not classified after performing the multiple, sequential one-class classifications is, by definition, in that 'unknown' class.

  3. Why not make the user's guess a feature (i.e., just another independent variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column "user class guess" to your data matrix and populate it with some value that is most likely to have no effect for the data points that are not in the 'unknown' category (and for which the user therefore won't offer a guess). This value could be '0' or '1', but it really depends on how your data is scaled and normalized. (A minimal sketch of this idea follows below.)
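
A minimal sketch of the "guess as a feature" idea, assuming NumPy arrays and scikit-learn's SVC (which uses libsvm internally); the extra column, the 0-means-no-guess encoding, and all data values are illustrative assumptions, not something from the original answer:

    import numpy as np
    from sklearn.svm import SVC  # scikit-learn's SVC wraps libsvm

    # Hypothetical data: X holds the original features, y the 3 known class labels.
    X = np.array([[0.2, 1.3], [0.8, 0.1], [0.5, 0.9], [0.9, 0.4]])
    y = np.array([0, 1, 2, 1])

    # The user's guess becomes one more feature column instead of the label.
    # 0 = "no guess offered"; 1, 2, 3 = guessed class (arbitrary encoding).
    user_guess = np.array([0, 0, 3, 0])
    X_with_guess = np.column_stack([X, user_guess])

    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_with_guess, y)

    # At prediction time the same column must be supplied (0 when there is no guess).
    print(clf.predict(np.array([[0.4, 1.0, 0.0]])))

Whether 0 is really a "neutral" value depends on how the data is scaled and normalized, as noted above.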

逆光飞翔i 2024-09-20 19:39:00

Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.

A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).

I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.

For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it should be a good substitute until a proper incremental version is implemented.
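
A sketch of what such periodic batch retraining could look like, assuming NumPy arrays and scikit-learn's SVC (which is backed by libsvm); the hyperparameters and the weekly cadence are arbitrary assumptions:

    import numpy as np
    from sklearn.svm import SVC  # backed by libsvm

    def weekly_retrain(old_X, old_y, new_X, new_y):
        # Not incremental learning: the model is rebuilt from *all* examples,
        # but libsvm is fast enough that this is often acceptable.
        X = np.vstack([old_X, new_X])
        y = np.concatenate([old_y, new_y])
        model = SVC(kernel="rbf", C=1.0, gamma="scale")
        model.fit(X, y)
        return model, X, y  # keep the grown dataset for next week's run

The returned dataset would be persisted somewhere (a file or a database) so that next week's batch of examples can simply be appended to it.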

I also don't think SVMs can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample always ends up being positively classified as something, even if it is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, randomly generate samples that lie outside those ranges, and then add these to your training set.

For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set

[({'color':3},'unknown'),({'color':125},'unknown')]

to give your SVM an idea of what an "unknown" color means.
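
A rough sketch of that out-of-range sampling idea, assuming the features sit in a NumPy matrix; the number of generated samples and the width of the band outside the observed range are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def make_unknown_samples(X, n_samples=50, margin=0.5):
        # Draw each feature uniformly from a band just below the observed
        # minimum or just above the observed maximum (margin is a fraction
        # of the feature's observed range).
        lo, hi = X.min(axis=0), X.max(axis=0)
        span = np.maximum(hi - lo, 1e-9)
        below = rng.uniform(lo - margin * span, lo, size=(n_samples, X.shape[1]))
        above = rng.uniform(hi, hi + margin * span, size=(n_samples, X.shape[1]))
        use_above = rng.integers(0, 2, size=(n_samples, X.shape[1])).astype(bool)
        return np.where(use_above, above, below)

    # The generated rows would then be appended to the training set with the
    # label "unknown" before retraining.

Note that this only covers points outside the observed ranges; a sample that lies inside the ranges but far from any training example would still be assigned to one of the known classes.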

莳間冲淡了誓言ζ 2024-09-20 19:39:00
  1. There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
  2. You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide on some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and the second most likely class would achieve this (see the sketch after this list).
  3. I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
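
A minimal sketch of the probability-gap threshold from point 2, using libsvm's probability estimates through scikit-learn's SVC with probability=True; the 0.25 gap threshold is an arbitrary assumption:

    import numpy as np
    from sklearn.svm import SVC  # probability=True enables libsvm's probability estimates

    def predict_with_unknown(model, X, gap_threshold=0.25):
        # Label a sample 'Unknown' when the gap between the most likely and
        # the second most likely class probability is below the threshold.
        proba = model.predict_proba(X)            # shape (n_samples, n_classes)
        top_two = np.sort(proba, axis=1)[:, -2:]  # second best and best, per row
        gap = top_two[:, 1] - top_two[:, 0]
        best = model.classes_[np.argmax(proba, axis=1)]
        return np.where(gap < gap_threshold, "Unknown", best.astype(str))

    # Usage (hypothetical data):
    #   clf = SVC(probability=True).fit(X_train, y_train)
    #   labels = predict_with_unknown(clf, X_test)

With the raw libsvm command-line tools, the same probability estimates are available by training and predicting with the '-b 1' option.
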
幽梦紫曦~ 2024-09-20 19:39:00

Even though this question is probably out of date, I feel obliged to give some additional thoughts.

  1. Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)

  2. Adding 'Unknown' as a class is not a good idea. Depending on its use, the reasons differ.

    • If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin that recognizes future instances of A poorly, i.e. its generalization performance will diminish. That is because the SVM will try to build a hyperplane which separates most instances of A (those officially labeled as 'A') onto one side of the hyperplane and some instances of A (those officially labeled as 'Unknown') onto the other side.

    • Another problem occurs if you use the 'Unknown' class to store all examples whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes are not known to the SVM, you may want to store them temporarily in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of their features. That will make it very hard to create good separating hyperplanes, and therefore the resulting classifier will do a poor job of recognizing new instances of D or E as 'Unknown'. The classification of new instances belonging to A, B or C will probably be hindered as well.

    To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.

  3. I would recommend that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single web page which shows an image of the object in question and a button for each known class. If the object in question belongs to a class that is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and to record which example belongs to which class, and implemented an export function to make the data SVM-ready; a sketch of such an export step follows below.)
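
A minimal sketch of the export step mentioned above, writing records in libsvm's sparse text format ("label index:value index:value ..."); the record layout and the file name are assumptions for illustration:

    def export_libsvm(records, path):
        # records: iterable of (int_label, {feature_index: value}) pairs,
        # with 1-based feature indices as libsvm expects.
        # Each output line: <label> <index>:<value> <index>:<value> ...
        with open(path, "w") as f:
            for label, features in records:
                parts = [str(label)]
                parts += ["%d:%g" % (i, v) for i, v in sorted(features.items())]
                f.write(" ".join(parts) + "\n")

    # Example with hypothetical rows pulled from the database described above:
    export_libsvm([(1, {1: 0.2, 2: 1.3}), (3, {1: 0.9, 3: 0.4})], "train.svm")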
