How to oversample instances in a data set in R

Posted 2024-11-26 18:02:38 · 197 words · 0 views · 0 comments

I have a data set with 20 classes, and it has a pretty non-uniform distribution. Is there any functionality in R that allows us to balance the data set (weighted perhaps)?

I want to use the balanced data with Weka for classification. Since my class distribution is skewed, I am hoping to get better results if there's no single majority class.

I have tried the SMOTE filter and the Resample filter, but they don't quite do what I want: I don't want any instances to be removed; repetition is fine.

Comments (1)

长发绾君心 2024-12-03 18:02:38

I think there's a misunderstanding in your terminology. Your question's title refers to sampling, and yet the question text involves weighting.

To clarify:

With sampling, you end up with fewer, the same number of, or more instances than in the original set; the unique members of a sample can be a strict subset of the original set or identical to it (when sampling with replacement, i.e., allowing duplicates).

By weighting, you simply adjust weights that may be used for some further purpose (e.g. sampling, machine learning) to address or impose some (im)balance relative to a uniform weighting.
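To make the sampling case concrete, here is a minimal R sketch of oversampling with replacement that keeps every original instance and only adds duplicates. The toy data frame `df`, its column names, and the helper `oversample` are hypothetical names chosen for illustration, and growing every class to the size of the largest class is one possible balancing target, not the only one.

```r
# Toy data: three classes with skewed counts (hypothetical example data).
set.seed(42)
df <- data.frame(
  x = rnorm(16),
  class = factor(rep(c("a", "b", "c"), times = c(10, 4, 2)))
)

# Oversample every class up to the size of the largest class.
# All original rows are kept; only the extra rows are drawn with replacement.
oversample <- function(data, label) {
  # Row indices grouped by class; unused factor levels are dropped first.
  idx_by_class <- split(seq_len(nrow(data)),
                        droplevels(as.factor(data[[label]])))
  target <- max(lengths(idx_by_class))
  idx <- lapply(idx_by_class, function(i) {
    extra <- target - length(i)  # number of duplicates this class needs
    c(i, i[sample.int(length(i), extra, replace = TRUE)])
  })
  data[unlist(idx, use.names = FALSE), , drop = FALSE]
}

balanced <- oversample(df, "class")
table(balanced$class)  # every class now has as many rows as the largest one
```

The balanced data frame could then be exported for Weka, for example with `write.arff` from the `foreign` package, assuming that package is installed.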

I believe that you are referring to weighting, but the same answer should work in both cases. If the total number of observations is N and the relative frequency of each class is an element of the length-20 vector freq (so the count of items in class 1 is freq[1]*N), then simply give each instance of class k the weight 1/freq[k]. You can scale the weights by some constant, e.g. N, but that does not change the balance between classes. If any frequency is 0 or very close to it, you might address this by using a vector of smoothed counts (e.g. Good-Turing smoothing).

As a result, each class will carry an equal share of the total weight.
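The inverse-frequency weighting described above can be sketched in R as follows. The label vector `labels` and the variable names are hypothetical, and three classes are used instead of 20 to keep the example short. The last step shows the same weights used as sampling probabilities; note that a weighted resample like this may leave out some original instances.

```r
# Hypothetical class labels with a skewed distribution.
set.seed(1)
labels <- factor(rep(c("a", "b", "c"), times = c(10, 4, 2)))
N <- length(labels)

# Relative frequency of each class; the entries of freq sum to 1.
freq <- table(labels) / N

# Each instance gets the weight 1 / freq of its own class.
w <- as.numeric(1 / freq[as.character(labels)])

# Total weight per class: the same for every class.
class_weight <- tapply(w, labels, sum)
print(class_weight)

# The same weights as sampling probabilities: a resample of size N drawn
# with replacement is balanced in expectation, but it can omit instances.
idx <- sample.int(N, size = N, replace = TRUE, prob = w)
table(labels[idx])
```

If the learner accepts per-instance weights, `w` can be passed to it directly; otherwise the weighted resample gives a roughly balanced data set.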
