How to oversample instances in a data set in R
I have a data set with 20 classes, and it has a pretty non-uniform distribution. Is there any functionality in R that allows us to balance the data set (weighted perhaps)?
I want to use the balanced data with Weka for classification. Since my class distribution is skewed, I am hoping to get better results if there's no single majority class.
I have tried to use the SMOTE filter and Resample filter but they don't quite do what I want.
I don't want any instances to be removed; repetition is fine.
I think there's a misunderstanding in your terminology. Your question's title refers to sampling, and yet the question text involves weighting.
To clarify:
With sampling, you end up with fewer, the same number of, or more instances than in the original set. The distinct members of a sample can be a strict subset of the original set or identical to it; when sampling with replacement, some of them appear as duplicates.
By weighting, you simply adjust weights that may be used for some further purpose (e.g. sampling, machine learning) to address or impose some (im)balance relative to a uniform weighting.
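For the sampling case where nothing is removed and repetition is fine, here is a minimal R sketch. The data frame `df`, its column `class`, and the helper `oversample()` are made up for illustration; it is one simple way to do it, not a built-in R function. Every original row is kept, and each smaller class is topped up with duplicates until it matches the largest class.

```r
# Hypothetical toy data: three classes with 7, 3 and 2 instances.
set.seed(42)
df <- data.frame(
  x = rnorm(12),
  class = factor(rep(c("a", "b", "c"), times = c(7, 3, 2)))
)

# Oversample every class up to the size of the largest class.
# All original rows are kept; only the shortfall is drawn with replacement.
oversample <- function(data, label) {
  idx_by_class <- split(seq_len(nrow(data)), data[[label]], drop = TRUE)
  target <- max(lengths(idx_by_class))
  picked <- lapply(idx_by_class, function(idx) {
    # Index through sample.int() so a class with a single row is handled safely.
    extra <- idx[sample.int(length(idx), target - length(idx), replace = TRUE)]
    c(idx, extra)
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

balanced <- oversample(df, "class")
table(balanced$class)  # every class now has as many rows as the largest one
```

The balanced data frame can then be written out (for example with `write.csv()`) and loaded into Weka in the usual way.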
I believe that you are referring to weighting, but the same answer should work in both cases. If the total number of observations is `N` and the frequency of each class is an element of the 20-long vector `freq` (e.g. the count of items in class 1 is `freq[1]*N`), then simply use a weight vector of `1/freq` to normalize the weights. You can scale it by some constant, e.g. `N`, though it wouldn't matter. As a result, each class will have an equal proportion of the total weight. In case any frequency is 0 or very close to it, you might address this by using a vector of smoothed counts (e.g. Good-Turing smoothing).
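A minimal R sketch of the inverse-frequency weighting, assuming the class labels are in a factor (the vector `labels` and its values are made up for illustration):

```r
# Hypothetical class labels; in practice this would be the label column.
labels <- factor(rep(c("a", "b", "c"), times = c(7, 3, 2)))

N <- length(labels)
freq <- table(labels) / N   # relative frequency of each class
class_weight <- 1 / freq    # inverse-frequency weight per class

# One weight per instance, looked up by that instance's class.
w <- as.numeric(class_weight[as.character(labels)])

# The total weight is the same for every class
# (equal to N here, up to floating-point rounding).
tapply(w, labels, sum)
```

If you would rather sample than weight, the same vector can be passed as the `prob` argument of `sample()` together with `replace = TRUE`, so that each class is drawn with equal probability overall.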