Stratified sampling gives odd proportions of factor levels
I'm experimenting with the Dry Bean dataset from the UCI Machine Learning Repository. I want to iterate through the dataset, repeatedly removing samples and running classifiers, to see how accuracy changes as sample size decreases. But first I'm building the loop and checking that it works.
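For context, the data was loaded roughly like this; the file name assumes the standard UCI download (Dry_Bean_Dataset.xlsx), and readxl is just one way to read it:

library(readxl)  # assumes the UCI .xlsx file is in the working directory
DryBean <- as.data.frame(read_excel("Dry_Bean_Dataset.xlsx"))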
The dataset:
> summary(DryBean)
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity
Min. : 20420 Min. : 524.7 Min. :183.6 Min. :122.5 Min. :1.025 Min. :0.2190 Min. : 20684 Min. :161.2 Min. :0.5553 Min. :0.9192
1st Qu.: 36328 1st Qu.: 703.5 1st Qu.:253.3 1st Qu.:175.8 1st Qu.:1.432 1st Qu.:0.7159 1st Qu.: 36715 1st Qu.:215.1 1st Qu.:0.7186 1st Qu.:0.9857
Median : 44652 Median : 794.9 Median :296.9 Median :192.4 Median :1.551 Median :0.7644 Median : 45178 Median :238.4 Median :0.7599 Median :0.9883
Mean : 53048 Mean : 855.3 Mean :320.1 Mean :202.3 Mean :1.583 Mean :0.7509 Mean : 53768 Mean :253.1 Mean :0.7497 Mean :0.9871
3rd Qu.: 61332 3rd Qu.: 977.2 3rd Qu.:376.5 3rd Qu.:217.0 3rd Qu.:1.707 3rd Qu.:0.8105 3rd Qu.: 62294 3rd Qu.:279.4 3rd Qu.:0.7869 3rd Qu.:0.9900
Max. :254616 Max. :1985.4 Max. :738.9 Max. :460.2 Max. :2.430 Max. :0.9114 Max. :263261 Max. :569.4 Max. :0.8662 Max. :0.9947
roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4 Class
Min. :0.4896 Min. :0.6406 Min. :0.002778 Min. :0.0005642 Min. :0.4103 Min. :0.9477 BARBUNYA:1322
1st Qu.:0.8321 1st Qu.:0.7625 1st Qu.:0.005900 1st Qu.:0.0011535 1st Qu.:0.5814 1st Qu.:0.9937 BOMBAY : 522
Median :0.8832 Median :0.8013 Median :0.006645 Median :0.0016935 Median :0.6420 Median :0.9964 CALI :1630
Mean :0.8733 Mean :0.7999 Mean :0.006564 Mean :0.0017159 Mean :0.6436 Mean :0.9951 DERMASON:3546
3rd Qu.:0.9169 3rd Qu.:0.8343 3rd Qu.:0.007271 3rd Qu.:0.0021703 3rd Qu.:0.6960 3rd Qu.:0.9979 HOROZ :1928
Max. :0.9907 Max. :0.9873 Max. :0.010451 Max. :0.0036650 Max. :0.9748 Max. :0.9997 SEKER :2027
SIRA :2636
I set the Class variable to a factor, then found the proportions of the levels:
> prop.table(table(DryBean$Class))
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
0.09712732 0.03835133 0.11975608 0.26052458 0.14165014 0.14892366 0.19366689
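(The factor step itself isn't shown above; a minimal sketch, assuming Class was read in as character:)

DryBean$Class <- as.factor(DryBean$Class)  # Class as a factor
prop.table(table(DryBean$Class))           # gives the proportions above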
I created a copy of the data (beanclone), then used a loop to remove samples in those proportions:
library(splitstackshape)  # provides stratified()
beanclone <- DryBean      # working copy of the data
while (nrow(beanclone) > 100) {
  # draw a stratified sample, keeping the original row names in column "rn"
  mysample <- stratified(beanclone, "Class",
                         size = c("BARBUNYA" = 0.09712732, "BOMBAY" = 0.03835133, "CALI" = 0.11975608,
                                  "DERMASON" = 0.26052458, "HOROZ" = 0.14165014, "SEKER" = 0.14892366, "SIRA" = 0.19366689),
                         keep.rownames = TRUE)
  beanclone <- beanclone[!seq_len(nrow(beanclone)) %in% mysample$rn, ]  # remove the sampled rows
  print(table(beanclone$Class))
  print(nrow(beanclone))
}
However, the first few iterations of the loop had this output:
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
1271 459 1205 2859 1655 1830 2243
[1] 11522
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
1222 404 891 2305 1421 1652 1909
[1] 9804
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
1175 356 659 1859 1220 1492 1625
[1] 8386
The BARBUNYA level is losing samples much more slowly than the rest, ultimately ending at this distribution:
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
80 4 1 2 3 5 3
[1] 98
Testing with size set to a different integer for each level showed that the loop associates the correct size label with each level, e.g. setting "BARBUNYA" = 3 results in BARBUNYA losing exactly 3 samples at each iteration. So why does it change so slowly when going by proportion?
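For reference, a minimal sketch of that integer test (the specific counts are illustrative):

mysample <- stratified(beanclone, "Class",
                       size = c("BARBUNYA" = 3, "BOMBAY" = 3, "CALI" = 3, "DERMASON" = 3,
                                "HOROZ" = 3, "SEKER" = 3, "SIRA" = 3),
                       keep.rownames = TRUE)  # each level then loses exactly its stated count per iteration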
Edit:
Changing some of the proportion decimals produced unexpected results. Changing BARBUNYA from 0.09712732 to 0.5 ended the loop at:
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
89 4 1 2 3 1 3
[1] 103
BARBUNYA BOMBAY CALI DERMASON HOROZ SEKER SIRA
86 4 1 2 3 1 3
[1] 100
That is a negligible increase in the number of BARBUNYA samples removed at each iteration.
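For clarity, the only change in that test was the BARBUNYA entry of the size vector (sketch):

size = c("BARBUNYA" = 0.5, "BOMBAY" = 0.03835133, "CALI" = 0.11975608, "DERMASON" = 0.26052458,
         "HOROZ" = 0.14165014, "SEKER" = 0.14892366, "SIRA" = 0.19366689)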