Stratified sampling produces odd proportions of factor levels

Posted 2025-02-11 04:14:40


I'm experimenting with the Dry Bean dataset from the UCI Machine Learning Repository. I want to iterate through the dataset, repeatedly removing samples and re-running classifiers to see how accuracy changes as the sample size decreases, but first I built the loop and checked that it works.

The dataset:

> summary(DryBean)
      Area          Perimeter      MajorAxisLength MinorAxisLength  AspectRation    Eccentricity      ConvexArea     EquivDiameter       Extent          Solidity     
 Min.   : 20420   Min.   : 524.7   Min.   :183.6   Min.   :122.5   Min.   :1.025   Min.   :0.2190   Min.   : 20684   Min.   :161.2   Min.   :0.5553   Min.   :0.9192  
 1st Qu.: 36328   1st Qu.: 703.5   1st Qu.:253.3   1st Qu.:175.8   1st Qu.:1.432   1st Qu.:0.7159   1st Qu.: 36715   1st Qu.:215.1   1st Qu.:0.7186   1st Qu.:0.9857  
 Median : 44652   Median : 794.9   Median :296.9   Median :192.4   Median :1.551   Median :0.7644   Median : 45178   Median :238.4   Median :0.7599   Median :0.9883  
 Mean   : 53048   Mean   : 855.3   Mean   :320.1   Mean   :202.3   Mean   :1.583   Mean   :0.7509   Mean   : 53768   Mean   :253.1   Mean   :0.7497   Mean   :0.9871  
 3rd Qu.: 61332   3rd Qu.: 977.2   3rd Qu.:376.5   3rd Qu.:217.0   3rd Qu.:1.707   3rd Qu.:0.8105   3rd Qu.: 62294   3rd Qu.:279.4   3rd Qu.:0.7869   3rd Qu.:0.9900  
 Max.   :254616   Max.   :1985.4   Max.   :738.9   Max.   :460.2   Max.   :2.430   Max.   :0.9114   Max.   :263261   Max.   :569.4   Max.   :0.8662   Max.   :0.9947  
                                                                                                                                                                      
   roundness       Compactness      ShapeFactor1       ShapeFactor2        ShapeFactor3     ShapeFactor4         Class     
 Min.   :0.4896   Min.   :0.6406   Min.   :0.002778   Min.   :0.0005642   Min.   :0.4103   Min.   :0.9477   BARBUNYA:1322  
 1st Qu.:0.8321   1st Qu.:0.7625   1st Qu.:0.005900   1st Qu.:0.0011535   1st Qu.:0.5814   1st Qu.:0.9937   BOMBAY  : 522  
 Median :0.8832   Median :0.8013   Median :0.006645   Median :0.0016935   Median :0.6420   Median :0.9964   CALI    :1630  
 Mean   :0.8733   Mean   :0.7999   Mean   :0.006564   Mean   :0.0017159   Mean   :0.6436   Mean   :0.9951   DERMASON:3546  
 3rd Qu.:0.9169   3rd Qu.:0.8343   3rd Qu.:0.007271   3rd Qu.:0.0021703   3rd Qu.:0.6960   3rd Qu.:0.9979   HOROZ   :1928  
 Max.   :0.9907   Max.   :0.9873   Max.   :0.010451   Max.   :0.0036650   Max.   :0.9748   Max.   :0.9997   SEKER   :2027  
                                                                                                            SIRA    :2636  

I set the Class variable to a factor, then found the proportion of levels:

> prop.table(table(DryBean$Class))

  BARBUNYA     BOMBAY       CALI   DERMASON      HOROZ      SEKER       SIRA 
0.09712732 0.03835133 0.11975608 0.26052458 0.14165014 0.14892366 0.19366689
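For reference, these proportions can be reproduced directly from the class counts shown in `summary(DryBean)` (a quick base-R check; the `counts` vector below is transcribed from that output):

```r
# Class counts as reported by summary(DryBean)
counts <- c(BARBUNYA = 1322, BOMBAY = 522, CALI = 1630, DERMASON = 3546,
            HOROZ = 1928, SEKER = 2027, SIRA = 2636)

# Each class's share of the 13611 total, matching prop.table(table(DryBean$Class))
round(counts / sum(counts), 8)
```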

I created a copy of the data then used a loop to remove samples in those proportions:

library(splitstackshape)  # provides stratified()

while (nrow(beanclone) > 100) {
  mysample <- stratified(beanclone, "Class",
                         size = c("BARBUNYA" = 0.09712732, "BOMBAY" = 0.03835133,
                                  "CALI" = 0.11975608, "DERMASON" = 0.26052458,
                                  "HOROZ" = 0.14165014, "SEKER" = 0.14892366,
                                  "SIRA" = 0.19366689),
                         keep.rownames = TRUE)
  beanclone <- beanclone[!seq_len(nrow(beanclone)) %in% mysample$rn, ]
  print(table(beanclone$Class))
  print(nrow(beanclone))
}
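The removal step is an anti-join on row names: `keep.rownames = TRUE` stores each sampled row's original row name in an `rn` column, and the `%in%` test (which coerces the integer positions to character, so `"2"` matches `2`) drops those rows. A toy sketch of just that pattern, with made-up names `df` and `sampled_rn`:

```r
# Five rows; suppose a sampler returned the row names "2" and "4"
df <- data.frame(Class = c("A", "A", "B", "B", "B"))
sampled_rn <- c("2", "4")

# Anti-join: keep every row whose position is NOT among the sampled row names
df <- df[!seq_len(nrow(df)) %in% sampled_rn, , drop = FALSE]
nrow(df)  # 3 rows remain
```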

However, the first few iterations of the loop had this output:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1271      459     1205     2859     1655     1830     2243 
[1] 11522

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1222      404      891     2305     1421     1652     1909 
[1] 9804

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1175      356      659     1859     1220     1492     1625 
[1] 8386

The Barbunya level is losing samples much more slowly than the rest, ultimately ending at this distribution:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
      80        4        1        2        3        5        3 
[1] 98

Testing with `size` set to a different integer for each level showed that the loop associates the correct size label with each level, i.e. setting "BARBUNYA" = 3 results in BARBUNYA losing 3 samples at each iteration. So why does it change so slowly when going by proportion?
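For comparison, under the assumption that a named vector of decimals makes `stratified()` sample each class by its own proportion (which may not be what it actually does, given the output above), the loop would be expected to print counts like these, with every class shrinking together rather than BARBUNYA lagging:

```r
# Pure base-R simulation of proportional removal, no stratified() involved.
# counts and props are transcribed from summary(DryBean) and prop.table() above.
counts <- c(BARBUNYA = 1322, BOMBAY = 522, CALI = 1630, DERMASON = 3546,
            HOROZ = 1928, SEKER = 2027, SIRA = 2636)
props  <- c(BARBUNYA = 0.09712732, BOMBAY = 0.03835133, CALI = 0.11975608,
            DERMASON = 0.26052458, HOROZ = 0.14165014, SEKER = 0.14892366,
            SIRA = 0.19366689)

for (i in 1:3) {
  # Remove each class's own proportion of its current count
  counts <- counts - round(counts * props)
  print(counts)
}
```

Under this interpretation BARBUNYA would drop by about 128 on the first pass (round(1322 * 0.09712732)), not the 51 seen in the actual output.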

Edit:

Changing some of the proportion decimals has produced unexpected results.
Changing BARBUNYA's proportion from 0.09712732 to 0.5 ended the loop at:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
     89        4        1        2        3        1        3 
[1] 103

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
     86        4        1        2        3        1        3 
[1] 100

That is a negligible increase in the number of BARBUNYA samples removed at each iteration.
