Stratified sampling produces odd proportions of factor levels

Posted 2025-02-11 04:14:40


I'm experimenting with the Dry Bean dataset from the UCI Machine Learning Repository. I want to iterate through the dataset, repeatedly removing samples and re-running classifiers to see how accuracy changes as the sample size decreases, but first I built the loop and checked that it works.

The dataset:

> summary(DryBean)
      Area          Perimeter      MajorAxisLength MinorAxisLength  AspectRation    Eccentricity      ConvexArea     EquivDiameter       Extent          Solidity     
 Min.   : 20420   Min.   : 524.7   Min.   :183.6   Min.   :122.5   Min.   :1.025   Min.   :0.2190   Min.   : 20684   Min.   :161.2   Min.   :0.5553   Min.   :0.9192  
 1st Qu.: 36328   1st Qu.: 703.5   1st Qu.:253.3   1st Qu.:175.8   1st Qu.:1.432   1st Qu.:0.7159   1st Qu.: 36715   1st Qu.:215.1   1st Qu.:0.7186   1st Qu.:0.9857  
 Median : 44652   Median : 794.9   Median :296.9   Median :192.4   Median :1.551   Median :0.7644   Median : 45178   Median :238.4   Median :0.7599   Median :0.9883  
 Mean   : 53048   Mean   : 855.3   Mean   :320.1   Mean   :202.3   Mean   :1.583   Mean   :0.7509   Mean   : 53768   Mean   :253.1   Mean   :0.7497   Mean   :0.9871  
 3rd Qu.: 61332   3rd Qu.: 977.2   3rd Qu.:376.5   3rd Qu.:217.0   3rd Qu.:1.707   3rd Qu.:0.8105   3rd Qu.: 62294   3rd Qu.:279.4   3rd Qu.:0.7869   3rd Qu.:0.9900  
 Max.   :254616   Max.   :1985.4   Max.   :738.9   Max.   :460.2   Max.   :2.430   Max.   :0.9114   Max.   :263261   Max.   :569.4   Max.   :0.8662   Max.   :0.9947  
                                                                                                                                                                      
   roundness       Compactness      ShapeFactor1       ShapeFactor2        ShapeFactor3     ShapeFactor4         Class     
 Min.   :0.4896   Min.   :0.6406   Min.   :0.002778   Min.   :0.0005642   Min.   :0.4103   Min.   :0.9477   BARBUNYA:1322  
 1st Qu.:0.8321   1st Qu.:0.7625   1st Qu.:0.005900   1st Qu.:0.0011535   1st Qu.:0.5814   1st Qu.:0.9937   BOMBAY  : 522  
 Median :0.8832   Median :0.8013   Median :0.006645   Median :0.0016935   Median :0.6420   Median :0.9964   CALI    :1630  
 Mean   :0.8733   Mean   :0.7999   Mean   :0.006564   Mean   :0.0017159   Mean   :0.6436   Mean   :0.9951   DERMASON:3546  
 3rd Qu.:0.9169   3rd Qu.:0.8343   3rd Qu.:0.007271   3rd Qu.:0.0021703   3rd Qu.:0.6960   3rd Qu.:0.9979   HOROZ   :1928  
 Max.   :0.9907   Max.   :0.9873   Max.   :0.010451   Max.   :0.0036650   Max.   :0.9748   Max.   :0.9997   SEKER   :2027  
                                                                                                            SIRA    :2636  

I set the Class variable to a factor, then found the proportion of levels:

> prop.table(table(DryBean$Class))

  BARBUNYA     BOMBAY       CALI   DERMASON      HOROZ      SEKER       SIRA 
0.09712732 0.03835133 0.11975608 0.26052458 0.14165014 0.14892366 0.19366689
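For reference, these proportions can be reproduced directly from the class counts shown in `summary(DryBean)` (a quick base-R check; the `counts` vector below is transcribed from that output):

```r
# Class counts as reported by summary(DryBean)
counts <- c(BARBUNYA = 1322, BOMBAY = 522, CALI = 1630, DERMASON = 3546,
            HOROZ = 1928, SEKER = 2027, SIRA = 2636)

# Each class's share of the 13611 total, matching prop.table(table(DryBean$Class))
round(counts / sum(counts), 8)
```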

I created a copy of the data then used a loop to remove samples in those proportions:

library(splitstackshape)  # provides stratified()

while (nrow(beanclone) > 100) {
  mysample <- stratified(beanclone, "Class",
                         size = c("BARBUNYA" = 0.09712732, "BOMBAY" = 0.03835133,
                                  "CALI" = 0.11975608, "DERMASON" = 0.26052458,
                                  "HOROZ" = 0.14165014, "SEKER" = 0.14892366,
                                  "SIRA" = 0.19366689),
                         keep.rownames = TRUE)
  beanclone <- beanclone[!seq_len(nrow(beanclone)) %in% mysample$rn, ]
  print(table(beanclone$Class))
  print(nrow(beanclone))
}
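The removal step is an anti-join on row names: `keep.rownames = TRUE` stores each sampled row's original row name in an `rn` column, and the `%in%` test (which coerces the integer positions to character, so `"2"` matches `2`) drops those rows. A toy sketch of just that pattern, with made-up names `df` and `sampled_rn`:

```r
# Five rows; suppose a sampler returned the row names "2" and "4"
df <- data.frame(Class = c("A", "A", "B", "B", "B"))
sampled_rn <- c("2", "4")

# Anti-join: keep every row whose position is NOT among the sampled row names
df <- df[!seq_len(nrow(df)) %in% sampled_rn, , drop = FALSE]
nrow(df)  # 3 rows remain
```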

However, the first few iterations of the loop had this output:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1271      459     1205     2859     1655     1830     2243 
[1] 11522

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1222      404      891     2305     1421     1652     1909 
[1] 9804

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
    1175      356      659     1859     1220     1492     1625 
[1] 8386

The Barbunya level is losing samples much more slowly than the rest, ultimately ending at this distribution:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
      80        4        1        2        3        5        3 
[1] 98

Testing with `size` set to a different integer for each level showed that the loop associates the correct size label with each level, i.e. setting "BARBUNYA" = 3 results in BARBUNYA losing 3 samples at each iteration. So why does it change so slowly when going by proportion?
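For comparison, under the assumption that a named vector of decimals makes `stratified()` sample each class by its own proportion (which may not be what it actually does, given the output above), the loop would be expected to print counts like these, with every class shrinking together rather than BARBUNYA lagging:

```r
# Pure base-R simulation of proportional removal, no stratified() involved.
# counts and props are transcribed from summary(DryBean) and prop.table() above.
counts <- c(BARBUNYA = 1322, BOMBAY = 522, CALI = 1630, DERMASON = 3546,
            HOROZ = 1928, SEKER = 2027, SIRA = 2636)
props  <- c(BARBUNYA = 0.09712732, BOMBAY = 0.03835133, CALI = 0.11975608,
            DERMASON = 0.26052458, HOROZ = 0.14165014, SEKER = 0.14892366,
            SIRA = 0.19366689)

for (i in 1:3) {
  # Remove each class's own proportion of its current count
  counts <- counts - round(counts * props)
  print(counts)
}
```

Under this interpretation BARBUNYA would drop by about 128 on the first pass (round(1322 * 0.09712732)), not the 51 seen in the actual output.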

Edit:

Changing some of the proportion decimals has produced unexpected results.
Changing BARBUNYA's proportion from 0.09712732 to 0.5 ended the loop at:

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
     89        4        1        2        3        1        3 
[1] 103

BARBUNYA   BOMBAY     CALI DERMASON    HOROZ    SEKER     SIRA 
     86        4        1        2        3        1        3 
[1] 100

That is a negligible increase in the number of BARBUNYA samples removed at each iteration.
