Partitioning into classes: jenks vs kmeans

Posted on 2024-10-22 02:03:59

I want to partition a vector (length around 10^5) into five classes. With the function classIntervals from the package classInt I wanted to use style = "jenks" (natural breaks), but this takes an inordinate amount of time even for a much smaller vector of only 500 values. Setting style = "kmeans" executes almost instantaneously.

library(classInt)

my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

R> system.time(classIntervals(x, n = 5, style = "jenks"))
   user  system elapsed 
  13.46    0.00   13.45 

R> system.time(classIntervals(x, n = 5, style = "kmeans"))
   user  system elapsed 
   0.02    0.00    0.02

What makes the Jenks algorithm so slow, and is there a faster way to run it?

If need be I will move the last two parts of the question to stats.stackexchange.com:

  • Under what circumstances is kmeans a reasonable substitute for Jenks?
  • Is it reasonable to define classes by running classInt on a random 1% subset of the data points?

Comments (2)

岁吢 2024-10-29 02:03:59

To answer your original question:

What makes the Jenks algorithm so slow, and is there a faster way to run it?

There is now a faster way to apply the Jenks algorithm: the getJenksBreaks function in the BAMMtools package.

However, be aware that the number of breaks is specified differently. If you set n = 5 in the classIntervals function of the classInt package, you have to pass 6 to the getJenksBreaks function in the BAMMtools package to get the same results (five classes are delimited by six break points).

# Install and load library
install.packages("BAMMtools")
library(BAMMtools)

# Set up example data
my_n <- 100
set.seed(1)
x <- mapply(rnorm, n = my_n, mean = (1:5) * 5)

# Apply function
getJenksBreaks(x, 6)
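
Once the breaks are available, each value can be assigned to its class. The following is a minimal base R sketch, assuming the break vector is sorted and spans the full range of the data. `assign_classes` is a hypothetical helper name, not part of BAMMtools or classInt, and the breaks in the example are picked by hand for illustration rather than computed by Jenks.

```r
# Hypothetical helper: map each value to a class index given sorted breaks.
assign_classes <- function(values, breaks) {
  # include.lowest = TRUE keeps the minimum inside the first class;
  # labels = FALSE returns integer class codes 1..(length(breaks) - 1).
  cut(as.vector(values), breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

# Illustrative hand-picked breaks (not Jenks output): 3 breaks define 2 classes.
assign_classes(c(1, 2, 3, 10, 11), breaks = c(1, 5, 11))
# 1 1 1 2 2
```

With the data above this would be called as `assign_classes(x, getJenksBreaks(x, 6))`, assuming the returned breaks include the minimum and maximum of `x`.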

The speed-up is huge:

> microbenchmark( getJenksBreaks(x, 6, subset = NULL),  classIntervals(x, n = 5, style = "jenks"), unit="s", times=10)
Unit: seconds
                                      expr         min          lq        mean      median          uq         max neval cld
       getJenksBreaks(x, 6, subset = NULL) 0.002824861 0.003038748 0.003270575 0.003145692 0.003464058 0.004263771    10  a 
 classIntervals(x, n = 5, style = "jenks") 2.008109622 2.033353970 2.094278189 2.103680325 2.111840853 2.231148846    10   
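
For vectors of around 10^5 values, as in the question, the subsampling idea can be tried out mechanically. The `subset` argument seen in the benchmark call appears to be meant for this (the BAMMtools help page seems to describe it as a number of regularly spaced samples, but check the documentation before relying on that). The sketch below is a generic base R version under stated assumptions: `breaks_from_sample` and `quantile_breaks` are hypothetical names, and the quantile function is only a stand-in so that the example runs without extra packages.

```r
# Hypothetical helper: estimate breaks on a random subset, then widen the
# outer breaks so every value of the full vector falls inside some class.
breaks_from_sample <- function(values, frac, break_fun, ...) {
  values <- as.vector(values)
  n_sub <- min(length(values), max(2, ceiling(frac * length(values))))
  sub <- sample(values, n_sub)
  brks <- sort(break_fun(sub, ...))
  brks[1] <- min(values)             # stretch the lowest break to the data minimum
  brks[length(brks)] <- max(values)  # stretch the highest break to the data maximum
  brks
}

set.seed(1)
big_x <- as.vector(mapply(rnorm, n = 20000, mean = (1:5) * 5))  # 10^5 values

# Stand-in break function for illustration only (equal-count quantiles).
# For Jenks breaks, pass function(v, k) BAMMtools::getJenksBreaks(v, k) instead.
quantile_breaks <- function(v, k) {
  unname(quantile(v, probs = seq(0, 1, length.out = k)))
}

brks <- breaks_from_sample(big_x, frac = 0.01, break_fun = quantile_breaks, k = 6)
table(cut(big_x, brks, include.lowest = TRUE, labels = FALSE))
```

Whether a 1% sample is representative enough depends on the data; the sketch only shows the mechanics of fitting breaks on a subset and applying them to the full vector.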

云醉月微眠 2024-10-29 02:03:59

From ?BAMMtools::getJenksBreaks

The Jenks natural breaks method was ported to C from code found in the classInt R package.

The two functions implement the same algorithm; getJenksBreaks is faster only because of its implementation (compiled C rather than interpreted R).
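
To give a feel for the amount of work involved, here is a base R sketch of the kind of optimization that Jenks natural breaks is based on: choosing class boundaries on the sorted data so that the total within-class sum of squared deviations is minimal, solved by dynamic programming. It is not a port of the classInt or BAMMtools code. `optimal_breaks` is a hypothetical name, and the returned vector uses its own convention (lowest value of each class, then the maximum), which may differ from what those packages report.

```r
# Sketch: optimal 1-D partition into k classes by dynamic programming.
# Hypothetical function; returns k + 1 values (class lower bounds, then the max).
optimal_breaks <- function(values, k) {
  v <- sort(as.vector(values))
  n <- length(v)
  stopifnot(k >= 1, k <= n)
  s1 <- c(0, cumsum(v))    # prefix sums
  s2 <- c(0, cumsum(v^2))  # prefix sums of squares
  # Sum of squared deviations from the mean of v[i..j] (vectorized in i or j)
  sse <- function(i, j) {
    (s2[j + 1] - s2[i]) - (s1[j + 1] - s1[i])^2 / (j - i + 1)
  }
  cost <- matrix(Inf, n, k)   # cost[j, m]: best cost of v[1..j] split into m classes
  start <- matrix(1L, n, k)   # start[j, m]: first index of the last class
  cost[, 1] <- sse(1, seq_len(n))
  for (m in seq_len(k)[-1]) {       # m = 2..k
    for (j in m:n) {
      i <- m:j                       # candidate start indices of the last class
      total <- cost[i - 1, m - 1] + sse(i, j)
      best <- which.min(total)
      cost[j, m] <- total[best]
      start[j, m] <- i[best]
    }
  }
  # Walk back from the last class to find where each class starts
  first <- integer(k)
  j <- n
  for (m in k:1) {
    first[m] <- start[j, m]
    j <- first[m] - 1L
  }
  c(v[first], v[n])
}

optimal_breaks(c(1, 2, 3, 10, 11, 12, 20, 21, 22), 3)
# 1 10 20 22
```

For each of the k classes, every end point j is compared against every possible start point i, so the work grows roughly with k times the square of the vector length. That is one plausible reason an interpreted implementation struggles on large vectors where compiled code does not. Because the breaks here are class lower bounds, they pair with `cut(..., right = FALSE, include.lowest = TRUE)`.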
