有效的代码可以使用BigZ级值进行设置操作?

发布于 2025-02-04 01:37:39 字数 282 浏览 4 评论 0原文

程序包的当前版本GMP不支持设置操作,例如IntersectsetDiff等。 (请参阅 oeis 以获取示例),并且需要处理大型整数集合。我目前坚持使用各种循环来生成所需的差异或交叉点;虽然我可能可以生成编译(RCCP等)代码,但我希望在现有r功能和软件包中找到一种方法。

The current release of the package gmp does not support set operations such as intersect, setdiff , etc. I'm doing some work with number sequences (see OEIS for examples) and need to handle large collections of large integers. I'm currently stuck with using various loops to generate the desired differences or intersections; while I could probably generate compiled (Rccp, etc) code, I'm hoping to find a way within existing R functions and packages.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

痕至 2025-02-11 01:37:39

使用bignum而不是GMP使用InterSect后,什么返回targine。并重新分配需要时间。

library(bignum)

t1 <- biginteger(1:4)
t2 <- biginteger(3:6)
intersect(t1, t2)
#[1] "3" "4"

biginteger(intersect(t1, t2))
#<biginteger[2]>
#[1] 3 4

bigz存储在list中,而不是vector

library(gmp)

intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2

intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 3
#
#[[2]]
#Big Integer ('bigz') :
#[1] 4

setdiff(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 1
#
#[[2]]
#Big Integer ('bigz') :
#[1] 2

f3 <- as.list(as.bigz(c(3,5,9,6,4)))
f4 <- as.list(as.bigz(c(6,7,8,10,9)))
intersect(f3, f4)
#[[1]]
#Big Integer ('bigz') :
#[1] 9
#
#[[2]]
#Big Integer ('bigz') :
#[1] 6

不幸的是,这比将其转换为角色要慢得多。

对于Intersect 重复可以使用。但这也比转换为角色要慢。

. <- c(unique(as.bigz(1:4)), unique(as.bigz(3:6)))
.[duplicated(.)]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

彼此比较每个元素也较慢:

t1 <- unique(as.bigz(1:4))
t2 <- unique(as.bigz(3:6))
t1[sapply(t1, function(x) any(x == t2))]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

在这种情况下,边缘更快的是:

t1 <- kit::funique(as.character(as.bigz(1:4)))
t2 <- as.character(as.bigz(3:6));
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])
#Big Integer ('bigz') object of length 2:
#[1] 3 4

一个简单的RCPP版本。

library(Rcpp)
sourceCpp(code=r"(
#include <Rcpp.h>
#include <list>
#include <cstring>
using namespace Rcpp;
// [[Rcpp::export]]
RawVector fun(const RawVector &x, const RawVector &y) {
  std::list<uint32_t const*> xAdr;
  std::list<uint32_t const*> yAdr;
  const uint32_t *px = (const uint32_t *) &x[0];
  const uint32_t *py = (const uint32_t *) &y[0];
  uint32_t nextNr = 1;
  for(uint_fast32_t j=0; j<px[0]; ++j) {
    uint32_t n = px[nextNr] + 2;
    auto i = xAdr.cbegin();
    for(; i != xAdr.cend(); ++i) {
      if(std::memcmp(&px[nextNr], *i, n * sizeof(uint32_t)) == 0) break;
    }
    if(i == xAdr.cend()) xAdr.push_back(&px[nextNr]);
    nextNr += n;
  }
  nextNr = 1;
  for(uint_fast32_t j=0; j<py[0]; ++j) {
    yAdr.push_back(&py[nextNr]);
    nextNr += py[nextNr] + 2;;
  }
  uint32_t l=1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ) {
    auto i = yAdr.cbegin();
    for(; i != yAdr.cend(); ++i) {
      if(std::memcmp(*j, *i, (**j + 2) * sizeof(uint32_t)) == 0) {
        yAdr.erase(i);
        break;
      }
    }
    if(i == yAdr.cend()) j = xAdr.erase(j);
    else {l += **j + 2; ++j;}
  }
  RawVector res(Rcpp::no_init(l * sizeof(uint32_t)));
  res.attr("class") = "bigz";
  uint32_t *p = (uint32_t *) &res[0];
  p[0] = xAdr.size();
  nextNr = 1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ++j) {
    std::memcpy(&p[nextNr], *j, (**j + 2) * sizeof(uint32_t));
    nextNr += p[nextNr] + 2;
  }
  return res;
}
)")

fun(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4

基准

library(gmp)
x <- as.bigz("10000000000000000000000000000000000000000")+1:4
y <- as.bigz("10000000000000000000000000000000000000000")+3:6

library(bignum)
xb <- biginteger("10000000000000000000000000000000000000000")+1:4
yb <- biginteger("10000000000000000000000000000000000000000")+3:6

bench::mark(check = FALSE,
         "list" = do.call(c, intersect(as.list(x), as.list(y))),
         "char" = as.bigz(intersect(as.character(x), as.character(y))),
         "dupli" = {. <- c(unique(x), unique(y)); .[duplicated(.)]},
         "loop" = {t1 <- unique(x); t2 <- unique(y); t1[sapply(t1, function(x) any(x == t2))]},
         "own" = {t1 <- as.character(x); t2 <- as.character(y);
           x[!duplicated(t1) & match(t1, t2, 0L) > 0L]},
         "own2" = {t1 <- unique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[match(t1, t2, 0L) > 0L])},
         "kitFmat" = {t1 <- kit::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "collFmat" = {t1 <- collapse::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "bignum" = biginteger(intersect(xb, yb)),
         "rcppIdx" = fun(x, y),
         )

结果

   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
 1 list       135.89µs 247.15µs     3147.        0B     2.03  1552     1
 2 char         15.4µs  20.79µs    29933.        0B    12.0   9996     4
 3 dupli       54.58µs  86.69µs     7428.      280B     6.18  3604     3
 4 loop        80.11µs 156.36µs     4198.    6.36KB     8.26  2032     4
 5 own         15.46µs  27.09µs    25254.        0B     5.05  9998     2
 6 own2        13.69µs  20.53µs    28952.        0B     8.69  9997     3
 7 kitFmat     13.23µs  20.57µs    30964.        0B     6.19  9998     2
 8 collFmat    16.76µs  21.76µs    26667.    2.49KB     5.33  9998     2
 9 bignum      32.12µs  48.51µs    10784.        0B     8.47  5090     4
10 rcppIdx      2.35µs   3.09µs   212933.    2.49KB     0    10000     0

在这里看起来像是转换为targue,后背是当前最快的方法。甚至子集bigz也比子集cartare慢慢地将其转换为bigz
RCPP中的版本比最快的其他方法快6-7倍。

Use bignum instead of gmp what returns a character after using intersect. And reconverting needs time.

library(bignum)

t1 <- biginteger(1:4)
t2 <- biginteger(3:6)
intersect(t1, t2)
#[1] "3" "4"

biginteger(intersect(t1, t2))
#<biginteger[2]>
#[1] 3 4

Store the bigz in a list instead as a vector.

library(gmp)

intersect(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 1 2

intersect(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 3
#
#[[2]]
#Big Integer ('bigz') :
#[1] 4

setdiff(as.list(as.bigz(1:4)), as.list(as.bigz(3:6)))
#[[1]]
#Big Integer ('bigz') :
#[1] 1
#
#[[2]]
#Big Integer ('bigz') :
#[1] 2

f3 <- as.list(as.bigz(c(3,5,9,6,4)))
f4 <- as.list(as.bigz(c(6,7,8,10,9)))
intersect(f3, f4)
#[[1]]
#Big Integer ('bigz') :
#[1] 9
#
#[[2]]
#Big Integer ('bigz') :
#[1] 6

Unfortunately this is much slower than converting it to character.

For intersect duplicated could be used. But this is also slower than converting to character.

. <- c(unique(as.bigz(1:4)), unique(as.bigz(3:6)))
.[duplicated(.)]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

And comparing each element with each other is also slower:

t1 <- unique(as.bigz(1:4))
t2 <- unique(as.bigz(3:6))
t1[sapply(t1, function(x) any(x == t2))]
#Big Integer ('bigz') object of length 2:
#[1] 3 4

Marginal faster in this case is:

t1 <- kit::funique(as.character(as.bigz(1:4)))
t2 <- as.character(as.bigz(3:6));
as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])
#Big Integer ('bigz') object of length 2:
#[1] 3 4

And a simple Rcpp versions.

library(Rcpp)
sourceCpp(code=r"(
#include <Rcpp.h>
#include <list>
#include <cstring>
using namespace Rcpp;
// [[Rcpp::export]]
RawVector fun(const RawVector &x, const RawVector &y) {
  std::list<uint32_t const*> xAdr;
  std::list<uint32_t const*> yAdr;
  const uint32_t *px = (const uint32_t *) &x[0];
  const uint32_t *py = (const uint32_t *) &y[0];
  uint32_t nextNr = 1;
  for(uint_fast32_t j=0; j<px[0]; ++j) {
    uint32_t n = px[nextNr] + 2;
    auto i = xAdr.cbegin();
    for(; i != xAdr.cend(); ++i) {
      if(std::memcmp(&px[nextNr], *i, n * sizeof(uint32_t)) == 0) break;
    }
    if(i == xAdr.cend()) xAdr.push_back(&px[nextNr]);
    nextNr += n;
  }
  nextNr = 1;
  for(uint_fast32_t j=0; j<py[0]; ++j) {
    yAdr.push_back(&py[nextNr]);
    nextNr += py[nextNr] + 2;;
  }
  uint32_t l=1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ) {
    auto i = yAdr.cbegin();
    for(; i != yAdr.cend(); ++i) {
      if(std::memcmp(*j, *i, (**j + 2) * sizeof(uint32_t)) == 0) {
        yAdr.erase(i);
        break;
      }
    }
    if(i == yAdr.cend()) j = xAdr.erase(j);
    else {l += **j + 2; ++j;}
  }
  RawVector res(Rcpp::no_init(l * sizeof(uint32_t)));
  res.attr("class") = "bigz";
  uint32_t *p = (uint32_t *) &res[0];
  p[0] = xAdr.size();
  nextNr = 1;
  for(auto j = xAdr.cbegin(); j != xAdr.cend(); ++j) {
    std::memcpy(&p[nextNr], *j, (**j + 2) * sizeof(uint32_t));
    nextNr += p[nextNr] + 2;
  }
  return res;
}
)")

fun(as.bigz(1:4), as.bigz(3:6))
#Big Integer ('bigz') object of length 2:
#[1] 3 4

Benchmark

library(gmp)
x <- as.bigz("10000000000000000000000000000000000000000")+1:4
y <- as.bigz("10000000000000000000000000000000000000000")+3:6

library(bignum)
xb <- biginteger("10000000000000000000000000000000000000000")+1:4
yb <- biginteger("10000000000000000000000000000000000000000")+3:6

bench::mark(check = FALSE,
         "list" = do.call(c, intersect(as.list(x), as.list(y))),
         "char" = as.bigz(intersect(as.character(x), as.character(y))),
         "dupli" = {. <- c(unique(x), unique(y)); .[duplicated(.)]},
         "loop" = {t1 <- unique(x); t2 <- unique(y); t1[sapply(t1, function(x) any(x == t2))]},
         "own" = {t1 <- as.character(x); t2 <- as.character(y);
           x[!duplicated(t1) & match(t1, t2, 0L) > 0L]},
         "own2" = {t1 <- unique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[match(t1, t2, 0L) > 0L])},
         "kitFmat" = {t1 <- kit::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "collFmat" = {t1 <- collapse::funique(as.character(x)); t2 <- as.character(y);
           as.bigz(t1[fastmatch::fmatch(t1, t2, 0L) > 0L])},
         "bignum" = biginteger(intersect(xb, yb)),
         "rcppIdx" = fun(x, y),
         )

Result

   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
 1 list       135.89µs 247.15µs     3147.        0B     2.03  1552     1
 2 char         15.4µs  20.79µs    29933.        0B    12.0   9996     4
 3 dupli       54.58µs  86.69µs     7428.      280B     6.18  3604     3
 4 loop        80.11µs 156.36µs     4198.    6.36KB     8.26  2032     4
 5 own         15.46µs  27.09µs    25254.        0B     5.05  9998     2
 6 own2        13.69µs  20.53µs    28952.        0B     8.69  9997     3
 7 kitFmat     13.23µs  20.57µs    30964.        0B     6.19  9998     2
 8 collFmat    16.76µs  21.76µs    26667.    2.49KB     5.33  9998     2
 9 bignum      32.12µs  48.51µs    10784.        0B     8.47  5090     4
10 rcppIdx      2.35µs   3.09µs   212933.    2.49KB     0    10000     0

Here it looks like that converting to character and back is currently the fastest way. Even subsetting bigz is slower than subsetting the character and converting it to bigz.
The Version in Rcpp is about 6-7 times faster than the fastest other method.

盛夏已如深秋| 2025-02-11 01:37:39

我可能弄错了,但是使用mprf对象提供了对基础r IntersectUnionsetDiff的访问权限。 ,而排序(...需要包装在mprf(sort(...),'bits')

library(Rmprf)

f3 <- mpfr(5:9, 53)
f4 <- mpfr(8:12, 53)

intersect(f3,f4)
2 'mpfr' numbers of precision  53   bits 
[1] 8 9

setdiff(f3,f4)
3 'mpfr' numbers of precision  53   bits 
[1] 5 6 7

f3 %in% f4
[1] FALSE FALSE FALSE  TRUE  TRUE

# large integers from vignette
ns <- mpfr(1:24, 120)
fact_ns <- factorial(ns)
fact_ns[20:24]
5 'mpfr' numbers of precision  120   bits 
[1]      2432902008176640000     51090942171709440000   1124000727777607680000
[4]  25852016738884976640000 620448401733239439360000

pasc80 <- chooseMpfr.all(n = 80, 77)[40:49]
pasc80
10 'mpfr' numbers of precision  77   bits 
 [1] 107507208733336176461620 104885081691059684352800  97393290141698278327600
 [4]  86068488962431036661600  72375774809317008101800  57900619847453606481440
 [7]  44054819449149483192400  31869443856831541032800  21910242651571684460050
[10]  14308729894903957198400

mpfr(sort(union(fact_ns[20:24], pasc80)), 77)
15 'mpfr' numbers of precision  77   bits 
 [1]      2432902008176640000     51090942171709440000   1124000727777607680000
 [4]  14308729894903957198400  21910242651571684460050  25852016738884976640000
 [7]  31869443856831541032800  44054819449149483192400  57900619847453606481440
[10]  72375774809317008101800  86068488962431036661600  97393290141698278327600
[13] 104885081691059684352800 107507208733336176461620 6204484017332394393600

因此,对于这些操作是不需要的,并且假设您的工作流程适合rmprf基于的对象

。 决定(尽管我在这里寻找一个)。

Demotes将

getPrec(f7)[1]
[1] 10
getPrec(f8)[1]
[1] 20
intersect(roundMpfr(f7, 20), f8)
2 'mpfr' numbers of precision  20   bits 
[1] 9 6
intersect(f7, roundMpfr(f8, 10))
2 'mpfr' numbers of precision  10   bits 
[1] 9 6

其最高/最低的“ Prec”设置为有意参与该 设置操作是需要处理的,尽管如果MPFR创建时,可以避免使用此类周期,默认值将使用相同的精度来渲染输入

library(OEIS.R) # git clone of EnriquePH/OEIS.R --no-build-vignettes
A011784 <- OEIS_bfile('A011784')
max(nchar(A011784$data$A011784))
[1] 221
max(nchar(A078140$data$A078140))
[1] 228

# so we see precision handling here, perhaps
A011784_228 <- mpfr(A011784$data$A011784, 228)
A078140_228 <- mpfr(A078140$data$A078140, 228)

intersect(A011784_228,A078140_228)
2 'mpfr' numbers of precision  228   bits 
[1] 1 3

。并不是说您的序列是在OEI中,而是检查与“野外”序列中的序列相似,这并不能反映您的工作流程。

至于使用列表:

is(A011784_bigz)
[1] "bigz"     "oldClass" "Mnumber"  "mNumber" 
> is(A011784_228)
[1] "mpfr"    "list"    "Mnumber" "mNumber" "vector"

因此,这些清单周期已经在MPFR创建中花费了。

以及一些相关的,光读数原始集合来自最近的新闻。

I've likely got this wrong, but using mprf objects provides access to base R intersect, union, setdiff, while a sort(... needs to be wrapped inside a mprf(sort(...), 'bits'):

library(Rmprf)

f3 <- mpfr(5:9, 53)
f4 <- mpfr(8:12, 53)

intersect(f3,f4)
2 'mpfr' numbers of precision  53   bits 
[1] 8 9

setdiff(f3,f4)
3 'mpfr' numbers of precision  53   bits 
[1] 5 6 7

f3 %in% f4
[1] FALSE FALSE FALSE  TRUE  TRUE

# large integers from vignette
ns <- mpfr(1:24, 120)
fact_ns <- factorial(ns)
fact_ns[20:24]
5 'mpfr' numbers of precision  120   bits 
[1]      2432902008176640000     51090942171709440000   1124000727777607680000
[4]  25852016738884976640000 620448401733239439360000

pasc80 <- chooseMpfr.all(n = 80, 77)[40:49]
pasc80
10 'mpfr' numbers of precision  77   bits 
 [1] 107507208733336176461620 104885081691059684352800  97393290141698278327600
 [4]  86068488962431036661600  72375774809317008101800  57900619847453606481440
 [7]  44054819449149483192400  31869443856831541032800  21910242651571684460050
[10]  14308729894903957198400

mpfr(sort(union(fact_ns[20:24], pasc80)), 77)
15 'mpfr' numbers of precision  77   bits 
 [1]      2432902008176640000     51090942171709440000   1124000727777607680000
 [4]  14308729894903957198400  21910242651571684460050  25852016738884976640000
 [7]  31869443856831541032800  44054819449149483192400  57900619847453606481440
[10]  72375774809317008101800  86068488962431036661600  97393290141698278327600
[13] 104885081691059684352800 107507208733336176461620 6204484017332394393600

so for these operations sets is not necessary, and assuming your workflow is amenable to Rmprf based objects.

As the problem is presented in the context of 'precision', one likely wouldn't want a function that promotes or demotes sets to their highest/lowest 'prec', but be intentionally involved in the decision (though, admittedly, I looked for one).

Here, renaming your f3 & f4 below to f7 & f8:

getPrec(f7)[1]
[1] 10
getPrec(f8)[1]
[1] 20
intersect(roundMpfr(f7, 20), f8)
2 'mpfr' numbers of precision  20   bits 
[1] 9 6
intersect(f7, roundMpfr(f8, 10))
2 'mpfr' numbers of precision  10   bits 
[1] 9 6

So it appears that 'precision handling' is required as to set operations, though such additional cycles may be avoided if it is plausible that upon mpfr creation, defaults would render the inputs the same precision. Using OEIS as inputs:

library(OEIS.R) # git clone of EnriquePH/OEIS.R --no-build-vignettes
A011784 <- OEIS_bfile('A011784')
max(nchar(A011784$data$A011784))
[1] 221
max(nchar(A078140$data$A078140))
[1] 228

# so we see precision handling here, perhaps
A011784_228 <- mpfr(A011784$data$A011784, 228)
A078140_228 <- mpfr(A078140$data$A078140, 228)

intersect(A011784_228,A078140_228)
2 'mpfr' numbers of precision  228   bits 
[1] 1 3

Ah, so little in common. And it is probably not that your sequences are in OEIS, rather checking for similarity to those from your sequences 'from the wild', and this doesn't reflect your workflow.

As to using lists:

is(A011784_bigz)
[1] "bigz"     "oldClass" "Mnumber"  "mNumber" 
> is(A011784_228)
[1] "mpfr"    "list"    "Mnumber" "mNumber" "vector"

So those as.list cycles have already been expended in mpfr creation.

And some related, light reading primitive sets from recent news.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文