How to force an error if non-finite values (NA, NaN, or Inf) are encountered

Posted on 2025-01-03 04:50:16

There's a conditional debugging flag I miss from Matlab: dbstop if infnan. If set, this condition stops code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).

How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?

At the moment, the only ways I see to do this are via hacks like the following:

1. Manually insert a test after all places where these values might be encountered (e.g. a division, where division by 0 may occur). The test would apply is.finite() to every element.
2. Use body() to modify the code to call a separate function, after each operation or possibly just each assignment, which tests all of the objects (and possibly all objects in all environments).
3. Modify R's source code (?!?)
4. Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
5. (New - see Note 2) Use some kind of call handlers / callbacks to invoke a test function.

The 1st option is what I am doing at present. This is tedious, because I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated. That is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf), so that an error is produced. That seems like it's better left to R Core. The 4th option is like the 2nd - I'd need a call to a separate function listing all of the memory locations, just to ID those that have changed, and then check the values; I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.
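For reference, the 1st option can be sketched as a small helper that wraps each risky expression. This is only an illustration: the names assertFinite and safeDiv are made up here and are not part of base R or of the question's code.

```r
# Hypothetical helper: stop as soon as a numeric result contains
# NA, NaN, Inf or -Inf (is.finite() is FALSE for all of them).
assertFinite <- function(x, what = deparse(substitute(x))) {
    if (is.numeric(x) && !all(is.finite(x))) {
        stop("non-finite value in ", what, call. = FALSE)
    }
    invisible(x)
}

# Hypothetical use: guard a division that may hit 0 / 0 or 1 / 0
safeDiv <- function(x, y) assertFinite(x / y, "x / y")

safeDiv(4, 2)
```

The tedium described above is visible here: every risky expression has to be wrapped by hand, and nothing flags the places that were missed.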

Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington, Luke Tierney, or something relatively basic - something akin to an options() parameter or a flag when compiling R?

Example code

Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. The code isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second case (i.e. badDiv(0,0,FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.

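# stopOnNaNs() is the task callback defined in Josh O'Brien's answer below
badDiv <- function(x, y, flag) {
    z <- x / y                # 0 / 0 gives NaN, without any error or warning
    if (isTRUE(flag)) {
        return(z)             # the NaN becomes the top-level value
    } else {
        return(FALSE)         # the NaN stays hidden inside the function
    }
}

addTaskCallback(stopOnNaNs)
badDiv(0, 0, TRUE)

# Re-register: a callback that throws an error is removed from the task list
addTaskCallback(stopOnNaNs)
badDiv(0, 0, FALSE)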
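The difference between the two calls can be seen without any callback, simply by inspecting what each call hands back to the top level. A minimal sketch, repeating badDiv() so that it runs on its own:

```r
badDiv <- function(x, y, flag) {
    z <- x / y                       # NaN for 0 / 0 in both calls
    if (isTRUE(flag)) z else FALSE
}

# Only the returned value is visible at top level, and therefore only
# the returned value can be seen by a task callback.
is.nan(badDiv(0, 0, TRUE))    # TRUE
is.nan(badDiv(0, 0, FALSE))   # FALSE
```

In the second call the NaN exists only as an intermediate value inside the function, so a check that looks at top-level results has nothing to find.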

Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.

Note 2. The callbacks idea seems a bit more promising, as this doesn't require me to write functions that mutate R code, e.g. via the body() idea.

Note 3. I don't know whether or not there is some simple way to test the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or if these are stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-numeric values.

Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.

Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial savings over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.
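Short of writing C, one way to limit the copying mentioned in Note 4 is to scan a vector in chunks, so that only chunk-sized temporaries are allocated and the scan stops at the first bad chunk. This is a sketch only: anyNonFiniteChunked and the chunk size are invented for illustration, and whether the same idea carries over to bigmemory objects has not been checked.

```r
# Hypothetical helper: TRUE if x contains NA, NaN, Inf or -Inf.
# Each iteration copies at most `chunk` elements instead of building
# one logical vector as long as x.
anyNonFiniteChunked <- function(x, chunk = 1e6) {
    n <- length(x)
    start <- 1
    while (start <= n) {
        end <- min(start + chunk - 1, n)
        if (!all(is.finite(x[start:end]))) return(TRUE)
        start <- end + 1
    }
    FALSE
}

anyNonFiniteChunked(c(1, 2, Inf, 4), chunk = 2)
```

A compiled loop avoids even the chunk-sized copies, which is why the C route is expected to be the fastest.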

少女的英雄梦 2025-01-10 04:50:16

The idea sketched below (and its implementation) is very imperfect. I'm hesitant to even suggest it, but: (a) I think it's kind of interesting, even in all of its ugliness; and (b) I can think of situations where it would be useful. Given that it sounds like you are right now manually inserting a check after each computation, I'm hopeful that your situation is one of those.

Mine is a two-step hack. First, I define a function nanDetector() which is designed to detect NaNs in several of the object types that might be returned by your calculations. Then I use addTaskCallback() to call nanDetector() on .Last.value after each top-level task/calculation is completed. When it finds a NaN in one of those returned values, it throws an error, which you can use to avoid any further computations.

Among its shortcomings:

- If you do something like setting options(error = recover), it's hard to tell where the error was triggered, since the error is always thrown from inside of stopOnNaNs().
- When it throws an error, stopOnNaNs() is terminated before it can return TRUE. As a consequence, it is removed from the task list, and you'll need to reset it with addTaskCallback(stopOnNaNs) if you want to use it again. (See the 'Arguments' section of ?addTaskCallback for more details.)
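Both shortcomings can be softened by a variant that reads the arguments R passes to every task callback (expr, value, ok, visible) and warns instead of stopping. This is a sketch and not the approach used in this answer: warnOnNaNs is a made-up name, and how a warning raised inside a callback is displayed by a particular front end is an assumption that should be checked.

```r
# Hypothetical variant of stopOnNaNs(): report the offending top-level
# expression and stay registered.
warnOnNaNs <- function(expr, value, ok, visible) {
    if (is.numeric(value) && any(is.nan(value))) {
        warning("NaNs detected in result of: ",
                paste(deparse(expr), collapse = " "), call. = FALSE)
    }
    TRUE   # returning TRUE keeps the callback in the task list
}

# Register and later remove it (commented out so that sourcing this
# sketch has no side effects):
# id <- addTaskCallback(warnOnNaNs)
# removeTaskCallback(id)
```

The trade-off is that execution is no longer halted, so this suits logging where NaNs first appear more than aborting a long run.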

Without further ado, here it is:

# Sketch of a function that tests for NaNs in several types of objects
nanDetector <- function(X) {
   # To examine data frames
   if(is.data.frame(X)) { 
       return(any(unlist(sapply(X, is.nan))))
   }
   # To examine vectors, matrices, or arrays
   if(is.numeric(X)) {
       return(any(is.nan(X)))
   }
   # To examine lists, including nested lists
   if(is.list(X)) {
       return(any(rapply(X, is.nan)))
   }
   return(FALSE)
}

# Set up the taskCallback
stopOnNaNs <- function(...) {
    if(nanDetector(.Last.value)) {stop("NaNs detected!\n")}
    return(TRUE)
}
addTaskCallback(stopOnNaNs)


# Try it out: each line is one top-level task, so the callback runs after each
j <- 1:100
y <- rnorm(99)
l <- list(a=1:4, b=list(j=1:4, k=NaN))
# Error in function (...)  : NaNs detected!

# Subsequent time consuming code that could be avoided if the
# error thrown above is used to stop its evaluation.
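nanDetector() looks only for NaN, while the question also mentions NA and Inf. A possible extension, sketched here under the assumption that every numeric leaf should be finite, swaps is.nan() for is.finite(); the name hasNonFinite is invented for this example.

```r
# Hypothetical counterpart to nanDetector() that flags NA, NaN, Inf, -Inf.
hasNonFinite <- function(X) {
    # Non-numeric leaves (character, factor, NULL, ...) are ignored
    check <- function(v) is.numeric(v) && !all(is.finite(v))
    if (typeof(X) == "list") {
        # Covers plain lists, nested lists and data frames
        return(any(unlist(rapply(X, check, how = "unlist"))))
    }
    check(X)
}

hasNonFinite(list(a = 1:4, b = list(j = 1:4, k = Inf)))
```

It could replace nanDetector() inside stopOnNaNs() unchanged, since both take one object and return a single logical.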

梦里南柯 2025-01-10 04:50:16

I fear there is no such shortcut. In theory, on unix there is SIGFPE that you could trap on, but in practice

- there is no standard way to enable FP operations to trap it (even C99 doesn't include a provision for that) - it is highly system-specific (e.g. feenableexcept on Linux, fp_enable_all on AIX etc.) or requires the use of assembler for your target CPU,
- FP operations are nowadays often done in vector units like SSE, so you can't even be sure that the FPU is involved, and
- R intercepts some operations on things like NaNs and NAs and handles them separately, so they won't make it to the FP code.

That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.

However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using since they may be using FP operations in their C code and may also handle NA/NaN separately.

If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.
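For comparison, here is what plain R signals on its own, with no trapping at the FP level. This sketch relies only on base R behavior: 0 / 0 and 1 / 0 are silent, while sqrt(-1) raises a warning that options(warn = 2) turns into an error.

```r
# Division never signals anything: the result is simply NaN or Inf
quiet_nan <- 0 / 0
quiet_inf <- 1 / 0

# Some math functions warn when they produce NaN
noisy <- tryCatch(sqrt(-1), warning = function(w) "warned")

# With warn = 2 every warning is converted to an error
old <- options(warn = 2)
res <- tryCatch(sqrt(-1), error = function(e) "error")
still_quiet <- 0 / 0          # still no error, even with warn = 2
options(old)                  # restore the previous setting
```

So options(warn = 2) catches only the operations that already warn, and division by zero is not among them.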

Just checking your results may not be very reliable, because you don't know whether a result is based on some intermediate NaN calculation that changed an aggregated value which is not itself NaN. If you are willing to disregard such cases, then you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and not anything else (unless you don't like NAs either - then you'd have more work).
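Walking the workspace can be tried in plain R before reaching for C. A rough sketch, assuming that only numeric leaves matter; findNonFinite is a made-up name, and this version will be slower than a compiled traversal.

```r
# Hypothetical helper: names of the objects in `env` that contain at
# least one NA, NaN, Inf or -Inf somewhere in their numeric data.
findNonFinite <- function(env = globalenv()) {
    check <- function(v) is.numeric(v) && !all(is.finite(v))
    bad <- vapply(ls(env), function(nm) {
        obj <- get(nm, envir = env)
        if (typeof(obj) == "list") {
            any(unlist(rapply(obj, check)))   # lists and data frames
        } else {
            check(obj)                        # atomic vectors, matrices, arrays
        }
    }, logical(1))
    names(bad)[bad]
}

# Example on a scratch environment
e <- new.env()
assign("ok", 1:3, envir = e)
assign("bad", c(1, NaN), envir = e)
findNonFinite(e)
```

As noted above, this finds only non-finite values that survive into stored objects, not intermediate ones.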

This is example code that could be used to traverse an R object recursively:

#include <R.h>
#include <Rinternals.h>

/* Returns 1 if every double reachable from x is finite, 0 otherwise.
   R_xlen_t / XLENGTH are used so that long vectors are handled too. */
static int do_isFinite(SEXP x) {
    /* recurse into generic vectors (lists) */
    if (TYPEOF(x) == VECSXP) {
        R_xlen_t n = XLENGTH(x);
        for (R_xlen_t i = 0; i < n; i++)
            if (!do_isFinite(VECTOR_ELT(x, i))) return 0;
    }
    /* recurse into pairlists */
    if (TYPEOF(x) == LISTSXP) {
         while (x != R_NilValue) {
             if (!do_isFinite(CAR(x))) return 0;
             x = CDR(x);
         }
         return 1;
    }
    /* I wouldn't bother with attributes except for S4
       where attributes are slots */
    if (IS_S4_OBJECT(x) && !do_isFinite(ATTRIB(x))) return 0;
    /* check reals */
    if (TYPEOF(x) == REALSXP) {
        R_xlen_t n = XLENGTH(x);
        double *d = REAL(x);
        for (R_xlen_t i = 0; i < n; i++) if (!R_finite(d[i])) return 0;
    }
    return 1;
}

/* Entry point for .Call(): wraps the result in a length-1 logical */
SEXP isFinite(SEXP x) { return ScalarLogical(do_isFinite(x)); }

# in R: .Call("isFinite", x)
