How to force an error if non-finite values (NA, NaN, or Inf) are encountered
There's a conditional debugging flag I miss from Matlab: dbstop if infnan. If set, this condition stops code execution when an Inf or NaN is encountered (IIRC, Matlab doesn't have NAs).
How might I achieve this in R in a more efficient manner than testing all objects after every assignment operation?
At the moment, the only ways I see to do this are via hacks like the following:
- Manually insert a test after every place where these values might be encountered (e.g. a division, where division by 0 may occur). The test would apply is.finite(), as described in another Q&A, to every element.
- Use body() to modify the code so that it calls a separate function after each operation, or possibly just after each assignment, which tests all of the objects (and possibly all objects in all environments).
- Modify R's source code (?!?)
- Attempt to use tracemem to identify those variables that have changed, and check only these for bad values.
- (New, see Note 2) Use some kind of call handlers / callbacks to invoke a test function.
The 1st option is what I am doing at present. This is tedious, and I can't guarantee I've checked everything. The 2nd option will test everything, even if an object hasn't been updated, which is a massive waste of time. The 3rd option would involve modifying assignments of NA, NaN, and infinite values (+/- Inf) so that an error is produced. That seems better left to R Core. The 4th option is like the 2nd: I'd need a call to a separate function listing all of the memory locations, just to identify those that have changed, and then check the values. I'm not even sure this will work for all objects, as a program may do an in-place modification, which seems like it would not invoke the duplicate function.
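For reference, the 1st option (manual checks) might look something like the sketch below. The helper name assertFinite() and the wrapper safeDiv() are made up for illustration; neither is part of base R or of any package mentioned here.

```r
# Hypothetical helper (not part of base R): stop with an informative
# message if a numeric object contains NA, NaN, Inf or -Inf.
assertFinite <- function(x, name = deparse(substitute(x))) {
  if (is.numeric(x) && !all(is.finite(x))) {
    stop("non-finite value in ", name, call. = FALSE)
  }
  invisible(x)
}

# Illustrative function with a manual check after the risky operation
safeDiv <- function(x, y) {
  z <- x / y
  assertFinite(z)  # this is the check that has to be repeated by hand
  z
}
```

Every risky operation needs its own assertFinite() call, which is exactly why this approach is tedious and easy to get wrong.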
Is there a better approach that I'm missing? Maybe some clever tool by Mark Bravington or Luke Tierney, or something relatively basic, akin to an options() parameter or a flag when compiling R?
Example code
Here is some very simple example code to test with, incorporating the addTaskCallback function proposed by Josh O'Brien. The code isn't interrupted, but an error does occur in the first scenario, while no error occurs in the second case (i.e. badDiv(0,0,FALSE) doesn't abort). I'm still investigating callbacks, as this looks promising.
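    # stopOnNaNs() is the task callback described in Josh O'Brien's answer
    badDiv <- function(x, y, flag) {
      z <- x / y
      if (flag) {
        return(z)
      } else {
        return(FALSE)
      }
    }

    addTaskCallback(stopOnNaNs)
    badDiv(0, 0, TRUE)    # returns NaN, so the callback raises an error

    addTaskCallback(stopOnNaNs)  # re-register: the callback is dropped after it errors
    badDiv(0, 0, FALSE)   # NaN is computed internally but FALSE is returned: no error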
Note 1. I'd be satisfied with a solution for standard R operations, though a lot of my calculations involve objects used via data.table or bigmemory (i.e. disk-based memory-mapped matrices). These appear to have somewhat different memory behaviors than standard matrix and data.frame operations.
Note 2. The callbacks idea seems a bit more promising, as it doesn't require me to write functions that mutate R code, e.g. via the body() idea.
Note 3. I don't know whether there is some simple way to test for the presence of non-finite values, e.g. meta information about objects that indexes where NAs, Infs, etc. are stored in the object, or whether these are simply stored in place. So far, I've tried Simon Urbanek's inspect package, and have not found a way to divine the presence of non-finite values.
Follow-up: Simon Urbanek has pointed out in a comment that such information is not available as meta information for objects.
Note 4. I'm still testing the ideas presented. Also, as suggested by Simon, testing for the presence of non-finite values should be fastest in C/C++; that should surpass even compiled R code, but I'm open to anything. For large datasets, e.g. on the order of 10-50GB, this should be a substantial saving over copying the data. One may get further improvements via use of multiple cores, but that's a bit more advanced.
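As a rough illustration of the memory concern in Note 4: calling is.finite() on a whole vector allocates a logical vector of the same length. A pure-R workaround (not the C/C++ approach Simon suggests) is to test the vector in chunks, so that only chunk-sized temporaries are allocated. The function name allFiniteChunked() and the default chunk size are assumptions made for this sketch.

```r
# Hypothetical helper: TRUE if every element of x is finite.
# Works through x in pieces so temporaries stay chunk-sized.
allFiniteChunked <- function(x, chunk = 1e6) {
  n <- length(x)
  start <- 1
  while (start <= n) {
    end <- min(start + chunk - 1, n)
    # stop at the first chunk that contains NA, NaN, Inf or -Inf
    if (!all(is.finite(x[start:end]))) return(FALSE)
    start <- end + 1
  }
  TRUE
}
```

This still scans every element, so it only addresses the extra allocation, not the cost of the scan itself.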
Answers (2)
The idea sketched below (and its implementation) is very imperfect. I'm hesitant to even suggest it, but: (a) I think it's kind of interesting, even in all of its ugliness; and (b) I can think of situations where it would be useful. Given that it sounds like you are right now manually inserting a check after each computation, I'm hopeful that your situation is one of those.
Mine is a two-step hack. First, I define a function nanDetector(), which is designed to detect NaNs in several of the object types that might be returned by your calculations. Then I use addTaskCallback() to call nanDetector() on .Last.value after each top-level task/calculation is completed. When it finds a NaN in one of those returned values, it throws an error, which you can use to avoid any further computations.
Among its shortcomings:
- If you do something like setting options(error = recover), it's hard to tell where the error was triggered, since the error is always thrown from inside of stopOnNaNs().
- When it throws an error, stopOnNaNs() is terminated before it can return TRUE. As a consequence, it is removed from the task list, and you'll need to reset it with addTaskCallback(stopOnNaNs) if you want to use it again. (See the 'Arguments' section of ?addTaskCallback for more details.)
Without further ado, here it is:
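The answer's original code is not reproduced in this copy. The following is a minimal sketch of the approach described above, not the author's exact implementation. It assumes the callback can use the value argument that addTaskCallback() passes to it (the result of the top-level task) instead of reading .Last.value directly.

```r
# Sketch: TRUE if x (a vector, matrix, data frame or nested list) holds a NaN
nanDetector <- function(x) {
  if (is.list(x)) {
    # data frames and (nested) lists: check each element in turn
    return(any(vapply(x, nanDetector, logical(1))))
  }
  if (is.numeric(x)) {
    # vectors, matrices and arrays
    return(any(is.nan(x)))
  }
  FALSE
}

# Task callbacks receive four arguments: expr, value, ok, visible
stopOnNaNs <- function(expr, value, ok, visible) {
  if (nanDetector(value)) stop("NaNs detected!", call. = FALSE)
  TRUE  # returning TRUE keeps the callback registered
}

id <- addTaskCallback(stopOnNaNs, name = "stopOnNaNs")
```

Only the value returned by each top-level call is inspected, so a NaN that is computed inside a function but never returned goes unnoticed. To unregister the callback, use removeTaskCallback(id).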
I fear there is no such shortcut. In theory, on unix there is SIGFPE that you could trap on, but in practice:
- enabling floating-point traps is system-specific (e.g. feenableexcept on Linux, fp_enable_all on AIX, etc.) or requires the use of assembler for your target CPU;
- R intercepts some operations on NaNs and NAs and handles them separately, so they won't make it to the FP code.
That said, you could hack yourself an R that will catch some exceptions for your platform and CPU if you tried hard enough (disable SSE etc.). It is not something we would consider building into R, but for a special purpose it may be doable.
However, it would still not catch NaN/NA operations unless you change R internal code. In addition, you would have to check every single package you are using, since they may be using FP operations in their C code and may also handle NA/NaN separately.
If you are only worried about things like division by zero or over/underflows, the above will work and is probably the closest to something like a solution.
Just checking your results may not be very reliable, because you don't know whether a result is based on some intermediate NaN calculation that changed an aggregated value which does not itself end up as NaN. If you are willing to disregard such cases, then you could simply walk recursively through your result objects or the workspace. That should not be extremely inefficient, because you only need to worry about REALSXP and not anything else (unless you don't like NAs either, in which case you'd have more work).
Here is example code that could be used to traverse an R object recursively:
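The code from the original answer is not reproduced in this copy. As a stand-in, here is a minimal sketch of the same idea written in R rather than C, so it will be slower than a compiled version. The names anyBadDouble() and badObjects() are invented for this example. It looks only at double vectors (REALSXP), descends into lists and data frames, and deliberately ignores NA, attributes, complex vectors and nested environments.

```r
# Sketch: TRUE if x contains NaN or +/-Inf anywhere in its double vectors
anyBadDouble <- function(x) {
  if (is.double(x)) {
    return(any(is.nan(x) | is.infinite(x)))
  }
  if (is.list(x)) {
    # lists and data frames: recurse into each element
    for (el in x) {
      if (anyBadDouble(el)) return(TRUE)
    }
  }
  FALSE
}

# Sketch: names of the objects in an environment that fail the check
badObjects <- function(env = globalenv()) {
  nms <- ls(env, all.names = TRUE)
  bad <- vapply(nms, function(nm) anyBadDouble(get(nm, envir = env)),
                logical(1))
  nms[bad]
}
```

Calling badObjects() after a computation lists the workspace objects that currently hold a NaN or infinite value. As noted above, it cannot tell you about intermediate values that never reached a stored object.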