Context
When iterating over a set of Rdata files (each containing a character vector of HTML code) that are loaded, analyzed (via XML functionality) and then removed from memory again, I experience a significant increase in an
R process' memory consumption (killing the process eventually).
It just seems like
- freeing objects via free(),
- removing them via rm() and
- running gc()
do not have any effect, so memory consumption accumulates until there is no memory left.
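The iteration pattern in question looks roughly like the following minimal sketch (the file path and the analysis step are placeholders for the actual repo code; obj is the character vector of HTML pages that each Rdata file contains):
library(XML)

rdata.files <- list.files("data/rdata", pattern = "\\.rdata$", full.names = TRUE)

for (f in rdata.files) {
    load(f)                                    # loads 'obj', a character vector of HTML pages
    for (x in seq_along(obj)) {
        html <- htmlTreeParse(obj[x], useInternalNodes = TRUE, addFinalizer = TRUE)
        # ... analyze the parsed document here ...
        free(html)                             # release the C-level document
        rm(html)                               # drop the R object holding the external pointer
    }
    rm(obj)
    gc()                                       # explicit garbage collection after each file
}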
EDIT 2012-02-13 23:30:00
Thanks to valuable insight shared by the author and maintainer of package XML, Duncan Temple Lang (again: I really appreciate it very much!), the problem seems to be closely related to the way external pointers are freed and how garbage collection is handled in the XML package. Duncan issued a bug-fixed version of the package (3.92-0) that consolidates certain aspects of parsing XML and HTML and features improved garbage collection, so it is no longer necessary to explicitly free the object containing the external pointer via free(). You can find the source code and a Windows binary at Duncan's Omegahat website.
EDIT 2012-02-13 23:34:00
Unfortunately, the new package version still does not seem to fix the issues I'm encountering in the little example that I've put together. I followed some suggestions and simplified the example a bit, making it easier to grasp and to find the relevant functions where things seem to go wrong (check functions ./lib/exampleRun.R and ./lib/scrape.R).
EDIT 2012-02-14 15:00:00
Duncan suggested trying to explicitly force the freeing of the parsed document via .Call("RS_XML_forceFreeDoc", html). I've included a logical switch in the example (do.forcefree in script ./scripts/memory.R) that, if set to TRUE, will do just that. Unfortunately, this made my R console crash. It'd be great if someone could verify this on their machine! Actually, the doc should be freed automatically when using the latest version of XML (see above). The fact that it isn't seems to be a bug (according to Duncan).
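In the script, the switch boils down to something like this minimal sketch (the surrounding loop and the parse call are simplified; do.forcefree is the flag from ./scripts/memory.R):
do.forcefree <- TRUE                      # flag set at line 22 of ./scripts/memory.R
html <- htmlParse(obj[1], asText = TRUE)
# ... scrape/process the parsed document ...
if (do.forcefree) {
    .Call("RS_XML_forceFreeDoc", html)    # explicitly release the parsed document's C-level memory
}
rm(html)
gc()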
EDIT 2012-02-14 23:12:00
Duncan pushed yet another version of XML (3.92-1) to his Omegahat website. This should fix the issue in general. However, I seem to be out of luck with my example as I still experience the same memory leakage.
EDIT 2012-02-17 20:39:00 > SOLUTION!
YES! Duncan found and fixed the bug! It was a little typo in a Windows-only script, which explains why the bug didn't show up in Linux, Mac OS etc. Check out the latest version 3.92-2! Memory consumption is now as constant as can be when iteratively parsing and processing XML files!
Special thanks again to Duncan Temple Lang and thanks to everyone else that responded to this question!
>>> LEGACY PARTS OF THE ORIGINAL QUESTION <<<
Example Instructions (edited 2012-02-14 15:00:00)
- Download folder 'memory' from my Github repo.
- Open up the script ./scripts/memory.R and set a) your working directory at line 6, b) the example scope at line 16 as well as c) whether or not to force the freeing of the parsed doc at line 22. Note that you can still find the old scripts; they are "tagged" by a "LEGACY" at the end of the filename.
- Run the script.
- Investigate the latest file ./memory_<TIMESTAMP>.txt to see the increase in logged memory states over time. I've included two text files that resulted from my own test runs.
Things I've done with respect to memory control
- Making sure a loaded object is removed again via rm() at the end of each iteration.
- When parsing XML files, setting the argument addFinalizer=TRUE, removing all R objects that hold a reference to the parsed XML doc before freeing the C pointer via free(), and then removing the object containing the external pointer (see the sketch after this list).
- Adding a gc() here and there.
- Trying to follow the advice in Duncan Temple Lang's notes on memory management when using his XML package (I have to admit, though, that I did not fully comprehend what's stated there).
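For illustration, those cleanup steps amount to roughly the following per-document pattern (a sketch only; the XPath query and the variable name nodes are hypothetical placeholders, not the repo's actual code):
html <- htmlTreeParse(obj[x], useInternalNodes = TRUE, addFinalizer = TRUE)
nodes <- getNodeSet(html, "//table")    # hypothetical query; 'nodes' holds references into the parsed doc
# ... extract plain R data (character/numeric vectors) from 'nodes' ...
rm(nodes)     # first drop every object that references the parsed doc ...
free(html)    # ... then release the C pointer ...
rm(html)      # ... and remove the object containing the external pointer
gc()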
EDIT 2012-02-13 23:42:00:
As I pointed out above, explicit calls to free() followed by rm() should not be necessary anymore, so I commented these calls out.
System Info
- Windows XP 32 Bit, 4 GB RAM
- Windows 7 32 Bit, 2 GB RAM
- Windows 7 64 Bit, 4 GB RAM
- R 2.14.1
- XML 3.9-4
- XML 3.92-0 as found at http://www.omegahat.org/RSXML/
Initial Findings as of 2012-02-09 01:00:00
- Running the webscraping scenario on several machines (see section "System Info" above) always busts the memory consumption of my R process after about 180 - 350 iterations (depending on OS and RAM).
- Running the plain rdata scenario yields constant memory consumption if and only if you set an explicit call to the garbage collector via gc() in each iteration; else you experience the same behavior as in the webscraping scenario.
Questions
- Any idea what's causing the memory increase?
- Any ideas how to work around this?
Findings as of 2012-02-13 23:44:00
Running the example in ./scripts/memory.R on several machines (see section "System Info" above) still busts the memory consumption of my R process after about 180 - 350 iterations (depending on OS and RAM).
There's still an evident increase in memory consumption and even though it may not appear to be that much when just looking at the numbers, my R processes always died at some point due to this.
Below, I've posted a couple of time series that resulted from running my example on a WinXP 32 Bit box with 2 GB RAM:
TS_1 (XML 3.9-4, 2012-02-09)
29.07
33.32
30.55
35.32
30.76
30.94
31.13
31.33
35.44
32.34
33.21
32.18
35.46
35.73
35.76
35.68
35.84
35.6
33.49
33.58
33.71
33.82
33.91
34.04
34.15
34.23
37.85
34.68
34.88
35.05
35.2
35.4
35.52
35.66
35.81
35.91
38.08
36.2
TS_2 (XML 3.9-4, 2012-02-09)
28.54
30.13
32.95
30.33
30.43
30.54
35.81
30.99
32.78
31.37
31.56
35.22
31.99
32.22
32.55
32.66
32.84
35.32
33.59
33.32
33.47
33.58
33.69
33.76
33.87
35.5
35.52
34.24
37.67
34.75
34.92
35.1
37.97
35.43
35.57
35.7
38.12
35.98
Error Message associated with TS_2
[...]
Scraping html page 30 of ~/data/rdata/132.rdata
Scraping html page 31 of ~/data/rdata/132.rdata
error : Memory allocation failed : growing buffer
error : Memory allocation failed : growing buffer
I/O error : write error
Scraping html page 32 of ~/data/rdata/132.rdata
Fehler in htmlTreeParse(file = obj[x.html], useInternalNodes = TRUE, addFinalizer = TRUE):
error in creating parser for (null)
> Synch18832464393836
TS_3 (XML 3.92-0, 2012-02-13)
20.1
24.14
24.47
22.03
25.21
25.54
23.15
23.5
26.71
24.6
27.39
24.93
28.06
25.64
28.74
26.36
29.3
27.07
30.01
27.77
28.13
31.13
28.84
31.79
29.54
32.4
30.25
33.07
30.96
33.76
31.66
34.4
32.37
35.1
33.07
35.77
38.23
34.16
34.51
34.87
35.22
35.58
35.93
40.54
40.9
41.33
41.6
Error Message associated with TS_3
[...]
---------- status: 31.33 % ----------
Scraping html page 1 of 50
Scraping html page 2 of 50
[...]
Scraping html page 36 of 50
Scraping html page 37 of 50
Fehler: 1: Memory allocation failed : growing buffer
2: Memory allocation failed : growing buffer
Edit 2012-02-17: please help me verify the counter value
You'd do me a huge favor if you could run the following code.
It won't take more than 2 minutes of your time.
All you need to do is
- Download an Rdata file and save it as seed.Rdata.
- Download the script containing my scraping function and save it as scrape.R.
- Source the following code after setting the working directory accordingly.
Code:
setwd("set/path/to/your/wd")
install.packages("XML", repos="http://www.omegahat.org/R")   # Duncan's patched build from Omegahat
library(XML)
source("scrape.R")
load("seed.rdata")                               # loads 'obj', a character vector of HTML pages
html <- htmlParse(obj[1], asText = TRUE)         # parse the first page
counter.1 <- .Call("R_getXMLRefCount", html)     # reference count right after parsing
print(counter.1)
z <- scrape(html)                                # apply the scraping function to the parsed doc
gc()
gc()
counter.2 <- .Call("R_getXMLRefCount", html)     # reference count after scraping and garbage collection
print(counter.2)
rm(html)
gc()
gc()
I'm particularly interested in the values of counter.1 and counter.2, which should be 1 in both calls. In fact, they are on all machines that Duncan has tested this on. However, as it turns out, counter.2 has the value 259 on all of my machines (see details above), and that's exactly what's causing my problem.
Comments (3)
From the XML package's webpage, it seems that the author, Duncan Temple Lang, has quite extensively described certain memory management issues. See this page: "Memory Management in the XML Package". Honestly, I'm not proficient in the details of what's going on here with your code and the package, but I think you'll either find the answer on that page, specifically in the section called "Problems", or in direct communication with Duncan Temple Lang.
Update 1. An idea that might work is to use the multicore and foreach packages (i.e. listResults = foreach(ix = 1:N) %dopar% {your processing; return(listElement)}). I think that for Windows you'll need doSMP, or maybe doRedis; under Linux, I use doMC. In any case, by parallelizing the loading, you'll get faster throughput. The reason I think you may get some benefit in terms of memory usage is that forking R could lead to different memory cleaning, as each spawned process gets killed when complete. This isn't guaranteed to work, but it could address both memory and speed issues. Note, though: doSMP has its own idiosyncrasies (i.e. you may still have some memory issues with it). There have been other Q&As on SO that mentioned some issues, but I'd still give it a shot.
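For what it's worth, a minimal sketch of that parallel pattern using foreach could look like this (the doParallel backend is used here only to keep the example self-contained, and processFile() is a hypothetical stand-in for the per-file load/parse/scrape work):
library(foreach)
library(doParallel)

cl <- makeCluster(2)              # spawn two worker processes
registerDoParallel(cl)

rdata.files <- list.files("data/rdata", pattern = "\\.rdata$", full.names = TRUE)

listResults <- foreach(f = rdata.files, .packages = "XML") %dopar% {
    processFile(f)                # hypothetical: load, parse and scrape one file, return plain R data
}

stopCluster(cl)                   # workers are terminated here, releasing their memory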
I've experienced similar issues with the XML package. The amount of memory being used by R was ballooning, to the point where my computer would crash. This answer solved my problem: I just set addFinalizer = F.
Here's a minimum reproducible example:
Memory usage before running anything else:
Memory usage after running the following:
Memory usage after removing addFinalizer = F (the default):
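A minimal sketch of the kind of reproducible example described in this answer (the toy HTML string and the iteration count are made-up placeholders, not the answer's original code):
library(XML)

doc.string <- "<html><body><p>hello</p></body></html>"    # toy document, purely illustrative

for (i in 1:10000) {
    html <- htmlTreeParse(doc.string, asText = TRUE, useInternalNodes = TRUE, addFinalizer = F)
    # ... extract something from 'html' here ...
    rm(html)
}
gc()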
@Rappster My R doesn't crash when I first check and make sure the XML doc exists and then call the C function for releasing the memory.