ColdFusion memory spike when reading a CSV file for a database import



We have a ColdFusion 9 script that runs regularly reading a CSV file and inserting the records into an Oracle 11g database. The CSV file has around 50 columns, 8 of which are used by CF (the CSV format cannot be modified). The general flow of the CFM is:

  1. Read file into variable
  2. CFLOOP using that variable as the list attribute with CHR(13) as a delimiter
  3. Call an Oracle stored procedure, inserting various values from the file using ListGetAt (sketched below)
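
As a rough sketch of that flow (the file path, datasource, procedure name, and column positions below are placeholders, not our real values):

```cfm
<!--- Sketch only: path, datasource, procedure name, and column positions
      are placeholders. --->
<cffile action="read" file="C:\imports\daily.csv" variable="csvData">

<cfloop list="#csvData#" index="line" delimiters="#chr(13)#">
    <cfstoredproc procedure="pkg_import.insert_record" datasource="oracleDSN">
        <!--- 8 of the ~50 columns are pulled out of each row by position --->
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 1)#">
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 7)#">
        <!--- ...six more cfprocparam tags for the remaining columns used --->
    </cfstoredproc>
</cfloop>
```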

The stored procedure does the following:

  1. Inserts a record with 2 fields into table 1
  2. Inserts a record with 8 fields (including table 1's primary key) into table 2
  3. Returns nothing

This runs successfully in most cases, reading 400 KB files with hundreds of records in just a couple of seconds. However, occasionally we get a large volume and end up with a 13k-record, 5 MB file. When we try to process a file this large, I watch the JVM memory usage go from 90 MB up to around 680 MB over a 10-15 second period, after which the CF Server Monitor stops responding (as does CF), forcing us to restart the service. Logs report a JVM out-of-memory error:

"Error","qtp4795249-38798","12/28/11","16:29:20",,"GC overhead limit exceeded"
java.lang.OutOfMemoryError: GC overhead limit exceeded

Our JVM heap size is currently 768 MB. I haven't tried increasing it because, even if that does resolve this issue, it doesn't protect us in the future, and the rest of the server's normal load doesn't require nearly that much. I also hesitate to play too much with JVM settings that require a restart to take effect on a production box.

This is difficult to test as the import process runs fine with barely any noticeable memory load on my local dev machine and on our QA box, but both of those have a much slower connection to the database and take 10-15 minutes to complete.

I would appreciate any thoughts, especially about where the memory is going. I cannot figure out how 5 MB of data gets turned into 700 MB of data. We do have debugging information turned on, but the calling script's IP address is not in the debugging list, and I used a cfsetting tag to turn debugging off for this page. Previously there was a step 1.5 that turned the CSV data into a ColdFusion query, but I eliminated that in an attempt to improve efficiency. Both ways result in an OOM error.
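
For reference, the per-page debug switch mentioned above is just cfsetting's showDebugOutput attribute; the timeout here is only an illustrative extra for a long-running import, not our actual setting:

```cfm
<!--- Turn off debug output for this request only; the timeout value is an
      illustrative placeholder, not a recommendation. --->
<cfsetting showdebugoutput="no" requesttimeout="900">
```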


Comments (4)

你穿错了嫁妆 2024-12-30 04:34:18


Have you considered a direct import into the database? For MySQL this is LOAD DATA INFILE; for SQL Server it is BULK INSERT. If you need some additional processing, then a possible approach would be loading the data into a temporary table and then handling it with CFML, which can easily be done in batches for heavy processing.
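
For the Oracle database in the question, the rough equivalent would be SQL*Loader or an external table. A minimal sketch of the staging-table idea (all object and datasource names are made up):

```cfm
<!--- Sketch only: assumes the CSV has already been bulk-loaded into a
      staging table (for Oracle, typically via SQL*Loader or an external
      table). Table, column, and datasource names are hypothetical. --->
<cfquery datasource="oracleDSN">
    INSERT INTO import_target (col_a, col_b, col_c)
    SELECT col_a, col_b, col_c      -- only the handful of columns CF needs
    FROM   csv_staging
</cfquery>
<!--- Any per-row processing can then be done against csv_staging from CFML
      in batches, instead of looping over the raw file in memory. --->
```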

沉鱼一梦 2024-12-30 04:34:18


Rather than read the whole file into memory before you start processing it, loop through the file contents by reading it one line at a time. Ben Nadel (of course) has a nice blog post discussing this technique:

Reading In File Data One Line At A Time Using ColdFusion's CFLoop Tag Or Java's LineNumberReader

Also see the livedocs for cfloop, specifically the file attribute.
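
A minimal sketch of that technique using cfloop's file attribute, available since CF8 (path, datasource, procedure name, and column positions are placeholders):

```cfm
<!--- Only the current line is held in memory at any time. Path, datasource,
      procedure name, and column positions are placeholders. --->
<cfloop file="C:\imports\daily.csv" index="line">
    <cfif len(trim(line))>
        <cfstoredproc procedure="pkg_import.insert_record" datasource="oracleDSN">
            <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 1)#">
            <!--- ...remaining cfprocparam tags for the other columns used --->
        </cfstoredproc>
    </cfif>
</cfloop>
```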

狼性发作 2024-12-30 04:34:18


We have a CF app which imports Real Estate MLS listings and ran into similar issues. The core file we work with is pushing 100MB, and reading it in and looping over it at once created lots of problems. We ended up doing a few things:

  1. Split the file up into chunks. The import process uses the split.exe utility from cygwin to split the file into 4,000 line chunks. We then use CFDIRECTORY to get a list of the chunks and process them one at a time.

  2. For each chunk, we read it in and then split the file contents into an array (using listToArray() with chr(13) as the delimiter).

  3. We loop from 1 to arrayLen(chunkArray) rather than looping over the file contents directly. This was done more for speed than anything else. Within that loop we also break each row into an array. We found that doing this and accessing the values as thisRow[i] (where i is the column number in the file) was much faster than repeatedly calling listGetAt(). The files we're importing have 90+ columns. (See the sketch after this list.)

  4. We increased the JVM memory allowance. Our servers are pretty busy as it is, and this was adding some overhead. We ended up pushing the JVM as high as it would go for a 32-bit server (about 2GB) so that memory was available when needed.
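
A rough sketch of steps 2 and 3 above, with a made-up chunk file name and column positions:

```cfm
<!--- Sketch of processing one 4,000-line chunk; names and positions are
      hypothetical. includeEmptyFields=true (CF9+) keeps empty CSV fields
      so column positions stay stable. --->
<cffile action="read" file="C:\imports\chunks\chunk_aa" variable="chunkData">
<cfset chunkArray = listToArray(chunkData, chr(13))>

<cfloop from="1" to="#arrayLen(chunkArray)#" index="r">
    <cfset thisRow = listToArray(chunkArray[r], ",", true)>
    <cfset listingId = thisRow[1]>
    <cfset listPrice = thisRow[12]>
    <!--- ...insert the row, or queue it for a batched insert --->
</cfloop>
```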

木緿 2024-12-30 04:34:18


Also look at not running a query inside the loop. One of the biggest costs of executing queries is simply making the DB connection.

Any time I have to do multiple inserts like this from a file, I build each SQL INSERT statement and save them in a variable, separated by semicolons. I then execute all of those statements at once, every 100 statements.

I had to rewrite some procedures from another programmer this way and was able to cut the processing time by 90%. This was in version 6, so connection sharing may have improved things since then.
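
Here's a rough sketch of that batching idea in CFML. Whether a single cfquery may contain multiple semicolon-separated statements depends on the driver and datasource settings; with Oracle, a common workaround is to wrap the batch in an anonymous PL/SQL block, as below (table, column, and datasource names are made up):

```cfm
<!--- Sketch only. Values are interpolated directly to mirror the approach
      described; real code should validate/escape them, since cfqueryparam
      cannot be used inside a dynamically concatenated batch like this. --->
<cfset sqlBuffer = []>
<cfloop file="C:\imports\daily.csv" index="line">
    <cfset arrayAppend(sqlBuffer,
        "INSERT INTO table2 (col_a, col_b) VALUES ('#listGetAt(line, 1)#', '#listGetAt(line, 2)#');")>
    <cfif arrayLen(sqlBuffer) GTE 100>
        <cfquery datasource="oracleDSN">
            BEGIN
                #preserveSingleQuotes(arrayToList(sqlBuffer, chr(10)))#
            END;
        </cfquery>
        <cfset sqlBuffer = []>
    </cfif>
</cfloop>
<!--- A final flush of any statements left in sqlBuffer is omitted for brevity. --->
```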
