ColdFusion memory spike when reading a CSV file for a database import



We have a ColdFusion 9 script that runs regularly reading a CSV file and inserting the records into an Oracle 11g database. The CSV file has around 50 columns, 8 of which are used by CF (the CSV format cannot be modified). The general flow of the CFM is:

  1. Read file into variable
  2. CFLOOP using that variable as the list attribute with CHR(13) as a delimiter
  3. Call an Oracle stored procedure, inserting various values from the file using ListGetAt (sketched below)
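
As a rough sketch of that flow (the file path, datasource, procedure name, and column positions below are placeholders, not our real values):

```cfm
<!--- Sketch only: path, datasource, procedure name, and column positions
      are placeholders. --->
<cffile action="read" file="C:\imports\daily.csv" variable="csvData">

<cfloop list="#csvData#" index="line" delimiters="#chr(13)#">
    <cfstoredproc procedure="pkg_import.insert_record" datasource="oracleDSN">
        <!--- 8 of the ~50 columns are pulled out of each row by position --->
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 1)#">
        <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 7)#">
        <!--- ...six more cfprocparam tags for the remaining columns used --->
    </cfstoredproc>
</cfloop>
```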

The stored procedure does the following:

  1. Inserts a record with 2 fields into table 1
  2. Inserts a record with 8 fields (including table 1's primary key) into table 2
  3. Returns nothing

This runs successfully in most cases, reading 400 KB files with hundreds of records in just a couple of seconds. However, occasionally we get a large volume and end up with a 13k-record, 5 MB file. When we try to process a file this large, I watch the JVM memory usage go from 90 MB up to around 680 MB over a 10-15 second period, after which the CF Server Monitor stops responding (as does CF), forcing us to restart the service. Logs report a JVM out-of-memory error:

"Error","qtp4795249-38798","12/28/11","16:29:20",,"GC overhead limit exceeded"
java.lang.OutOfMemoryError: GC overhead limit exceeded

Our JVM heap size is currently 768 MB. I haven't tried increasing it because, even if that does resolve this issue, it doesn't protect us in the future, and the rest of the server's normal load doesn't require nearly that much. I also hesitate to play too much with JVM settings that require a restart to take effect on a production box.

This is difficult to test as the import process runs fine with barely any noticeable memory load on my local dev machine and on our QA box, but both of those have a much slower connection to the database and take 10-15 minutes to complete.

I would appreciate any thoughts, especially about where the memory is going. I cannot figure out how 5 MB of data gets turned into 700 MB of data. We do have debugging information turned on, but the calling script's IP address is not in the debugging list, and I used a cfsetting tag to turn debugging off for this page. Previously there was a step 1.5 that turned the CSV data into a ColdFusion query, but I eliminated that in an attempt to improve efficiency. Both ways result in an OOM error.
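
For reference, the per-page debug switch mentioned above is just cfsetting's showDebugOutput attribute; the timeout here is only an illustrative extra for a long-running import, not our actual setting:

```cfm
<!--- Turn off debug output for this request only; the timeout value is an
      illustrative placeholder, not a recommendation. --->
<cfsetting showdebugoutput="no" requesttimeout="900">
```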


Comments (4)

你穿错了嫁妆 2024-12-30 04:34:18


Have you considered a direct import into the database? For MySQL this is LOAD DATA INFILE; for SQL Server it is BULK INSERT. If you need some additional processing, then a possible approach would be loading the data into a temporary table and then handling it with CFML, which can easily be done in batches for heavy processing.
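
For the Oracle database in the question, the rough equivalent would be SQL*Loader or an external table. A minimal sketch of the staging-table idea (all object and datasource names are made up):

```cfm
<!--- Sketch only: assumes the CSV has already been bulk-loaded into a
      staging table (for Oracle, typically via SQL*Loader or an external
      table). Table, column, and datasource names are hypothetical. --->
<cfquery datasource="oracleDSN">
    INSERT INTO import_target (col_a, col_b, col_c)
    SELECT col_a, col_b, col_c      -- only the handful of columns CF needs
    FROM   csv_staging
</cfquery>
<!--- Any per-row processing can then be done against csv_staging from CFML
      in batches, instead of looping over the raw file in memory. --->
```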

沉鱼一梦 2024-12-30 04:34:18


Rather than read the whole file into memory before you start processing it, loop through the file contents by reading it one line at a time. Ben Nadel (of course) has a nice blog post discussing this technique:

Reading In File Data One Line At A Time Using ColdFusion's CFLoop Tag Or Java's LineNumberReader

Also see the livedocs for cfloop, specifically the file attribute.
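
A minimal sketch of that technique using cfloop's file attribute, available since CF8 (path, datasource, procedure name, and column positions are placeholders):

```cfm
<!--- Only the current line is held in memory at any time. Path, datasource,
      procedure name, and column positions are placeholders. --->
<cfloop file="C:\imports\daily.csv" index="line">
    <cfif len(trim(line))>
        <cfstoredproc procedure="pkg_import.insert_record" datasource="oracleDSN">
            <cfprocparam type="in" cfsqltype="cf_sql_varchar" value="#listGetAt(line, 1)#">
            <!--- ...remaining cfprocparam tags for the other columns used --->
        </cfstoredproc>
    </cfif>
</cfloop>
```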

狼性发作 2024-12-30 04:34:18


We have a CF app which imports Real Estate MLS listings and ran into similar issues. The core file we work with is pushing 100MB, and reading it in and looping over it at once created lots of problems. We ended up doing a few things:

  1. Split the file up into chunks. The import process uses the split.exe utility from cygwin to split the file into 4,000 line chunks. We then use CFDIRECTORY to get a list of the chunks and process them one at a time.

  2. For each chunk, we read it in and then split the file contents into an array (using listToArray() with chr(13) as the delimiter).

  3. We loop from 1 to arrayLen(chunkArray) rather than looping over the file contents directly. This was done more for speed than anything else. Within that loop we also break each row into an array. We found that doing this and accessing the values as thisRow[i] (where i is the column number in the file) was much faster than repeatedly calling listGetAt(). The files we're importing have 90+ columns. (See the sketch after this list.)

  4. We increased the JVM memory allowance. Our servers are pretty busy as it is, and this was adding some overhead. We ended up pushing the JVM as high as it would go for a 32-bit server (about 2GB) so that memory was available when needed.
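
A rough sketch of steps 2 and 3 above, with a made-up chunk file name and column positions:

```cfm
<!--- Sketch of processing one 4,000-line chunk; names and positions are
      hypothetical. includeEmptyFields=true (CF9+) keeps empty CSV fields
      so column positions stay stable. --->
<cffile action="read" file="C:\imports\chunks\chunk_aa" variable="chunkData">
<cfset chunkArray = listToArray(chunkData, chr(13))>

<cfloop from="1" to="#arrayLen(chunkArray)#" index="r">
    <cfset thisRow = listToArray(chunkArray[r], ",", true)>
    <cfset listingId = thisRow[1]>
    <cfset listPrice = thisRow[12]>
    <!--- ...insert the row, or queue it for a batched insert --->
</cfloop>
```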

木緿 2024-12-30 04:34:18


Also look at not running a query inside the loop. One of the biggest costs of executing queries is simply making the DB connection.

Any time I have to do multiple inserts like this from a file, I build each SQL INSERT statement and save them in a variable, separated by semicolons. I then execute all of those statements at once, every 100 statements.

I had to rewrite some procedures from another programmer this way and was able to cut the processing time by 90%. This was in version 6, so connection sharing may have improved things since then.
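
Here's a rough sketch of that batching idea in CFML. Whether a single cfquery may contain multiple semicolon-separated statements depends on the driver and datasource settings; with Oracle, a common workaround is to wrap the batch in an anonymous PL/SQL block, as below (table, column, and datasource names are made up):

```cfm
<!--- Sketch only. Values are interpolated directly to mirror the approach
      described; real code should validate/escape them, since cfqueryparam
      cannot be used inside a dynamically concatenated batch like this. --->
<cfset sqlBuffer = []>
<cfloop file="C:\imports\daily.csv" index="line">
    <cfset arrayAppend(sqlBuffer,
        "INSERT INTO table2 (col_a, col_b) VALUES ('#listGetAt(line, 1)#', '#listGetAt(line, 2)#');")>
    <cfif arrayLen(sqlBuffer) GTE 100>
        <cfquery datasource="oracleDSN">
            BEGIN
                #preserveSingleQuotes(arrayToList(sqlBuffer, chr(10)))#
            END;
        </cfquery>
        <cfset sqlBuffer = []>
    </cfif>
</cfloop>
<!--- A final flush of any statements left in sqlBuffer is omitted for brevity. --->
```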
