使用Hibernate的ScrollableResults缓慢读取9000万条记录

发布于 2024-09-01 13:22:24 字数 994 浏览 10 评论 0原文

我只需要使用 Hibernate 读取 MySQL 数据库表中的每一行，并根据它写入一个文件。但有 9000 万行，而且相当大。所以看起来下面的内容是合适的：

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
            .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
    storeInFile(results.get()[0]);

问题是上面的代码将尝试将所有 9000 万行加载到 RAM 中，然后再进入 while 循环...这将通过 OutOfMemoryError: Java heap space excepts 杀死我的内存:(所以

我想 ScrollableResults 不是我想要的？处理这个问题的正确方法是什么

？处理这个问题的方法是使用 setFirstResult 和 setMaxResults 来迭代结果，并且只使用常规的 Hibernate 结果而不是 ScrollableResults ，但这感觉效率很低，并且当我在 89 上调用 setFirstResult 时会开始花费相当长的时间。第百万行...

更新：setFirstResult/setMaxResults 不起作用，结果像我担心的那样需要很长的时间才能达到偏移量，这不是一个非常标准的过程吗？我愿意放弃 Hibernate 并使用 JDBC 或任何需要的东西。

更新2：我提出的解决方案工作正常，但不是很好，基本上是这样的形式：

select * from person where id > <offset> and <other_conditions> limit 1

由于我有其他条件，即使全部都在索引中，它仍然没有我希望的那么快。 ..所以仍然欢迎其他建议..

原文

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
            .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
    storeInFile(results.get()[0]);

The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.

So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).

I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...

UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.

UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:

select * from person where id > <offset> and <other_conditions> limit 1

Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

红衣飘飘貌似仙 2024-09-08 13:22:24

据我所知，使用 setFirstResult 和 setMaxResults 是您唯一的选择。

传统上，可滚动结果集只会根据需要将行传输到客户端。不幸的是，MySQL Connector/J 实际上伪造了它，它执行整个查询并将其传输到客户端，因此驱动程序实际上将整个结果集加载到 RAM 中，并将其滴灌给您（内存不足问题就证明了这一点）。你的想法是正确的，这只是 MySQL java 驱动程序的缺点。

我发现没有办法解决这个问题，所以使用常规的 setFirst/max 方法加载大块。很抱歉带来坏消息。

只需确保使用无状态会话，这样就没有会话级缓存或脏跟踪等。

编辑：

除非您突破 MySQL J/Connector，否则您的 UPDATE 2 是您将获得的最好的。尽管您没有理由不能提高查询限制。如果您有足够的 RAM 来保存索引，这应该是一个比较便宜的操作。我会稍微修改一下，一次抓取一批，然后使用该批次的最高 id 来抓取下一批。

注意：只有当other_conditions使用相等（不允许范围条件）并且将索引的最后一列作为id时，这才有效。

select * 
from person 
where id > <max_id_of_last_batch> and <other_conditions> 
order by id asc  
limit <batch_size>

Using setFirstResult and setMaxResults is your only option that I'm aware of.

Traditionally a scrollable resultset would only transfer rows to the client on an as required basis. Unfortunately the MySQL Connector/J actually fakes it, it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip feed it to you (evidenced by your out of memory problems). You had the right idea, it's just shortcomings in the MySQL java driver.

I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.

Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.

EDIT:

Your UPDATE 2 is the best you're going to get unless you break out of the MySQL J/Connector. Though there's no reason you can't up the limit on the query. Provided you have enough RAM to hold the index this should be a somewhat cheap operation. I'd modify it slightly, and grab a batch at a time, and use the highest id of that batch to grab the next batch.

Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.

select * 
from person 
where id > <max_id_of_last_batch> and <other_conditions> 
order by id asc  
limit <batch_size>

回复收藏 0 原文

﹏雨一样淡蓝的深情 2024-09-08 13:22:24

您应该能够使用 ScrollableResults，尽管它需要一些魔法才能使用 MySQL。我在博客文章中写下了我的发现（http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/），但我将在这里总结：

“ [JDBC] 文档说：

To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

这可以使用 Hibernate API 3.2+ 版本中的 Query 接口（这也适用于 Criteria）来完成：

Query query = session.createQuery(query);
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
    Object row = results.get();
    // process row then release reference
    // you may need to evict() as well
}
results.close();

这允许您流式传输结果集，但是 Hibernate 仍将结果缓存在Session，因此如果您只是经常阅读，则需要调用 session.evict() 或 session.clear() 。数据，您可能会考虑使用 StatelessSession，但您应该事先阅读其文档。”

You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:

"The [JDBC] documentation says:

To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);

This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:

Query query = session.createQuery(query);
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
    Object row = results.get();
    // process row then release reference
    // you may need to evict() as well
}
results.close();

This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."

回复收藏 0 原文

德意的啸 2024-09-08 13:22:24

将查询中的提取大小设置为最佳值，如下所示。

另外，当不需要缓存时，使用 StatelessSession 可能会更好。

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize( 1000 ) // <<--- !!!!
        .setCacheable(false).scroll(ScrollMode.FORWARD_ONLY)

Set fetch size in query to an optimal value as given below.

Also, when caching is not required, it may be better to use StatelessSession.

ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize( 1000 ) // <<--- !!!!
        .setCacheable(false).scroll(ScrollMode.FORWARD_ONLY)

回复收藏 0 原文

仄言 2024-09-08 13:22:24

FetchSize 必须为 Integer.MIN_VALUE，否则无法工作。

它必须从字面上取自官方参考： https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

回复收藏 0 原文

诗酒趁年少 2024-09-08 13:22:24

实际上，如果您使用了这里提到的答案，您可能已经得到了您想要的东西——MySQL 的低内存可滚动结果：

使用 MySQL 流式传输大型结果集

请注意，您将遇到 Hibernate 延迟加载的问题，因为它会在滚动完成之前执行的任何查询引发异常。

回复收藏 0 原文

活泼老夫 2024-09-08 13:22:24

有 9000 万条记录，听起来您应该对 SELECT 进行批处理。当初始加载到分布式缓存中时，我已经使用了 Oracle。查看 MySQL 文档，等效的似乎是使用 LIMIT 子句： http://dev.mysql.com/doc/refman/5.0/en/select.html

下面是一个示例：

SELECT * from Person
LIMIT 200, 100

这将返回 Person 表的第 201 行到第 300 行。

您需要首先从表中获取记录计数，然后将其除以批量大小，并从中计算出循环和 LIMIT 参数。

这样做的另一个好处是并行性 - 您可以在此并行执行多个线程以加快处理速度。

处理 9000 万条记录听起来也不像使用 Hibernate 的最佳选择。

With 90 million records, it sounds like you should be batching your SELECTs. I've done with with Oracle when doing the initial load into a distrbuted cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html

Here's an example:

SELECT * from Person
LIMIT 200, 100

This would return rows 201 through 300 of the Person table.

You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.

The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.

Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.

回复收藏 0 原文

最偏执的依靠 2024-09-08 13:22:24

我建议的不仅仅是一个示例代码，而是一个基于Hibernate 为您执行此解决方法（分页、滚动和清除 Hibernate 会话）。

它还可以轻松地适应使用EntityManager。

回复收藏 0 原文

残龙傲雪 2024-09-08 13:22:24

问题可能是，Hibernate 会保留对会话中所有对象的引用，直到您关闭会话为止。这与查询缓存无关。将对象写入文件后，也许从会话中 evict() 对象会有所帮助。如果会话不再引用它们，垃圾收集器可以释放内存，并且您将不会再耗尽内存。

回复收藏 0 原文

乱了心跳 2024-09-08 13:22:24

我之前已经成功使用过 Hibernate 滚动功能，但没有读取整个结果集。有人说 MySQL 没有真正的滚动游标，但它声称基于 JDBC dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE) 并围绕它进行搜索好像其他人也用过。确保它没有缓存会话中的 Person 对象 - 我在没有实体可缓存的 SQL 查询中使用过它。您可以在循环末尾调用 evict 来确定或使用 sql 查询进行测试。还可以使用 setFetchSize 来优化到服务器的行程次数。

回复收藏 0 原文

愁杀 2024-09-08 13:22:24

最近我解决了这样的问题，我写了一篇关于如何面对这个问题的博客。非常喜欢，希望对任何人都有帮助。
我使用惰性列表方法和部分获取。 i 将查询的限制和偏移量或分页替换为手动分页。
在我的示例中，选择返回 1000 万条记录，我获取它们并将它们插入到“临时表”中：

create or replace function load_records ()
returns VOID as $
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
 join table2 t2 on (t2.fieldpk = t1.fieldpk)
 join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$ language plpgsql;

之后，我可以在不计算每一行的情况下进行分页，但使用分配的序列：

select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000

从 java 角度来看，我通过以下方式实现了此分页：使用惰性列表进行部分获取。这是一个从 Abstract list 扩展并实现 get() 方法的列表。 get方法可以使用数据访问接口继续获取下一组数据并释放内存堆：

@Override
public E get(int index) {
  if (bufferParcial.size() <= (index - lastIndexRoulette))
  {
    lastIndexRoulette = index;
    bufferParcial.removeAll(bufferParcial);
    bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
    if (bufferParcial.isEmpty())
    {
        return null;
    }

  }
  return bufferParcial.get(index - lastIndexRoulette);<br>
}

另一方面，数据访问接口使用查询分页并实现一个方法逐步迭代，每次25000条记录完成全部。

这种方法的结果可以在这里看到
http://www.arquitecturaysoftware.co/2013/10 /laboratorio-1-iterar-millones-de.html

recently i worked over a problem like this, and i wrote a blog about how face that problem. is very like, i hope be helpfull for any one.
i use lazy list approach with partial adquisition. i Replaced the limit and offset or the pagination of query to a manual pagination.
In my example, the select returns 10 millions of records, i get them and insert them in a "temporal table":

create or replace function load_records ()
returns VOID as $
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
 join table2 t2 on (t2.fieldpk = t1.fieldpk)
 join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$ language plpgsql;

after that, i can paginate without count each row but using the sequence assigned:

select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000

From java perspective, i implemented this pagination through partial adquisition with a lazy list. this is, a list that extends from Abstract list and implements get() method. The get method can use a data access interface to continue get next set of data and release the memory heap:

@Override
public E get(int index) {
  if (bufferParcial.size() <= (index - lastIndexRoulette))
  {
    lastIndexRoulette = index;
    bufferParcial.removeAll(bufferParcial);
    bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
    if (bufferParcial.isEmpty())
    {
        return null;
    }

  }
  return bufferParcial.get(index - lastIndexRoulette);<br>
}

by other hand, the data access interface use query to paginate and implements one method to iterate progressively, each 25000 records to complete it all.

results for this approach can be seen here
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

回复收藏 0 原文