JDBC FastLoad - out of memory after 20 million addBatch() operations
I am using the Teradata JDBC driver 13.00.00.10, and trying to upload a flat file with 100 million rows to Teradata.
I start off with a clean table.
First I tried to iterate through the entire file, do addBatch() for each of the rows, and only in the end do a single executeBatch():
while ((s = reader.readLine()) != null) {
    String[] columns = StringUtils.split(s, separator);
    for (int j = 0; j < columns.length; j++) {
        st.setString(j + 1, columns[j]);
    }
    st.addBatch();
    i++;
    if (i % 10000 == 0) {
        ULogger.info(this, "imported " + i + " lines.");
    }
}
st.executeBatch();
This quickly consumes all the memory for my application.
I set Xmx to 9 GB, and got an OutOfMemoryError after ~40 million addBatch() calls.
Then I tried periodic executeBatch() - iterate through the file and do an executeBatch() every 20 million addBatch() calls.
while ((s = reader.readLine()) != null) {
    String[] columns = StringUtils.split(s, separator);
    for (int j = 0; j < columns.length; j++) {
        st.setString(j + 1, columns[j]);
    }
    st.addBatch();
    i++;
    if (i % 20000000 == 0) {
        st.executeBatch();
        st.clearWarnings();
    }
}
st.executeBatch();
In this case, the first executeBatch() succeeded.
However, the second executeBatch() failed with "error while beginning FastLoad of database table XXX".
Can anyone explain how I'm supposed to load 100 million rows?
Is there a configuration I am missing (e.g. to tell the driver to push some updates periodically and not hold them in memory)?
Thanks,
A.
I got an answer on the Teradata user forum, which directed me to the FAQ (http://developer.teradata.com/connectivity/faq), where it says that the right way to work is by calling executeBatch() periodically;
however, in order to do that, you need to first turn off autocommit on the connection,
and make sure to commit at the end.
So my code now looks like this:
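What follows is a minimal sketch of that pattern rather than my exact code - the connection URL, credentials, table name, column count, separator and file name are placeholders:

// Snippet only: java.sql.* and java.io.* imports omitted; StringUtils is
// org.apache.commons.lang.StringUtils, as in the snippets above.
Connection con = DriverManager.getConnection(
        "jdbc:teradata://dbhost/TYPE=FASTLOAD,DATABASE=mydb",   // TYPE=FASTLOAD selects JDBC FastLoad
        "user", "password");
con.setAutoCommit(false);                        // 1) turn off autocommit first
PreparedStatement st = con.prepareStatement("INSERT INTO mytable VALUES (?, ?, ?)");
BufferedReader reader = new BufferedReader(new FileReader("flatfile.txt"));
String separator = "\t";

String s;
long i = 0;
while ((s = reader.readLine()) != null) {
    String[] columns = StringUtils.split(s, separator);
    for (int j = 0; j < columns.length; j++) {
        st.setString(j + 1, columns[j]);
    }
    st.addBatch();
    i++;
    if (i % 10000000 == 0) {                     // 2) executeBatch() periodically (10M-row chunks)
        st.executeBatch();                       //    pushes the accumulated rows to the database
    }
}
st.executeBatch();                               // flush the last partial chunk
con.commit();                                    // 3) commit once at the end
con.close();
reader.close();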
It works like a charm.
The only thing which is not 100% clear from the FAQ is what the chunk size should be.
It says something confusing like "as large as your JVM heap can sustain" and also "in future we will recommend to set it to be 50K-100K". These two seem to contradict each other; I've set it to 10 million and it's working fine.
Also - there is apparently a CSV fastloader feature in Teradata JDBC - see http://developer.teradata.com/doc/connectivity/jdbc/reference/current/jdbcug_chapter_2.html#BABFGFA2 - which may give better performance in general.