How do I fix java OutOfMemoryError: Java heap space in DataImportHandler?
I am trying to import a large dataset (41 million records) into a new Solr index. I have set up the core and it works; I inserted some test docs and they work. I set up the data-config.xml as below and then started the full-import. After about 12 hours(!) the import fails.
The documents can get quite large; could the error be due to a large document (or field), or due to the volume of data going into the DataImportHandler?
How can I get this frustrating import task working!?!
I have included the Tomcat error log below.
Let me know if there is any info I have missed!
logs:
Jun 1, 2011 5:47:55 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity results with URL: jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor
Jun 1, 2011 5:47:56 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 1185
Jun 1, 2011 5:48:02 PM org.apache.solr.core.SolrCore execute
INFO: [results] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
...
Jun 2, 2011 5:16:32 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:664)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:267)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:186)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:353)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:411)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:392)
Caused by: java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringDecoder.decode(Unknown Source)
at java.lang.StringCoding.decode(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at com.microsoft.sqlserver.jdbc.DDC.convertStreamToObject(DDC.java:419)
at com.microsoft.sqlserver.jdbc.ServerDTVImpl.getValue(dtv.java:1974)
at com.microsoft.sqlserver.jdbc.DTV.getValue(dtv.java:175)
at com.microsoft.sqlserver.jdbc.Column.getValue(Column.java:113)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1982)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getValue(SQLServerResultSet.java:1967)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2256)
at com.microsoft.sqlserver.jdbc.SQLServerResultSet.getObject(SQLServerResultSet.java:2265)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.getARow(JdbcDataSource.java:286)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$700(JdbcDataSource.java:228)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:266)
at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.next(JdbcDataSource.java:260)
at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:78)
at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:591)
... 5 more
Jun 2, 2011 5:16:32 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jun 2, 2011 5:16:44 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
data-config.xml:
<dataConfig>
<dataSource type="JdbcDataSource"
driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
url="jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor"
user="sa"
password="password"/>
<document>
<entity name="results" query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK)">
<field column="fielda" name="fielda"/><field column="fieldb" name="fieldb"/><field column="fieldc" name="fieldc"/>
</entity>
</document>
</dataConfig>
solrconfig.xml snippet:
<indexDefaults>
<useCompoundFile>false</useCompoundFile>
<mergeFactor>25</mergeFactor>
<ramBufferSizeMB>128</ramBufferSizeMB>
<maxFieldLength>100000</maxFieldLength>
<writeLockTimeout>10000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
</indexDefaults>
<mainIndex>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>128</ramBufferSizeMB>
<mergeFactor>25</mergeFactor>
<infoStream file="INFOSTREAM.txt">true</infoStream>
</mainIndex>
Java config settings: initial heap 128 MB, max 512 MB
Environment:
Solr 3.1
Tomcat 7.0.12
Windows Server 2008
Java 6 Update 25 (build 1.6.0_25-b06)
(data coming from: SQL Server 2008 R2)
/admin/stats.jsp - DataImportHandler
Status : IDLE
Documents Processed : 2503083
Requests made to DataSource : 1
Rows Fetched : 2503083
Documents Deleted : 0
Documents Skipped : 0
Total Documents Processed : 0
Total Requests made to DataSource : 0
Total Rows Fetched : 0
Total Documents Deleted : 0
Total Documents Skipped : 0
handlerStart : 1306759913518
requests : 9
errors : 0
EDIT: I am currently running a SQL query to find the largest single record's field length, as I think this is probably the cause of the exception. I am also running the import again with JConsole attached to monitor heap usage.
EDIT: Read the Solr performance factors page. Changed maxFieldLength to 1000000 and ramBufferSizeMB to 256. Now for another import run (yay...)
5 Answers
The stack trace makes it pretty obvious that the MS JDBC driver is running out of RAM. Many JDBC drivers default to fetching all their results into memory at once, so see if this can be tuned, or consider using the open-source jTDS driver instead, which is generally better behaved anyway.
I don't believe maxFieldLength is going to help you - that affects how much Lucene truncates, but not how much is initially transferred. Another option is to transfer only a selection at a time, say 1 million rows, using TOP and ROW_NUMBER() and the like for paging.
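A rough sketch of that paging idea against the question's table (illustrative only; the ordering column and the row ranges would need to be chosen to suit your data, and each range run as its own import):
SELECT fielda, fieldb, fieldc
FROM (
    SELECT fielda, fieldb, fieldc,
           ROW_NUMBER() OVER (ORDER BY fielda) AS rn
    FROM mydb.[dbo].mytable WITH (NOLOCK)
) AS paged
WHERE rn BETWEEN 1 AND 1000000;  -- shift this range for each subsequent batch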
I was able to successfully import a large table from SQL Server over JDBC without running into out-of-memory errors by batching the queries manually. In my case, into 256 batches:
data-config.xml (the original example is missing here; a reconstructed sketch follows the notes below):
The parent entity, batch, is a recursive SQL Server CTE that returns the byte values 0-255. This is used to filter the child entity results.
Note 1: The WHERE condition (i.e. CONVERT(varbinary(1), fielda) = ${batch.BatchId}) should be adjusted to the type and contents of the partitioning field so that the results are split into roughly equal batches, e.g. use fielda % 255 = ${batch.BatchId} if fielda is a number. In my case fielda was a uniqueidentifier, so the first byte was sufficient.
Note 2: The rootEntity="false" on the batch entity is required and indicates that the batch entity is NOT a root document.
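Based on the notes above, a minimal sketch of such a batched data-config.xml might look roughly like this (the CTE and WHERE clause are only one way to do it; the connection settings, table, and columns are taken from the question):
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://myserver;databaseName=mydb;responseBuffering=adaptive;selectMethod=cursor"
              user="sa" password="password"/>
  <document>
    <!-- Parent entity: a recursive CTE that emits batch ids 0-255; rootEntity="false" so it is not indexed as a document itself -->
    <entity name="batch" rootEntity="false"
            query="WITH Batches(BatchId) AS (SELECT 0 UNION ALL SELECT BatchId + 1 FROM Batches WHERE BatchId &lt; 255) SELECT BatchId FROM Batches OPTION (MAXRECURSION 255)">
      <!-- Child entity: only the rows whose partitioning byte matches the current batch id -->
      <entity name="results"
              query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK) WHERE CONVERT(varbinary(1), fielda) = ${batch.BatchId}">
        <field column="fielda" name="fielda"/>
        <field column="fieldb" name="fieldb"/>
        <field column="fieldc" name="fieldc"/>
      </entity>
    </entity>
  </document>
</dataConfig>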
For MySQL, this works.
As per the Solr wiki:
DataImportHandler is designed to stream rows one by one. It passes a fetch size value (default: 500) to Statement#setFetchSize, which some drivers do not honor. For MySQL, add a batchSize property to the dataSource configuration with value -1. This will pass Integer.MIN_VALUE to the driver as the fetch size and keep it from going out of memory for large tables.
Should look like:
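Something like this (a sketch; the MySQL connection details are placeholders):
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/mydb"
            user="user" password="password"
            batchSize="-1"/>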
Try the cargo-cult fix for OutOfMemoryError: increase the heap size.
Make it -Xmx1024M or more if you can afford that, with an initial 512M: -Xms512M.
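Since this Solr instance runs under Tomcat on Windows, one common way to set that (assuming Tomcat is started via the scripts rather than as a Windows service, where the heap is configured in the service settings instead) is a setenv.bat in Tomcat's bin directory:
rem %CATALINA_BASE%\bin\setenv.bat - picked up by catalina.bat at startup
set JAVA_OPTS=-Xms512m -Xmx1024m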
I've found that with large record sets things can get a bit hairy. You do have a couple of options.
The quick and most likely best option would be to allocate more memory to the indexing system! Memory (for the most part) is pretty cheap.
The other thing I might try is chunking the data.
Depending on the size of your "documents", 41M docs may bog you down when it comes to searching, so you may want to shard the documents. When using DIH I try to facilitate partitioning using query-string parameters. You can do this by referencing the appropriate parameter passed into the DIH using ${dataimporter.request.MYPARAMETERNAME} within the query statement.
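For example (a sketch only; the numeric id column and the startid/endid parameter names are hypothetical), the entity query could look like:
<entity name="results"
        query="SELECT fielda, fieldb, fieldc FROM mydb.[dbo].mytable WITH (NOLOCK)
               WHERE id &gt;= ${dataimporter.request.startid} AND id &lt; ${dataimporter.request.endid}">
  <field column="fielda" name="fielda"/><field column="fieldb" name="fieldb"/><field column="fieldc" name="fieldc"/>
</entity>
Each chunk would then be imported with something like /dataimport?command=full-import&clean=false&startid=0&endid=1000000 (clean=false so later chunks do not wipe out the documents indexed by earlier ones).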