Best approach to parse huge (extra-large) JSON files
I'm trying to parse a huge JSON file (like http://eu.battle.net/auction-data/258993a3c6b974ef3e6f22ea6f822720/auctions.json) using the gson library (http://code.google.com/p/google-gson/) in Java.
I would like to know the best approach to parse this kind of big file (about 80k lines), and whether you know of a good API that can help me process it.
Some ideas:
- Read line by line and get rid of the JSON format: but that's nonsense.
- Reduce the JSON file by splitting it into many smaller files: but I did not find any good Java API for this.
- Use this file directly as a NoSQL database: keep the file and use it as my database.
3 Answers
I suggest having a look at the Jackson API. It makes it very easy to combine the streaming and tree-model parsing options: you can move through the file as a whole in a streaming way, and then read individual objects into a tree structure.
As an example, let's take the following input:
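The input itself isn't reproduced on this page; a minimal stand-in matching what the answer describes (a top-level object holding a records array whose entries carry field1 and field2, sometimes sparsely and in varying order) might look like this:

```json
{
  "records": [
    { "field1": "aaaaa", "field2": 6 },
    { "field2": 3, "field1": "bbbbb" },
    { "field1": "ccccc" }
  ]
}
```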
Just imagine the fields being sparse or the records having a more complex structure.
The following snippet illustrates how this file can be read using a combination of stream and tree-model parsing. Each individual record is read into a tree structure, but the file is never read into memory in its entirety, making it possible to process JSON files gigabytes in size while using minimal memory.
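The snippet is likewise missing from this page; a minimal sketch of the stream-plus-tree technique the answer describes, assuming Jackson 2.x and the records.json stand-in above (the file name and class name are illustrative), could look like this:

```java
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;

public class StreamTreeDemo {
    public static void main(String[] args) throws Exception {
        // Take the factory from an ObjectMapper so the parser can build tree nodes
        JsonFactory factory = new ObjectMapper().getFactory();
        try (JsonParser jp = factory.createParser(new File("records.json"))) {
            if (jp.nextToken() != JsonToken.START_OBJECT) {
                throw new IllegalStateException("Expected a top-level JSON object");
            }
            while (jp.nextToken() != JsonToken.END_OBJECT) {
                String fieldName = jp.getCurrentName();
                jp.nextToken(); // advance to the field's value
                if ("records".equals(fieldName)) {
                    // Stream through the array, materializing one record at a time
                    while (jp.nextToken() != JsonToken.END_ARRAY) {
                        JsonNode record = jp.readValueAsTree();
                        System.out.println(record.path("field1").asText());
                    }
                } else {
                    jp.skipChildren(); // skip any value we don't care about
                }
            }
        }
    }
}
```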
As you can guess, each call to nextToken() gives the next parsing event: start object, start field, start array, start object, ..., end object, ..., end array, ...
The jp.readValueAsTree() call allows you to read what is at the current parsing position, a JSON object or array, into Jackson's generic JSON tree model. Once you have this, you can access the data randomly, regardless of the order in which things appear in the file (in the example, field1 and field2 are not always in the same order). Jackson supports mapping onto your own Java objects too. The jp.skipChildren() call is convenient: it allows you to skip over a complete object tree or array without having to walk through all the events contained in it.
You don't need to switch to Jackson. Gson 2.1 introduced a new TypeAdapter interface that permits mixed tree and streaming serialization and deserialization.
The API is efficient and flexible. See Gson's Streaming doc for an example of combining tree and binding modes. This is strictly better than mixed streaming and tree modes; with binding you don't waste memory building an intermediate representation of your values.
Like Jackson, Gson has APIs to recursively skip an unwanted value; Gson calls this skipValue().
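As an illustration of the streaming-plus-binding style the Gson Streaming doc covers, here is a minimal sketch, again assuming the records.json stand-in from above (the Record class and file name are illustrative, not from the original answer):

```java
import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;

import java.io.FileReader;

public class GsonStreamDemo {

    // Illustrative POJO matching the records in the sample file
    static class Record {
        String field1;
        Integer field2;
    }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        try (JsonReader reader = new JsonReader(new FileReader("records.json"))) {
            reader.beginObject();
            while (reader.hasNext()) {
                if ("records".equals(reader.nextName())) {
                    reader.beginArray();
                    while (reader.hasNext()) {
                        // Bind each record straight to a POJO; no intermediate tree is built
                        Record record = gson.fromJson(reader, Record.class);
                        System.out.println(record.field1);
                    }
                    reader.endArray();
                } else {
                    reader.skipValue(); // recursively skips the unwanted value
                }
            }
            reader.endObject();
        }
    }
}
```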
The Declarative Stream Mapping (DSM) library allows you to define mappings between your JSON or XML data and your POJOs, so you don't need to write a custom parser. It has powerful scripting (JavaScript, Groovy, JEXL) support. You can filter and transform data while you are reading, and you can call functions for partial data operations as the data is read. DSM reads the data as a stream, so it uses very little memory.
For example, imagine that a snippet like the one below is part of huge and complex JSON data, and we only want to get the staff records with a salary higher than 10000.
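The original answer's sample data isn't shown here; an illustrative stand-in carrying the salary field that the filter needs (names and values are hypothetical) could be:

```json
{
  "staff": [
    { "name": "Alice", "age": 34, "salary": 12000 },
    { "name": "Bob", "age": 41, "salary": 9000 }
  ]
}
```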
First of all, we must define the mapping definition as follows. As you can see, it is just a YAML file that contains the mapping between POJO fields and the fields of the JSON data.
Create a FunctionExecutor to process the staff records.
Use DSM to process the JSON.