Best way to parse huge (very large) JSON files

Posted 2025-01-08 02:09:36

I'm trying to parse some huge JSON files (like http://eu.battle.net/auction-data/258993a3c6b974ef3e6f22ea6f822720/auctions.json) using the gson library (http://code.google.com/p/google-gson/) in Java.

I would like to know the best approach to parse this kind of big file (about 80k lines), and whether you know of a good API that can help me process it.

Some ideas:

  1. Read it line by line and strip the JSON formatting: but that's nonsense.
  2. Reduce the JSON file by splitting it into many smaller files: but I did not find a good Java API for this (see the sketch after this list).
  3. Use the file directly as a NoSQL database: keep the file and use it as my database.
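
As an illustration of idea 2, here is a minimal sketch of how the splitting could be done with Gson's streaming classes (JsonReader and JsonWriter), holding only one record in memory at a time. The top-level field name "records" and the chunk size are assumptions for illustration, not the actual layout of the auction file.

import com.google.gson.Gson;
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonWriter;

import java.io.FileReader;
import java.io.FileWriter;

// Splits a top-level "records" array (assumed name) into files of
// CHUNK elements each, without ever building the full document in memory.
public class JsonSplitter {
    private static final int CHUNK = 10000; // records per output file, arbitrary

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        JsonParser parser = new JsonParser();
        JsonReader in = new JsonReader(new FileReader(args[0]));
        in.beginObject();
        while (in.hasNext()) {
            if (in.nextName().equals("records")) {
                in.beginArray();
                int part = 0;
                while (in.hasNext()) {
                    JsonWriter out = new JsonWriter(new FileWriter("part-" + part++ + ".json"));
                    out.beginArray();
                    for (int i = 0; i < CHUNK && in.hasNext(); i++) {
                        // parse exactly one record into memory, write it out, discard it
                        JsonElement record = parser.parse(in);
                        gson.toJson(record, out);
                    }
                    out.endArray();
                    out.close();
                }
                in.endArray();
            } else {
                in.skipValue(); // ignore other top-level fields
            }
        }
        in.endObject();
        in.close();
    }
}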

Comments (3)

手长情犹 2025-01-15 02:09:36

I suggest having a look at the Jackson API: it makes it easy to combine the streaming and tree-model parsing options. You can move through the file as a whole in a streaming way, and read individual objects into a tree structure.

As an example, let's take the following input:

{
  "records": [
    {"field1": "aaaaa", "bbbb": "ccccc"},
    {"field2": "aaa", "bbb": "ccc"}
  ],
  "special message": "hello, world!"
}

Just imagine the fields being sparse or the records having a more complex structure.

The following snippet illustrates how this file can be read using a combination of streaming and tree-model parsing. Each individual record is read into a tree structure, but the file is never read into memory in its entirety, making it possible to process JSON files gigabytes in size while using minimal memory.

// Jackson 1.x (org.codehaus) API, as used in the original answer
import org.codehaus.jackson.map.*;
import org.codehaus.jackson.*;

import java.io.File;

public class ParseJsonSample {
    public static void main(String[] args) throws Exception {
        // MappingJsonFactory adds tree/data-binding support on top of the streaming parser
        JsonFactory f = new MappingJsonFactory();
        JsonParser jp = f.createJsonParser(new File(args[0]));
        JsonToken current = jp.nextToken();
        if (current != JsonToken.START_OBJECT) {
            System.out.println("Error: root should be object: quitting.");
            return;
        }
        while (jp.nextToken() != JsonToken.END_OBJECT) {
            String fieldName = jp.getCurrentName();
            // move from field name to field value
            current = jp.nextToken();
            if (fieldName.equals("records")) {
                if (current == JsonToken.START_ARRAY) {
                    // For each of the records in the array
                    while (jp.nextToken() != JsonToken.END_ARRAY) {
                        // read the record into a tree model,
                        // this moves the parsing position to the end of it
                        JsonNode node = jp.readValueAsTree();
                        // And now we have random access to everything in the object;
                        // path() returns a "missing" node instead of null when a field is absent
                        System.out.println("field1: " + node.path("field1").getValueAsText());
                        System.out.println("field2: " + node.path("field2").getValueAsText());
                    }
                } else {
                    System.out.println("Error: records should be an array: skipping.");
                    jp.skipChildren();
                }
            } else {
                System.out.println("Unprocessed property: " + fieldName);
                jp.skipChildren();
            }
        }
    }
}

As you can guess, each call to nextToken() gives the next parsing event: start object, field name, start array, start object, ..., end object, ..., end array, ...

The jp.readValueAsTree() call lets you read what is at the current parsing position (a JSON object or array) into Jackson's generic JSON tree model. Once you have this, you can access the data in any order, regardless of the order in which things appear in the file (in the example, field1 and field2 are not always in the same order). Jackson supports mapping onto your own Java objects too. The jp.skipChildren() call is convenient: it lets you skip over a complete object tree or array without having to walk through all the events contained in it yourself.
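
Since the answer mentions that Jackson also supports mapping onto your own Java objects, here is a minimal sketch of that variant, assuming a hypothetical Record POJO for the sample input above; only the inner loop of the program changes:

import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.annotate.JsonIgnoreProperties;

// Hypothetical POJO for one element of the "records" array.
@JsonIgnoreProperties(ignoreUnknown = true) // tolerate the sparse/unknown fields
class Record {
    public String field1;
    public String field2;
}

// Replacement for the START_ARRAY loop: bind each element directly to a
// POJO instead of reading it into a tree. This works because the factory
// is a MappingJsonFactory, which provides the data binding.
while (jp.nextToken() != JsonToken.END_ARRAY) {
    Record rec = jp.readValueAs(Record.class);
    System.out.println("field1: " + rec.field1 + ", field2: " + rec.field2);
}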

套路撩心 2025-01-15 02:09:36

You don't need to switch to Jackson. Gson 2.1 introduced a new TypeAdapter interface that permits mixed tree and streaming serialization and deserialization.

The API is efficient and flexible. See Gson's Streaming doc for an example of combining tree and binding modes. This is strictly better than mixed streaming and tree modes; with binding you don't waste memory building an intermediate representation of your values.

Like Jackson, Gson has APIs to recursively skip an unwanted value; Gson calls this skipValue().
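
For reference, here is a minimal sketch of what this looks like with Gson's streaming API (JsonReader) combined with per-element binding, reusing the "records" sample input from the first answer; the Record POJO is a hypothetical class for that sample, not part of Gson:

import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;

import java.io.FileReader;

public class GsonStreamSample {
    // Hypothetical POJO for one element of the "records" array.
    static class Record {
        String field1;
        String field2; // Gson simply leaves absent fields null
    }

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        JsonReader reader = new JsonReader(new FileReader(args[0]));
        reader.beginObject(); // root object
        while (reader.hasNext()) {
            if (reader.nextName().equals("records")) {
                reader.beginArray();
                while (reader.hasNext()) {
                    // bind one array element at a time; memory use stays bounded
                    Record rec = gson.fromJson(reader, Record.class);
                    System.out.println("field1: " + rec.field1 + ", field2: " + rec.field2);
                }
                reader.endArray();
            } else {
                reader.skipValue(); // skip values we do not care about
            }
        }
        reader.endObject();
        reader.close();
    }
}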

墨小墨 2025-01-15 02:09:36

The Declarative Stream Mapping (DSM) library lets you define mappings between your JSON or XML data and your POJOs, so you don't need to write a custom parser. It has powerful scripting support (JavaScript, Groovy, JEXL). You can filter and transform data while reading, and call functions to operate on partial data as it is read. DSM reads the data as a stream, so it uses very little memory.

For example, take the following input:

{
    "company": {
         ....
        "staff": [
            {
                "firstname": "yong",
                "lastname": "mook kim",
                "nickname": "mkyong",
                "salary": "100000"
            },
            {
                "firstname": "low",
                "lastname": "yin fong",
                "nickname": "fong fong",
                "salary": "200000"
            }
        ]
    }
}

Imagine the above snippet is part of a huge and complex JSON document. We only want to get the staff whose salary is higher than 100000.

First, we define the mapping as follows. As you can see, it is just a YAML file containing the mapping between POJO fields and the fields of the JSON data.

result:
      type: object     # result is a map or an object
      path: /.+staff   # path is a regex; it matches /company/staff
      function: processStuff  # call the processStuff function when a /company/staff object is closed
      filter: self.data.salary>100000   # any expression valid in JavaScript, Groovy or JEXL
      fields:
        name:
          path: firstname
        sureName:
          path: lastname
        userName:
          path: nickname
        salary: long
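
For reference, a hypothetical POJO matching the field names of the mapping above (the original answer does not show this class):

// Hypothetical POJO; the field names correspond to the mapping definition.
public class Staff {
    private String name;     // mapped from "firstname"
    private String sureName; // mapped from "lastname"
    private String userName; // mapped from "nickname"
    private long salary;     // mapped from "salary"
    // getters and setters omitted for brevity
}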

Create a FunctionExecutor to process each staff record.

FunctionExecutor processStuff = new FunctionExecutor() {

    @Override
    public void execute(Params params) {
        // to bind the record directly to the Staff class instead:
        // Staff staff = params.getCurrentNode().toObject(Staff.class);

        Map<String, Object> staff = (Map<String, Object>) params.getCurrentNode().toObject();
        System.out.println(staff);
        // process the record here: save to DB, call a service, etc.
    }
};

Use DSM to process the JSON:

DSMBuilder builder = new DSMBuilder(new File("path/to/mapping.yaml")).setType(DSMBuilder.TYPE.JSON);

// register the processStuff function
builder.registerFunction("processStuff", processStuff);

DSM dsm = builder.create();
Object object = dsm.toObject(jsonContent); // jsonContent holds the JSON input

Output (only staff with a salary higher than 100000 are included):

{firstName=low, lastName=yin fong, nickName=fong fong, salary=200000}