Processing large JSON files in PHP
I am trying to process somewhat large (possibly up to 200M) JSON files.
The structure of the file is basically an array of objects.
So something along the lines of:
[
{"property":"value", "property2":"value2"},
{"prop":"val"},
...
{"foo":"bar"}
]
Each object has arbitrary properties and does not necessarily share them with the other objects in the array (as in, having the same set of properties).
I want to apply some processing to each object in the array, and as the file is potentially huge, I cannot slurp the whole file content into memory, decode the JSON and iterate over the resulting PHP array.
So ideally I would like to read the file, fetch enough info for each object and process it.
A SAX-type approach would be OK if there was a similar library available for JSON.
Any suggestion on how to deal with this problem best?
I decided to work on an event-based parser. It's not quite done yet; I will edit the question with a link to my work when I roll out a satisfying version.
EDIT:
I finally worked out a version of the parser that I am satisfied with. It's available on GitHub:
https://github.com/kuma-giyomu/JSONParser
There's probably room for some improvement, and I welcome feedback.
Recently I made a library called JSON Machine, which efficiently parses unpredictably big JSON files. Usage is via a simple foreach. I use it myself for my project.
For an example, see https://github.com/halaxa/json-machine
Something like this exists, but only for C++ and Java. Unless you can access one of those libraries from PHP, there is, as far as I know, nothing for this in PHP other than json_decode(), which needs the whole document at once. However, if the JSON is structured that simply, it is easy to read the file up to the next } and then process the JSON collected so far with json_decode(). It is better to do that buffered: read 10 KB and split on }; if no } is found, read another 10 KB, otherwise process the values found. Then read the next block, and so on.
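The buffered idea described above (read a chunk, find where an object ends, decode it, carry on) can be sketched without any library. This is only a minimal sketch under a few assumptions: the top level is an array whose elements are all objects, as in the question, and the helper name streamJsonObjects and its 10 KB default chunk size are made up for illustration. Rather than splitting on every }, which would break on nested objects or on a } inside a string value, it tracks brace depth and string state across chunks and passes each complete object to json_decode():

```php
<?php
declare(strict_types=1);

/**
 * Yields each object of a top-level JSON array, one at a time, while reading
 * the stream in fixed-size chunks. Hypothetical helper, not a library API.
 *
 * @param resource $stream    readable stream positioned at the start of the JSON
 * @param int      $chunkSize number of bytes requested per fread() call
 */
function streamJsonObjects($stream, int $chunkSize = 10240): Generator
{
    $buffer = '';      // text of the object currently being collected
    $depth = 0;        // nesting depth of { } within the current object
    $inString = false; // true while inside a JSON string literal
    $escaped = false;  // true when the previous byte was a backslash in a string

    while (($chunk = fread($stream, $chunkSize)) !== false && $chunk !== '') {
        $length = strlen($chunk);
        for ($i = 0; $i < $length; $i++) {
            $char = $chunk[$i];

            if ($depth === 0) {
                // Between objects: ignore '[', ',', ']' and whitespace.
                if ($char === '{') {
                    $depth = 1;
                    $buffer = '{';
                }
                continue;
            }

            $buffer .= $char;

            if ($inString) {
                if ($escaped) {
                    $escaped = false;       // this byte was escaped, skip it
                } elseif ($char === '\\') {
                    $escaped = true;
                } elseif ($char === '"') {
                    $inString = false;
                }
                continue;
            }

            if ($char === '"') {
                $inString = true;
            } elseif ($char === '{') {
                $depth++;
            } elseif ($char === '}') {
                $depth--;
                if ($depth === 0) {
                    // A complete top-level object has been collected.
                    yield json_decode($buffer, true, 512, JSON_THROW_ON_ERROR);
                    $buffer = '';
                }
            }
        }
    }
}

// Possible usage (the file name is a placeholder):
// $stream = fopen('big.json', 'rb');
// foreach (streamJsonObjects($stream) as $object) { /* process $object */ }
// fclose($stream);
```

Only the text of one object is held in memory at a time. Scanning byte by byte in PHP is slow, so for files in the hundreds of megabytes a dedicated streaming parser is likely the better fit if this turns out to be too slow.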
This is a simple, streaming parser for processing large JSON documents. Use it for parsing very large JSON documents to avoid loading the entire thing into memory, which is how just about every other JSON parser for PHP works.
https://github.com/salsify/jsonstreamingparser
There is http://github.com/sfalvo/php-yajl/, though I haven't used it myself.
I know that the JSON streaming parser https://github.com/salsify/jsonstreamingparser has already been mentioned. But as I have recently(ish) added a new listener to it to try and make it easier to use out of the box, I thought I would (for a change) put some information out about what it does...
There is a very good write-up about the basic parser at https://www.salsify.com/blog/engineering/json-streaming-parser-for-php, but the issue I had with the standard setup was that you always had to write a listener to process a file. This is not always a simple task and can also take a certain amount of maintenance if/when the JSON changes. So I wrote the RegexListener.
The basic principle is to allow you to say which elements you are interested in (via a regular expression) and give it a callback to say what to do when it finds the data. Whilst reading the JSON, it keeps track of the path to each component, similar to a directory structure: /name/forename, or for arrays /items/item/2/partid. This path is what the regex matches against.
An example is (from the source on github)...
Just a couple of explanations...
The /1 is the second element in an array (0-based), so this allows accessing particular instances of elements. /name is the name element. The value is then passed to the closure as $data.
This will select each element of an array and pass it one at a time; as it's using a capture group, this information will be passed as $path. This means that when a set of records is present in a file, you can process each item one at a time, and also know which element it is without having to keep track.
The last one effectively scans for any elements called nested array and passes each one along with where it is in the document.
Another useful feature I found was that if, in a large JSON file, you just want the summary details at the top, you can grab those bits and then just stop...
This saves time when you are not interested in the remaining content.
One thing to note is that these will react to the content, so that each one is triggered when the end of the matching content is found and may be in various orders. But also that the parser only keeps track of the content you are interested in and discards anything else.
If you find any interesting features (sometimes horribly known as bugs), please let me know or report an issue on the github page.
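To make the path idea a little more concrete, here is a rough sketch. It is not the RegexListener code and it is not streaming: it walks a value that has already been decoded with json_decode(), builds the same kind of directory-like path for each component, and calls a handler whenever a pattern matches the whole path. The function name dispatchByPath and the handler signature are assumptions made purely for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Walks decoded JSON, builds a path such as /items/2/partid for every
 * component, and calls each handler whose pattern matches the full path.
 * Hypothetical helper; patterns must not contain the '#' delimiter.
 *
 * @param mixed                   $value    result of json_decode($json, true)
 * @param array<string, callable> $handlers pattern => function ($data, string $path)
 * @param string                  $path     path of $value ('' for the document root)
 */
function dispatchByPath($value, array $handlers, string $path = ''): void
{
    // Visit children first, so a handler fires once its content is complete.
    if (is_array($value)) {
        foreach ($value as $key => $child) {
            dispatchByPath($child, $handlers, $path . '/' . $key);
        }
    }

    if ($path === '') {
        return; // the document root itself has no path to match
    }

    foreach ($handlers as $pattern => $handler) {
        if (preg_match('#^' . $pattern . '$#', $path) === 1) {
            $handler($value, $path);
        }
    }
}

// Possible usage: print the name of the second array element.
// dispatchByPath(json_decode($json, true), [
//     '/1/name' => function ($data, string $path): void {
//         echo $path . '=' . $data . PHP_EOL;
//     },
// ]);
```

Because children are visited before their parent, a handler for /1/name runs before a handler for /1, which is the same kind of ordering effect the note above describes for content-driven callbacks.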
I've written a streaming JSON pull parser, pcrov/JsonReader, for PHP 7 with an API based on XMLReader.
It differs significantly from event-based parsers in that instead of setting up callbacks and letting the parser do its thing, you call methods on the parser to move along or retrieve data as desired. Found your desired bits and want to stop parsing? Then stop parsing (and call close() because it's the nice thing to do).
(For a slightly longer overview of pull vs event-based parsers see XML reader models: SAX versus XML pull parser.)
Example 1:
Read each object as a whole from your JSON.
Output:
Objects get returned as stringly-keyed arrays due (in part) to edge cases where valid JSON would produce property names that are not allowed in PHP objects. Working around these conflicts isn't worthwhile as an anemic stdClass object brings no value over a simple array anyway.
Example 2:
Read each named element individually.
Output:
Example 3:
Read each property of a given name. Bonus: read from a string instead of a URI, plus get data from properties with duplicate names in the same object (which is allowed in JSON, how fun.)
Output:
How exactly to best read through your JSON depends on its structure and what you want to do with it. These examples should give you a place to start.