Convert an XML file to CSV in Java
Here is an example XML (case 1):
<root>
<Item>
<ItemID>4504216603</ItemID>
<ListingDetails>
<StartTime>10:00:10.000Z</StartTime>
<EndTime>10:00:30.000Z</EndTime>
<ViewItemURL>http://url</ViewItemURL>
....
</Item>
Here is an example XML (case 2):
<Item>
<ItemID>4504216604</ItemID>
<ListingDetails>
<StartTime>10:30:10.000Z</StartTime>
<!-- Start difference from case 1 -->
<averages>
<AverageTime>value1</AverageTime>
<category type="TX">9823</category>
<category type="TY">9112</category>
<AveragePrice>value2</AveragePrice>
</averages>
<!-- End difference from case 1 -->
<EndTime>11:00:10.000Z</EndTime>
<ViewItemURL>http://url</ViewItemURL>
....
</Item>
</root>
I borrowed this XML from Google. My objects are not always the same; sometimes there are extra elements, as in case 2. I'd like to produce a CSV like this from both cases:
ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,value2
The first line is the header, and it should also be included in the CSV. I got some useful links about StAX today, but I don't really know what the right/optimal approach is, so I am looking for ideas.
Update 1
I forgot to mention that this is a huge XML file, up to 1 GB.
Update 2
I'm looking for a more generic approach, meaning that this should work for any number of nodes at any depth. Sometimes, as in the example XML, one Item object has more nodes than the next or previous one, so that case must be handled too (so that all columns and values line up in the CSV).
Also, nodes can have the same name/localName but different values and attributes; in that case a new column should appear in the CSV with the appropriate value. (I added an example of this case inside the <averages> tag, called category.)
Comments (8)
Note that this would be a prime example for XSLT, except that most XSLT processors read the whole XML file into memory, which is not an option here because the file is large. Note, however, that the enterprise edition of Saxon can do streaming XSLT processing (if the XSLT script adheres to the restrictions).
You may also want to use an external XSLT processor outside your JVM instead, if applicable. This opens up several more options.
Streaming in Saxon-EE: http://www.saxonica.com/documentation/sourcedocs/serial.html
You could use XStream (http://x-stream.github.io/) or JOX (http://www.wutka.com/jox.html) to parse the XML and convert it to Java beans. I think that once you have the beans, you can convert them to CSV automatically.
I am not convinced that SAX is the best approach for you.
There are different ways you could use SAX here, though.
If element order is not guaranteed within certain elements, like ListingDetails, then you need to be proactive.
When you start a ListingDetails, initialize a map as a member variable on the handler. In each subelement, set the appropriate key-value pair in that map. When you finish a ListingDetails, examine the map and explicitly fill in placeholder values, such as nulls, for any missing elements. Assuming you have one ListingDetails per item, save it to a member variable in the handler.
Now, when your item element ends, call a function that writes the CSV line from the map in the order you want.
The risk with this is corrupted XML. I would strongly consider setting all these variables to null when an item starts, and then checking for errors and reporting them when the item ends.
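The map-per-item idea described above could look roughly like the following. This is a minimal sketch under stated assumptions, not a definitive implementation: the class name `ItemSaxToCsv`, the fixed `COLUMNS` list, and the `convert` helper are hypothetical, the element names are taken from the question's example, and no CSV escaping is done.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ItemSaxToCsv {
    // Assumption: the wanted columns and their order are known up front.
    static final List<String> COLUMNS = List.of(
        "ItemID", "StartTime", "EndTime", "ViewItemURL", "AverageTime", "AveragePrice");

    // Parses the XML and returns the CSV text (header line plus one line per Item).
    static String convert(String xml) throws Exception {
        StringBuilder out = new StringBuilder(String.join(",", COLUMNS)).append('\n');
        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> row;                 // values of the current Item
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("Item")) {
                    row = new LinkedHashMap<>();             // reset state when an item starts
                }
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);              // may be called several times per element
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("Item")) {
                    // Write the row in the wanted order; missing elements become empty cells.
                    List<String> cells = new ArrayList<>();
                    for (String col : COLUMNS) {
                        cells.add(row.getOrDefault(col, ""));
                    }
                    out.append(String.join(",", cells)).append('\n');
                    row = null;
                } else if (row != null && COLUMNS.contains(qName)) {
                    row.put(qName, text.toString().trim());
                }
                text.setLength(0);
            }
        };
        // The default factory is not namespace-aware, so qName carries the element name.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }
}
```

For a 1 GB file you would pass a file-backed `InputSource` and write each line to a `Writer` instead of collecting everything in a `StringBuilder`; the string-based version here only keeps the sketch short.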
The best way to code this, based on the requirement you describe, is to use FreeMarker's easy XML processing feature. See the docs.
In this case you only need a template that produces the CSV.
An alternative is XMLGen, which takes a very similar approach. Just look at the diagram and examples; instead of SQL statements, you will output CSV.
These two similar approaches are not "conventional", but they do the job very quickly for your situation, and you don't have to learn XSL (which I think is quite hard to master).
Here is some code that implements the conversion of the XML to CSV using StAX. Although the XML you gave is only an example, I hope this shows you how to handle the optional elements.
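As an illustration of the StAX approach, here is a minimal sketch under stated assumptions; it is not the answerer's original code. The class name `StaxItemsToCsv`, the fixed `COLUMNS` list, and the `convert` signature are hypothetical, and values are written without CSV escaping. Optional elements that are absent from an Item simply produce empty cells.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxItemsToCsv {
    // Assumption: the wanted columns and their order are known up front.
    static final List<String> COLUMNS = List.of(
        "ItemID", "StartTime", "EndTime", "ViewItemURL", "AverageTime", "AveragePrice");

    // Streams the XML from 'in' and writes a header line plus one CSV line per Item to 'out'.
    static void convert(Reader in, Writer out) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        out.write(String.join(",", COLUMNS) + "\n");

        Map<String, String> row = null;   // values of the Item being read, or null outside an Item
        String current = null;            // name of the wanted element we are inside, if any
        StringBuilder text = new StringBuilder();

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    String name = reader.getLocalName();
                    if (name.equals("Item")) {
                        row = new HashMap<>();
                    } else if (row != null && COLUMNS.contains(name)) {
                        current = name;
                        text.setLength(0);
                    }
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (current != null) {
                        text.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT: {
                    String name = reader.getLocalName();
                    if (name.equals(current)) {
                        row.put(current, text.toString().trim());
                        current = null;
                    } else if (name.equals("Item") && row != null) {
                        // Optional elements that never appeared become empty cells.
                        List<String> cells = new ArrayList<>();
                        for (String col : COLUMNS) {
                            cells.add(row.getOrDefault(col, ""));
                        }
                        out.write(String.join(",", cells) + "\n");
                        row = null;
                    }
                    break;
                }
                default:
                    break;
            }
        }
        reader.close();
        out.flush();
    }
}
```

Because the reader pulls one event at a time and only the current row is held in memory, the same loop should work with a `FileReader` and a `BufferedWriter` for large inputs.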
I'm not sure I understand how generic the solution should be. Do you really want to parse a 1 GB file twice for a generic solution? And if you want something generic, why did you skip the <category> element in your example? How many different formats do you need to handle? Do you really not know what the format can be (even if some elements can be omitted)? Can you clarify? In my experience, it's generally preferable to parse specific files in a specific way (this doesn't exclude using a generic API, though). My answer will go in this direction (and I'll update it after the clarification).
If you don't feel comfortable with XML, you could consider using some existing (commercial) libraries, for example Ricebridge XML Manager and CSV Manager. See How to convert CSV into XML and XML into CSV using Java for a full example. The approach is pretty straightforward: you define the data fields using XPath expressions (which is perfect in your case since you can have "extra" elements), parse the file, and then pass the resulting List to the CSV component to generate the CSV file. The API looks simple, the code is tested (the source code of their test cases is available under a BSD-style license), and they claim to support gigabyte-sized files. You can get a Single Developer license for $170, which is not very expensive compared to developer daily rates.
They offer 30-day trial versions; have a look.
Another option would be to use Spring Batch. Spring Batch offers everything required to work with XML files as input or output (using StAX and the XML binding framework of your choice) and flat files as input or output.
You could also use Smooks to do XML to CSV transformations.
Another option would be to roll your own solution, using a StAX parser or, why not, using VTD-XML and XPath.
The code provided should be considered a sketch rather than the definitive article. I am not an expert on SAX, and the implementation could be improved for better performance, simpler code, etc. That said, SAX should be able to cope with streaming large XML files.
I would approach this problem with two passes using the SAX parser. (Incidentally, I would also use a CSV-generating library to create the output, as this would deal with all the fiddly character escaping that CSV involves, but I haven't implemented this in my sketch.)
First pass:
Establish the number of header columns
Second pass:
Output the CSV
I assume that the XML file is well formed. I assume that we don't have a schema/DTD with a predefined order.
In the first pass, I have assumed that a CSV column will be added for every XML element containing text content and for any attribute (I have assumed attributes will contain something!).
The second pass, having established the number of target columns, will do the actual CSV output.
Based on your example XML, my code sketch would produce:
Please note that I have used the Google Collections LinkedHashMultimap, as this is helpful when associating multiple values with a single key. I hope you find this useful!
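The two-pass idea described above can be sketched with only the JDK as follows. This is a simplified illustration under stated assumptions, not the answerer's original code: the class name `TwoPassSaxCsv`, the `scan` and `convert` helpers, and the `name[attr=value]` column-key scheme are all hypothetical, a plain `LinkedHashMap` stands in for `LinkedHashMultimap` (so a repeated key within one Item overwrites the earlier value), mixed content is not handled, and no CSV escaping is done.

```java
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TwoPassSaxCsv {
    // Calls onRow once per <Item> with (column key -> text) for every element that has text.
    static void scan(Reader in, Consumer<Map<String, String>> onRow) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            private Map<String, String> row;
            private final Deque<String> keys = new ArrayDeque<>();
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("Item")) {
                    row = new LinkedHashMap<>();
                }
                // Assumed key scheme: element name plus its attributes, e.g. category[type=TX].
                StringBuilder key = new StringBuilder(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    key.append('[').append(atts.getQName(i)).append('=')
                       .append(atts.getValue(i)).append(']');
                }
                keys.push(key.toString());
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                String key = keys.pop();
                String value = text.toString().trim();
                if (qName.equals("Item")) {
                    onRow.accept(row);
                    row = null;
                } else if (row != null && !value.isEmpty()) {
                    row.put(key, value);
                }
                text.setLength(0);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(in), handler);
    }

    // 'source' must return a fresh Reader each time, because the input is read twice.
    static String convert(Supplier<Reader> source) throws Exception {
        Set<String> columns = new LinkedHashSet<>();
        scan(source.get(), row -> columns.addAll(row.keySet()));      // pass 1: collect columns
        StringBuilder out = new StringBuilder(String.join(",", columns)).append('\n');
        scan(source.get(), row -> {                                    // pass 2: write rows
            List<String> cells = new ArrayList<>();
            for (String column : columns) {
                cells.add(row.getOrDefault(column, ""));
            }
            out.append(String.join(",", cells)).append('\n');
        });
        return out.toString();
    }
}
```

Reading the file twice costs extra I/O, but it means only the set of column names and the current row need to stay in memory, which is the trade-off this answer accepts for large files.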
This looks like a good case for using XSL. Given your basic requirements, it may be easier to get at the right nodes with XSL than with custom parsers or serializers. The benefit is that your XSL could target "//Item//AverageTime", or whatever nodes you require, without worrying about node depth.
UPDATE: The following is the XSLT I threw together to make sure this worked as expected.
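To illustrate the depth-independent selection, here is a minimal sketch that runs a small XSLT 1.0 stylesheet through the JDK's `javax.xml.transform` API. It is not the answerer's original stylesheet: the stylesheet text, the class name `XsltItemsToCsv`, and the fixed column list are assumptions made for illustration. The JDK's built-in processor loads the whole document into memory, so this form is only suitable for inputs much smaller than the 1 GB file mentioned in the question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltItemsToCsv {
    // Hypothetical stylesheet: one CSV line per Item; ".//Name" finds the element at any depth.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      + "<xsl:text>ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice&#10;</xsl:text>"
      + "<xsl:for-each select='//Item'>"
      + "<xsl:value-of select='.//ItemID'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//StartTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//EndTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//ViewItemURL'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//AverageTime'/><xsl:text>,</xsl:text>"
      + "<xsl:value-of select='.//AveragePrice'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML text and returns the CSV text.
    static String convert(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

An element that is missing from an Item selects an empty node set, so `xsl:value-of` writes nothing for it and the cell stays empty. Line endings in the output may follow the platform's line separator, depending on the XSLT processor.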