Java: Given a list of file names, ensure the corresponding XML contains only information about those files
I have a list of files (20,000 to 50,000 of them) and a large XML file. I want the XML file to contain only information about the files in the list.
For example, suppose only file XYZ is on our list, and the XML file looks like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<index>
<document>
<entry number="1">
<commentfield>
<name>FileName</name>
<value>XYZ</value>
</commentfield>
</entry>
<entry number="2">
<commentfield>
<name>Note</name>
<value>03-000</value>
</commentfield>
</entry>
</document>
<document>
<entry number="1">
<commentfield>
<name>FileName</name>
<value>ABC</value>
</commentfield>
</entry>
</document>
...
</index>
The XML contains information about two files, XYZ and ABC. I do not want the final XML to contain the last <document> ... ABC ... </document> element, because ABC is not on our list. I already have this working in a KSH script, but it runs too slowly (over 4 hours for 22,000 files, although it does some other work as well), so I decided to port it to Java for better performance. What I have done so far is read the file line by line into a String; when I hit </document>, I parse out the file name and check whether that file is on our list. If it is, I write the whole <document> ... </document> block to another XML file, then go on to read the next <document>. Is there a better way?
I have since been able to write code that does this with a DOM parser. The code is long, so if you need it, please PM me. Thank you very much for your help.
Comments (4)
"Parsing" an XML input yourself using regex or whatever is a brittle solution that places unnecessary restrictions on the format of the input text (around whitespace and such). There's no need for it when the Java library comes with several XML parsers.
Using DOM might be the easiest way to go, if you can guarantee that your input XML won't grow too large to slurp into memory at once. You can parse the file into a DOM, remove the <document> elements whose file name is not in the list, and write the modified DOM to a new file with a Transformer.
A more efficient option might be StAX, which doesn't require the entire input to be read in at once. I haven't used it, but it can read as well as write documents. You could read one <document> element at a time, and write it back to an output file if it's in the list.
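The StAX idea can be sketched roughly as follows. This is only an illustration under assumptions: the class name StaxDocumentFilter and its filter method are hypothetical, and the code assumes the structure from the question, where the file name is the <value> that follows <name>FileName</name>. Only one <document> is buffered in memory at a time.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper for illustration; not code from the original answer.
class StaxDocumentFilter {

    /** Copies the XML from in to out, dropping every <document> whose FileName is not in keep. */
    static void filter(Reader in, Writer out, Set<String> keep) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);

        List<XMLEvent> buffer = null;             // events of the <document> being read; null outside one
        StringBuilder text = new StringBuilder(); // character data of the element currently open
        String lastName = null;                   // content of the most recent <name> element
        boolean wanted = false;                   // set once a FileName value is found in keep

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            if (buffer == null) {
                boolean startsDocument = event.isStartElement()
                        && event.asStartElement().getName().getLocalPart().equals("document");
                if (startsDocument) {
                    buffer = new ArrayList<>();
                    buffer.add(event);
                    lastName = null;
                    wanted = false;
                } else {
                    writer.add(event); // everything outside <document> is copied through unchanged
                }
                continue;
            }

            buffer.add(event);
            if (event.isStartElement()) {
                text.setLength(0);
            } else if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            } else if (event.isEndElement()) {
                String tag = event.asEndElement().getName().getLocalPart();
                String value = text.toString().trim();
                if (tag.equals("name")) {
                    lastName = value;
                } else if (tag.equals("value") && "FileName".equals(lastName) && keep.contains(value)) {
                    wanted = true;
                } else if (tag.equals("document")) {
                    if (wanted) {
                        for (XMLEvent buffered : buffer) {
                            writer.add(buffered); // replay the whole element to the output
                        }
                    }
                    buffer = null;
                }
                text.setLength(0);
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }
}
```

A call might look like StaxDocumentFilter.filter(reader, writer, names), with reader and writer opened on the input and output files using the encoding the XML declares (ISO-8859-1 in the question's sample).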
Ignoring, for the moment, the details of the best way to parse and rewrite the XML, the basic strategy of reading once through the XML file and looking up each file name in the list seems sound.
However, you might be able to improve the way you check for presence in the list of file names (you don't specify how you're doing that). One possibility is to load the file names into a Set and check for presence in the set, which is an O(1) operation for a hash-based set or O(log N) for a sorted one. Either way, that would be an improvement over a simple linear search through an unsorted list.
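A minimal sketch of the Set lookup, assuming the file names are already available as a collection of strings. The class name FileNameLookup is hypothetical. Swapping HashSet for TreeSet would give the O(log N) variant.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not code from the original answer.
class FileNameLookup {
    private final Set<String> names;

    FileNameLookup(Collection<String> fileNames) {
        // Copy the names once into a hash set so that each later lookup
        // takes constant expected time instead of scanning the whole list.
        this.names = new HashSet<>(fileNames);
    }

    /** Returns true if the given name (ignoring surrounding whitespace) is on the list. */
    boolean contains(String fileName) {
        return names.contains(fileName.trim());
    }
}
```

With 20,000 to 50,000 names, building the set is a one-time cost, and each <document> then needs a single contains call.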
There are multiple ways to approach this.
XSLT would make this very simple. If you have a fixed input list, you can write a transform that selects only the valid elements and outputs them. That way you don't have to write any actual code, and you can use something like xsltproc, which is very fast.
This is what I would try first, because XSLT was created specifically for transforming XML into other XML; it is less code, and less code means less maintenance.
The idea for getting started is a stylesheet that outputs all the <document/> elements whose <value/> element is not equal to ABC. There are plenty of resources and good books on XSLT; all you need to do is provide a whitelist of supported <value/> elements and reverse that logic.
If you have an .xsd, or can create one (your input file doesn't look very complicated), you can use JAXB to generate an object hierarchy automatically and parse the input file into it. You can then walk the resulting object graph, remove anything that doesn't meet your criteria, and marshal it back to a file. JAXB isn't very viable if the file is larger than what will fit into memory.
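The answer's own stylesheet is not shown here, so the following is a separate sketch of the whitelist variant, run from Java through the standard javax.xml.transform API. The class name XsltDocumentFilter, the keep parameter, and the '|' delimiter trick are assumptions made for this illustration; it also assumes no file name contains a '|' character.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;
import java.util.Collection;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; not the stylesheet from the original answer.
class XsltDocumentFilter {

    // XSLT 1.0: an identity transform, overridden for <document> so that an
    // element is copied only when its FileName value occurs in the $keep parameter.
    static final String STYLESHEET = String.join("\n",
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>",
        "  <xsl:param name='keep'/>",
        "  <xsl:template match='@*|node()'>",
        "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>",
        "  </xsl:template>",
        "  <xsl:template match='document'>",
        "    <xsl:variable name='file' select=\"entry/commentfield[name='FileName']/value\"/>",
        "    <xsl:if test=\"contains($keep, concat('|', normalize-space($file), '|'))\">",
        "      <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>",
        "    </xsl:if>",
        "  </xsl:template>",
        "</xsl:stylesheet>");

    /** Transforms in to out, keeping only <document> elements whose FileName is in keep. */
    static void filter(Reader in, Writer out, Collection<String> keep) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Pass the whitelist as one delimited string, e.g. "|XYZ|DEF|".
        transformer.setParameter("keep", "|" + String.join("|", keep) + "|");
        transformer.transform(new StreamSource(in), new StreamResult(out));
    }
}
```

An XSLT 1.0 processor normally builds the whole input tree in memory, so the same size caveat that applies to JAXB applies here. With tens of thousands of names, the substring test against one long string is also slower than a hash lookup.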
You can use XPath to select the elements; if you know the structure of the XML, you can then remove them. Depending on how you are processing your XML, you can use DOM (probably not a good idea for large XML files).
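A rough sketch of the XPath-plus-DOM approach, assuming the /index/document structure from the question and an input small enough to hold in memory. The class name DomDocumentFilter is hypothetical.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not code from the original answer.
class DomDocumentFilter {

    /** Loads the XML into a DOM, removes unwanted <document> elements, and writes the result. */
    static void filter(Reader in, Writer out, Set<String> keep) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(in));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select every <document> under the root <index> element.
        NodeList documents = (NodeList) xpath.evaluate("/index/document", dom, XPathConstants.NODESET);
        for (int i = 0; i < documents.getLength(); i++) {
            Node document = documents.item(i);
            // Relative XPath: the <value> paired with <name>FileName</name>.
            String fileName = xpath.evaluate("entry/commentfield[name='FileName']/value", document).trim();
            if (!keep.contains(fileName)) {
                document.getParentNode().removeChild(document);
            }
        }

        // A Transformer created without a stylesheet copies the DOM to the output as-is.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(dom), new StreamResult(out));
    }
}
```

This keeps the code short at the cost of memory: the whole file is parsed before anything is written.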