Partially parsing an XML file in Java without using an XML parser
So I found out that it is possible to use a buffered reader/writer to copy an XML file word for word to a new XML file. However, I was wondering whether it would be possible to scrape out only a portion of the document.
For example, take this sample:
<?xml version="1.0" encoding="UTF-8"?>
<BookCatalogue xmlns="http://www.publishing.org">
<w:pStyle w:val="TOAHeading" />
<Book>
<Title>Yogasana Vijnana: the Science of Yoga</Title>
<Author>Dhirendra Brahmachari</Author>
<Date>1966</Date>
<ISBN>81-40-34319-4</ISBN>
<Publisher>Dhirendra Yoga Publications</Publisher>
<Cost currency="INR">11.50</Cost>
</Book>
<Book>
<Title>The First and Last Freedom</Title>
<v:imagedata r:id="rId7" o:title="" croptop="10523f" cropbottom="11721f" />
<Author>J. Krishnamurti</Author>
<Date>1954</Date>
<ISBN>0-06-064831-7</ISBN>
<Publisher>Harper &amp; Row</Publisher>
<Cost currency="USD">2.95</Cost>
</Book>
<w:pStyle w:val="TOAHeading2" />
</BookCatalogue>
Sorry if this is not proper XML; I just added tidbits from the document I was looking at to this sample I found. Basically, I want to look for an instance of "heading" (in this case, the 3rd line -> TOAHeading), then scrape everything from that heading down until another instance of "heading" is found, and copy it to another XML file. Is that possible? Furthermore, I would like to write to a temporary file and only keep that file if an instance of "image" (in this case, the 14th line) is found. Is that possible as well? I'm trying to do this in the simplest way possible, so does anyone have any ideas or experience with this? Thanks in advance.
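import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class IPDriver
{
    public static void main(String[] args) throws IOException
    {
        // java.io provides FileInputStream, FileOutputStream and OutputStreamWriter;
        // there are no FileInputStreamReader, FileOutputStreamReader or OutputStreamReader classes.
        BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/document.xml"), "UTF-8"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("C:/Documents and Settings/user/workspace/Intern Project/Proposals/Converted Proposals/Extracted Items/ProposalOne/word/tempdocument.xml"), "UTF-8"));
        String line = null;
        while ((line = reader.readLine()) != null)
        {
            writer.write(line);
            // readLine() strips the line terminator, so write one back.
            writer.newLine();
        }
        // Close to unlock.
        reader.close();
        // Close to unlock and flush to disk.
        writer.close();
    }
}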
Example From My Actual XML Document
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="address">
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="Street">
<w:r w:rsidRPr="00822244">
<w:t>6841 Benjamin Franklin Drive</w:t>
</w:r>
</w:smartTag>
</w:smartTag>
</w:p>
<w:p w:rsidR="00B41602" w:rsidRPr="00822244" w:rsidRDefault="00B41602" w:rsidP="007C3A42">
<w:pPr>
<w:pStyle w:val="Address" />
</w:pPr>
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="City">
<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="place">
Just your basic document.xml file from a .docx
Comments (4)
You will probably want to read about Java XML parsers. There are two types: SAX parsers and DOM parsers.
SAX parsers are event based, meaning that the parser scans the XML file for you and calls a set of callback methods that you have defined, such as startElement() and endElement(). Because they do not hold the whole document in memory, SAX parsers are efficient for very large XML files.
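As a rough illustration of the callback style described above, here is a minimal sketch that collects the w:val attribute of every w:pStyle element. The class name SaxStyleScan, the method styleValues and the sample XML (including the urn:example:w namespace URI) are made up for this example; it relies on the default SAXParserFactory, which is not namespace aware, so elements are matched by their prefixed name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original question.
class SaxStyleScan {
    // Returns the w:val attribute of every w:pStyle element, in document order.
    static List<String> styleValues(String xml) throws Exception {
        List<String> found = new ArrayList<>();
        // The default factory is not namespace aware, so qName is the prefixed name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                // Called once for each opening tag the parser encounters.
                if ("w:pStyle".equals(qName)) {
                    found.add(attrs.getValue("w:val"));
                }
            }
        });
        return found;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; the namespace URI is a placeholder.
        String xml = "<doc xmlns:w=\"urn:example:w\">"
                + "<w:pStyle w:val=\"TOAHeading\"/><Book/>"
                + "<w:pStyle w:val=\"TOAHeading2\"/></doc>";
        System.out.println(styleValues(xml));
    }
}
```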
DOM parsers read the entire XML document into memory, after which you can query the resulting DOM object by calling methods like getElementsByTagName("w:pStyle"). DOM parsers tend to be a bit easier to work with, but they use more memory than SAX parsers.
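For comparison, a minimal DOM sketch of the getElementsByTagName("w:pStyle") query mentioned above. The class DomStyleQuery, the method countStyles and the sample input are assumptions made for illustration; the default DocumentBuilderFactory is not namespace aware, which is why the prefixed tag name can be used directly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question.
class DomStyleQuery {
    // Counts the w:pStyle elements whose w:val attribute equals the given value.
    static int countStyles(String xml, String val) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // The whole document is parsed into an in-memory tree here.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        NodeList styles = doc.getElementsByTagName("w:pStyle");
        int count = 0;
        for (int i = 0; i < styles.getLength(); i++) {
            Element style = (Element) styles.item(i);
            if (val.equals(style.getAttribute("w:val"))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input; the namespace URI is a placeholder.
        String xml = "<doc xmlns:w=\"urn:example:w\">"
                + "<w:pStyle w:val=\"TOAHeading\"/>"
                + "<w:pStyle w:val=\"Address\"/></doc>";
        System.out.println(countStyles(xml, "TOAHeading"));
    }
}
```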
There will be a bit of a learning curve, but these are the standard ways of processing XML in Java. There are also libraries designed to simplify the standard APIs, such as JDOM.
I've seen a lot of technically-correct suggestions, but your request (when taken as-written) suggests to me that you have the following requirements:
If I understood your requirements, you basically want to do a totally unstructured parse of a very structured piece of data (XML markup). In that case, an XML parser, XSLT, a DOM parser, or anything else written against the XML spec is going to be a pain in the ass to mangle to your needs.
You'll need to do a case-insensitive scan of your document contents until you get your match, then pull all the characters between that match and an ending match.
If the documents aren't huge (say 1 MB or smaller), just read the whole thing into memory as a String and either do a really quick and dirty "indexOf" for the different cased versions of what you want, OR read the whole thing into a char[] and write some more efficient scanning code that does a case-insensitive match for the starting value you want to begin parsing at.
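A minimal sketch of that in-memory scan, assuming the whole document already sits in a String. The class TextSlice and its methods are hypothetical names; String.regionMatches with ignoreCase set to true does the case-insensitive comparison, so there is no need to try each cased variant by hand.

```java
// Hypothetical helper class, not part of the original answer.
class TextSlice {
    // Case-insensitive indexOf: first index >= from where needle occurs, or -1.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i <= text.length() - needle.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text from the first occurrence of startMarker up to (but not
    // including) the next occurrence of endMarker after it, ignoring case.
    // Returns null when either marker is missing.
    static String between(String content, String startMarker, String endMarker) {
        int start = indexOfIgnoreCase(content, startMarker, 0);
        if (start < 0) {
            return null;
        }
        int end = indexOfIgnoreCase(content, endMarker, start + startMarker.length());
        if (end < 0) {
            return null;
        }
        return content.substring(start, end);
    }

    public static void main(String[] args) {
        // Made-up sample text; both markers are the word "heading".
        String content = "aaa HEADING one bbb Heading two";
        System.out.println(between(content, "heading", "heading"));
    }
}
```

Note that the slice starts in the middle of a tag if the marker sits inside an attribute value, which is exactly the kind of unstructured result this approach trades for simplicity.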
If I misunderstood your requirement and it is actually much more structured than it sounded in your description above, then please use one of the other suggestions that is more focused on true XML parsing. I am just putting this solution out there in the off chance that it was as random as you made it out to seem.
(NOTE: I'm not saying it's BAD, just never seen that request before. You have your own reasons for needing to do that and we'll just try and help ;)
The proper way to do this would be to use an XSLT transform that emits everything except what you don't want. This is just what XSLT is meant to do.
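To make the "emit everything but what you don't want" idea concrete, here is a small sketch using the javax.xml.transform API that ships with the JDK. The stylesheet is an identity transform plus one empty template. The rule it applies (drop every Book element that has no imagedata child), the class name XsltFilter and the un-namespaced sample input are assumptions chosen for illustration; a real document.xml uses namespaced elements, and the stylesheet would need the matching prefix declarations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the original answer.
class XsltFilter {
    // Identity transform: copy every node and attribute unchanged, except for
    // nodes matched by the more specific empty template, which are dropped.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output omit-xml-declaration='yes'/>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Assumed rule for this sketch: drop Book elements without an imagedata child.
      + "<xsl:template match='Book[not(imagedata)]'/>"
      + "</xsl:stylesheet>";

    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up, un-namespaced sample input.
        String xml = "<Catalogue>"
                + "<Book><Title>A</Title></Book>"
                + "<Book><Title>B</Title><imagedata/></Book>"
                + "</Catalogue>";
        System.out.println(transform(xml));
    }
}
```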
Don't parse this by hand; it will lead to failure. And definitely don't even think of using regular expressions; that will lead to epic failure.
If you can't comprehend XSLT (it is a paradigm shift from procedural coding), ask for help here, or fall back on a traditional XML parsing library. For your use case you will probably have to use some DOM-based parser; I prefer JDOM.
If you are sure that your XML looks like this, you can simply compare each line with <w:pStyle w:val="TOAHeading" />, and then start outputting the following lines until you find a line that matches <w:pStyle w:val="TOAHeading2" />.
But why would you do this? It is fragile to any formatting changes. Use an XML parser (and an XML writer); it makes life much easier.
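For completeness, the line-by-line comparison described above might look like the sketch below. It also folds in the "only keep it if an image is found" condition from the question. The class HeadingSlice, the method extract and the sample lines are hypothetical, and the approach only works when each marker sits on a line of its own, which is the fragility this answer warns about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original answer.
class HeadingSlice {
    // Returns the lines from the first line containing startMarker up to (but not
    // including) the next line containing endMarker. The section is returned only
    // if some line after the start line contains mustContain; otherwise the
    // result is an empty list.
    static List<String> extract(List<String> lines, String startMarker,
                                String endMarker, String mustContain) {
        List<String> section = new ArrayList<>();
        boolean inSection = false;
        boolean keep = false;
        for (String line : lines) {
            if (!inSection) {
                if (line.contains(startMarker)) {
                    inSection = true;
                    section.add(line);
                }
                continue;
            }
            if (line.contains(endMarker)) {
                break; // reached the next heading
            }
            section.add(line);
            if (line.contains(mustContain)) {
                keep = true;
            }
        }
        return keep ? section : new ArrayList<>();
    }

    public static void main(String[] args) {
        // Made-up sample lines modelled on the question's example.
        List<String> lines = Arrays.asList(
                "<w:pStyle w:val=\"TOAHeading\" />",
                "<Book>",
                "<v:imagedata r:id=\"rId7\" />",
                "</Book>",
                "<w:pStyle w:val=\"TOAHeading2\" />",
                "<Book>");
        for (String line : extract(lines, "w:val=\"TOAHeading\"",
                "w:val=\"TOAHeading2\"", "imagedata")) {
            System.out.println(line);
        }
    }
}
```

In a real run the lines would come from reader.readLine() and the kept section would be written out with a BufferedWriter, so no temporary file is needed unless the section is too large to hold in memory.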