Transforming one XML document into another with StAX takes a very long time
I'm using the following code to transform a large XML stream into another stream:
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXResult;
import javax.xml.transform.stax.StAXSource;

public class TryMe
{
    public static void main (final String[] args)
    {
        XMLInputFactory inputFactory = null;
        XMLEventReader eventReaderXSL = null;
        XMLEventReader eventReaderXML = null;
        XMLOutputFactory outputFactory = null;
        XMLEventWriter eventWriter = null;
        Source XSL = null;
        Source XML = null;
        inputFactory = XMLInputFactory.newInstance();
        outputFactory = XMLOutputFactory.newInstance();
        inputFactory.setProperty("javax.xml.stream.isSupportingExternalEntities", Boolean.TRUE);
        inputFactory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.TRUE);
        inputFactory.setProperty("javax.xml.stream.isReplacingEntityReferences", Boolean.TRUE);
        try
        {
            eventReaderXSL = inputFactory.createXMLEventReader("my_template",
                    new InputStreamReader(TryMe.class.getResourceAsStream("my_template.xsl")));
            eventReaderXML = inputFactory.createXMLEventReader("big_one", new InputStreamReader(
                    TryMe.class.getResourceAsStream("big_one.xml")));
        }
        catch (final javax.xml.stream.XMLStreamException e)
        {
            System.out.println(e.getMessage());
        }
        // get a TransformerFactory object
        final TransformerFactory transfFactory = TransformerFactory.newInstance();
        // define the Source object for the stylesheet
        try
        {
            XSL = new StAXSource(eventReaderXSL);
        }
        catch (final javax.xml.stream.XMLStreamException e)
        {
            System.out.println(e.getMessage());
        }
        Transformer tran2 = null;
        // get a Transformer object
        try
        {
            tran2 = transfFactory.newTransformer(XSL);
        }
        catch (final javax.xml.transform.TransformerConfigurationException e)
        {
            System.out.println(e.getMessage());
        }
        // define the Source object for the XML document
        try
        {
            XML = new StAXSource(eventReaderXML);
        }
        catch (final javax.xml.stream.XMLStreamException e)
        {
            System.out.println(e.getMessage());
        }
        // create an XMLEventWriter object
        try
        {
            eventWriter = outputFactory.createXMLEventWriter(new OutputStreamWriter(System.out));
        }
        catch (final javax.xml.stream.XMLStreamException e)
        {
            System.out.println(e.getMessage());
        }
        // define the Result object
        final Result XML_r = new StAXResult(eventWriter);
        // call the transform method
        try
        {
            tran2.transform(XML, XML_r);
        }
        catch (final javax.xml.transform.TransformerException e)
        {
            System.out.println(e.getMessage());
        }
        // clean up
        try
        {
            eventReaderXSL.close();
            eventReaderXML.close();
            eventWriter.close();
        }
        catch (final javax.xml.stream.XMLStreamException e)
        {
            System.out.println(e.getMessage());
        }
    }
}
my_template is something like this:
<xsl:stylesheet version = '1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
    <xsl:preserve-space elements="*"/>
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@k8[parent::point]">
        <xsl:attribute name="k8">
            <xsl:value-of select="'xxxxxxxxxxxxxx'"/>
        </xsl:attribute>
    </xsl:template>
</xsl:stylesheet>
and the XML is a very long list of point elements:
<data>
    <point .... k8="blablabla" ... ></point>
    <point .... k8="blablabla" ... ></point>
    <point .... k8="blablabla" ... ></point>
    ....
    <point .... k8="blablabla" ... ></point>
</data>
If I use an identity transformer (transfFactory.newTransformer() instead of transfFactory.newTransformer(XSL)), output is produced while the input stream is still being processed. With my template, however, that does not happen: the transformer reads all of the input and only then starts to produce output. With a large stream, of course, an out-of-memory error very often arrives before any result.
Any ideas? I'm freaking out. I can't understand what's wrong with my code or my XSLT.
Many thanks in advance!
Comments (6)
XSLT 1.0 and 2.0 operate on a tree data model of the complete XML document, so XSLT 1.0 and 2.0 processors usually read the complete input document into a tree and create a result tree that is then serialized. You seem to assume that using StAX changes the behaviour of XSLT, but I don't think that is the case: the XSLT processor builds the tree because the stylesheet could require complex XPath navigation along axes such as preceding or preceding-sibling.
However, as you use Java, you could look into Saxon 9.3 and its experimental XSLT 3.0 streaming support; that way you should not run out of memory when processing very large XML input documents.
The part of your XSLT that is unusual is
<xsl:template match="@k8[parent::point]">
which is usually simply written as
<xsl:template match="point/@k8">
but you would need to test with your XSLT processor whether that changes performance.
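To check that the simpler `point/@k8` pattern behaves like the original one, a small test harness along these lines may help. It is only a sketch: the class name `PointK8PatternDemo` and the tiny inline input are made up for illustration, and it uses whatever XSLT processor the JDK provides by default. It says nothing about memory use, since the processor may still build the whole tree.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies the question's stylesheet, rewritten
// with the match pattern "point/@k8", to a small in-memory document.
class PointK8PatternDemo {
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='point/@k8'>"
        + "<xsl:attribute name='k8'>xxxxxxxxxxxxxx</xsl:attribute>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML text and returns the serialized result.
    static String transform(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Made-up sample input with the same shape as the question's data.
        System.out.println(transform(
                "<data><point a='1' k8='blablabla'></point><point k8='blablabla'></point></data>"));
    }
}
```

Timing this against the original `@k8[parent::point]` pattern on a realistic file would show whether the rewrite makes any measurable difference with a given processor.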
Using XSLT is probably not the best approach: as others have pointed out, your solution requires the processor to read the entire document into memory before writing any output. You might wish to consider using a SAX parser to read each node sequentially, perform any transformation required (using a data-driven mapping if necessary) and write out the transformed data. This avoids creating an entire document tree in memory and could be significantly faster, as you are not building a complex document just to write it out.
Ask yourself whether the output format is simple and stable, and then reconsider the use of XSLT. For large datasets of regular data, you might also wish to consider whether XML is a good file format for transferring the information.
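One possible shape for the SAX approach is sketched below. It assumes the only change needed is the one in the question's stylesheet (replacing the `k8` attribute on `point` elements), and it reuses the identity transformer, which the question reports as producing output incrementally, purely as a serializer. The class name `SaxK8Filter` and the `rewrite` helper are invented for this illustration.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical SAX filter: passes every event through unchanged, except that
// the k8 attribute of <point> elements (no namespace) gets a new value.
class SaxK8Filter extends XMLFilterImpl {
    private final String newValue;

    SaxK8Filter(XMLReader parent, String newValue) {
        super(parent);
        this.newValue = newValue;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (uri.isEmpty() && "point".equals(localName)) {
            int index = atts.getIndex("", "k8");
            if (index >= 0) {
                // Copy the attribute list, because the parser may reuse its own instance.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(index, newValue);
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    // Parses `in` with a namespace-aware SAX parser, filters the events, and
    // serializes them to `out` through an identity transformer.
    static void rewrite(Reader in, Writer out, String newValue) throws Exception {
        SAXParserFactory parsers = SAXParserFactory.newInstance();
        parsers.setNamespaceAware(true);
        XMLReader parser = parsers.newSAXParser().getXMLReader();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(
                new SAXSource(new SaxK8Filter(parser, newValue), new InputSource(in)),
                new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        StringWriter out = new StringWriter();
        // Made-up sample input with the same shape as the question's data.
        rewrite(new StringReader("<data><point a='1' k8='blablabla'></point></data>"),
                out, "xxxxxxxxxxxxxx");
        System.out.println(out);
    }
}
```

For a real file, the `Reader` and `Writer` would wrap file or network streams instead of strings; whether memory stays flat should be verified with the actual processor in use.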
If you are finding that it takes too long for this work to complete, then you need to redesign your approach to your task to avoid reading in the entire input file before you start to process the output file. There is nothing that can be tweaked with your code to make it magically faster - you need to address the core of your algorithm.
How complex is the transformation you are doing with XSL? Could you make the same transformation using StAX alone?
With StAX it is quite easy to write a parser that matches a particular node and then inserts, alters or removes nodes in the output stream you are writing to at that point. So instead of using XSL for the transform, you could perhaps use StAX alone. This way you benefit from the streaming nature of the API (no large tree is buffered in memory), so there should be no memory issue.
Coincidentally, this recent answer to another question might help you with that.
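A StAX-only version of the question's transformation might look roughly like the sketch below. It assumes the goal is exactly what the stylesheet does (copy everything, but replace the `k8` attribute on `point` elements); the class name `StaxK8Rewriter` and its methods are hypothetical names chosen for this example.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copies XML events from reader to writer one at a time,
// replacing the k8 attribute on <point> elements along the way.
class StaxK8Rewriter {
    private static final QName POINT = new QName("point");
    private static final QName K8 = new QName("k8");

    static void rewrite(Reader in, Writer out, String newValue) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory events = XMLEventFactory.newInstance();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement() && POINT.equals(event.asStartElement().getName())) {
                    event = withK8Replaced(events, event.asStartElement(), newValue);
                }
                // Each event is written as soon as it is read; no tree is built.
                writer.add(event);
            }
            writer.flush();
        } finally {
            reader.close();
            writer.close();
        }
    }

    // Builds a new start-element event with the same name and namespaces,
    // where only the k8 attribute (if present) carries the new value.
    private static StartElement withK8Replaced(
            XMLEventFactory events, StartElement start, String newValue) {
        List<Attribute> attributes = new ArrayList<>();
        Iterator<?> it = start.getAttributes();
        while (it.hasNext()) {
            Attribute attribute = (Attribute) it.next();
            attributes.add(K8.equals(attribute.getName())
                    ? events.createAttribute(K8, newValue)
                    : attribute);
        }
        return events.createStartElement(
                start.getName(), attributes.iterator(), start.getNamespaces());
    }

    public static void main(String[] args) throws XMLStreamException {
        StringWriter out = new StringWriter();
        // Made-up sample input with the same shape as the question's data.
        rewrite(new StringReader("<data><point a='1' k8='blablabla'></point></data>"),
                out, "xxxxxxxxxxxxxx");
        System.out.println(out);
    }
}
```

Because each event is handed to the writer before the next one is read, memory use should stay roughly constant regardless of input size, though that is worth confirming with the StAX implementation actually on the classpath.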
As others have pointed out, using StAX won't change the way XSLT works: it reads everything first before starting any work.
If you need to work with very large files, you'll have to use something other than XSLT.
Then there are different options:
Try Apache XSLTC for better performance: it uses code generation to simplify transforms.
Your XSLT transform looks really simple, and so does your input format. Surely you can do manual StAX/SAX processing and gain a really good performance increase.