C# - Is it possible (and how) to perform XSL transformations using SgmlReader?
I need to transform the contents of an HTML web page using XSLT, so I used SgmlReader and wrote the snippet shown below (I assumed that, in the end, it's an XmlReader too...):
XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));
XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);
using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";
        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}
Nonetheless, I get this error message:
NullReferenceException : Object reference not set to an instance of an object.
at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)
I found a way to work around this by converting the HTML to XML first and then applying the transform, but that's an inefficient solution because:
- The intermediate XHTML output goes to a buffer, so extra memory is needed.
- The conversion needs extra CPU processing, and the same hierarchy is traversed twice (in theory unnecessarily).
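For reference, the two-pass workaround described above can be sketched roughly as follows. This is only an illustration under assumptions: `TwoPassSketch` and `TransformViaDom` are hypothetical names, an `XmlDocument` stands in for the intermediate buffer, and `Main` feeds a plain `XmlReader` over a well-formed string so the sketch runs without SgmlReader. In the real scenario the `SgmlReader` instance would be passed in as `source`.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

static class TwoPassSketch
{
    // Same stylesheet as in the question, written with single-quoted attributes.
    const string Stylesheet =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>" +
        "<xsl:output method='xml' encoding='UTF-8' version='1.0' />" +
        "<xsl:template match='/'>" +
        "<XXX xmlns:xsd='http://www.w3.org/2001/XMLSchema'><xsl:value-of select='count(//br)' /></XXX>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    // Hypothetical helper: pass 1 buffers the whole input in a DOM,
    // pass 2 runs the transform against that DOM.
    public static string TransformViaDom(XmlReader source)
    {
        var xslt = new XslCompiledTransform();
        using (var xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xslReader);
        }

        // Pass 1: the extra memory cost mentioned above comes from this buffer.
        var dom = new XmlDocument();
        dom.Load(source);

        // Pass 2: the transform traverses the buffered tree a second time.
        var sb = new StringBuilder();
        using (var xw = XmlWriter.Create(sb, xslt.OutputSettings))
        {
            xslt.Transform(dom, xw);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Stand-in input; with SgmlReader, pass the configured reader instead.
        const string xhtml = "<html><body><p>a<br/>b<br/>c</p></body></html>";
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            Console.WriteLine(TransformViaDom(reader));
        }
    }
}
```

The cost the question objects to is visible in the two passes: the whole document is materialized in `dom` before the stylesheet sees any of it.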
So (since I know the StackOverflow community always provides great answers, whereas other C# forums have completely disappointed me ;o) I'm looking for feedback and suggestions on how to perform XSL transformations on HTML directly (even if SgmlReader needs to be replaced by another similar library).
Comments (2)
Even though the SgmlReader class extends the XmlReader class, that doesn't mean it behaves like an XmlReader. Technically, it also doesn't make sense for SgmlReader to be a subclass of XmlReader, simply because SGML is a superset of XML and not a subset. You didn't describe the purpose of your transformation, but in general HTML Agility Pack is a good option for manipulating HTML.
Have you tried using the HTML Agility Pack instead of SgmlReader? You can load the HTML into it and run a transform against it directly. I'm not positive whether an XML document is created internally, though. It appears that one is not, but you would probably want to compare memory and CPU usage against the conversion method you tried and discarded. See also this question: How to use HTML Agility pack
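A rough sketch of what that suggestion might look like is given below. It assumes the HtmlAgilityPack package is referenced, that the navigator returned by `HtmlDocument.CreateNavigator()` is accepted by `XslCompiledTransform` as an `IXPathNavigable` input, and that element names come out lower-cased so that `//br` matches. The class name `HapTransformSketch` and the sample HTML are made up for illustration, and the behaviour should be verified against the package version in use.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;
using HtmlAgilityPack; // external package, assumed to be installed

static class HapTransformSketch
{
    static void Main()
    {
        // Same idea as the stylesheet in the question: count the <br> elements.
        const string stylesheet =
            "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>" +
            "<xsl:template match='/'>" +
            "<XXX><xsl:value-of select='count(//br)' /></XXX>" +
            "</xsl:template>" +
            "</xsl:stylesheet>";

        var xslt = new XslCompiledTransform();
        using (var xslReader = XmlReader.Create(new StringReader(stylesheet)))
        {
            xslt.Load(xslReader);
        }

        // Illustrative, deliberately non-well-formed HTML (unclosed <br> and <p>).
        var html = new HtmlDocument();
        html.LoadHtml("<html><body><p>a<br>b<br>c</body></html>");

        var sb = new StringBuilder();
        using (var xw = XmlWriter.Create(sb, xslt.OutputSettings))
        {
            // The transform runs over the Agility Pack's own navigator,
            // without serializing the page to an XHTML string first.
            xslt.Transform(html.CreateNavigator(), null, xw);
        }
        Console.WriteLine(sb.ToString());
    }
}
```

Note that the Agility Pack still parses the whole page into its own in-memory tree, so this removes the intermediate XHTML serialization rather than the buffering itself. Measuring memory and CPU against the two-pass approach, as suggested above, remains the way to decide.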