C# - 是否可以（以及如何）使用 SgmlReader 执行 XSL 转换

发布于 2024-10-04 19:31:42 字数 3443 浏览 3 评论 0原文

我需要使用 XSLT 转换 HTML 网页的内容。因此，我使用 SgmlReader 并编写了如下所示的代码片段（我我想，最后，它也是一个 XmlReader ...）

XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));

XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);

using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";

        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}

尽管如此，我收到了错误消息，

NullReferenceException : Object reference not set to an instance of an object.
   at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
   at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
   at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
   at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
   at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
   at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
   at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)

我找到了一种方法来解决此问题，方法是将 HTML 转换为 >XML，然后应用转换，但这是一个低效的解决方案，因为：

中间 XHTML 输出进入缓冲区，因此需要额外的内存
转换过程需要额外的 CPU > 加工并且相同的层次结构被遍历两次（理论上是不必要的）。

因此（因为我知道 StackOverflow 社区总是提供很好的答案，而其他 C# 论坛却让我完全失望 ;o） 我将寻求反馈以及建议，以便直接使用 HTML 执行 XSL 转换（即使 SgmlReader 需要被另一个类似的库替换）。

原文

I needed to transform the contents of an HTML web page using XSLT
. Hence I used SgmlReader and wrote the snippet shown below (I
thought, in the end, it's an XmlReader too ...)

XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));

XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);

using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";

        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}

Nonetheless , I get thos error message

NullReferenceException : Object reference not set to an instance of an object.
   at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
   at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
   at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
   at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
   at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
   at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
   at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)

I found a way to work around this by converting the HTML to XML and then applying the transform , but that's an inefficient solution because :

Intermediate XHTML output goes to a buffer , so extra memory is needed
Conversion process needs extra CPU processing
and the same hierarchy is traversed twice (in theory unnecessarily).

So (since I know StackOverflow community always provides great answers whereas other C# forums have completely disappointed me ;o) I'll be looking for feedback and suggestions so as to perform XSL transformations using HTML directly (even if SgmlReader needs to be replaced by another similar library).

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

樱桃奶球 2024-10-11 19:31:42

即使 SgmlReader 类扩展了 XmlReader 类，也不意味着它的行为也像 XmlReader 。

从技术上讲，SgmlReader 是 XmlReader 的子类也是没有意义的，因为 SGML 是 XML 的超集而不是子集。

您没有写出转换的目的，但总的来说 HTML Agility Pack 是操作 HTML 的一个不错的选择。

回复收藏 0 原文

╭ゆ眷念 2024-10-11 19:31:42

您是否尝试过使用 HTML Agility Pack 而不是 SgmlReader？您可以将 html 加载到其中，然后直接对其运行转换。不过，我不确定 XML 文档是否是内部创建的 - 尽管看起来好像不是，但您可能希望将内存和 CPU 使用情况与您尝试并放弃的转换方法进行比较。

//You already have your xslt loaded into var xslt...

HtmlDocument doc = new HtmlDocument();
doc.Load( ... );  //load your HTML doc, or use LoadXML from a string, etc  
xslt.Transform(doc, xw);

另请参阅此问题：如何使用 HTML Agility pack

Have you tried using the HTML Agility Pack instead of SgmlReader? You can load the html into it, and run a transform against it directly. I'm not positive if an XML document is created internally, though - although it seems as though one is not you would probably want to compare memory and CPU usage against the conversion method you tried and discarded.

//You already have your xslt loaded into var xslt...

HtmlDocument doc = new HtmlDocument();
doc.Load( ... );  //load your HTML doc, or use LoadXML from a string, etc  
xslt.Transform(doc, xw);

See also this question: How to use HTML Agility pack

回复收藏 0 原文

~没有更多了~