How can I make XmlDocument handle XML with unquoted attributes?

Posted on 2024-12-03 09:50:45

I have an ASP.NET VB project that needs to parse some raw XML coming out of a database. The XML is laid out like this:

<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006–  </A>; Postdoctoral Fellow, Toronto Western Hosp. 2000–06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. & Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>

And the code-behind I'm using is this:

        Dim FullBio As New System.Xml.XmlDocument
        Dim NodeList As System.Xml.XmlNodeList
        Dim Node As System.Xml.XmlNode

        FullBio.LoadXml(bio.Item(11))
        NodeList = FullBio.SelectNodes("a")

        For Each Node In NodeList
            Dim name = Node.Attributes.GetNamedItem("name").Value()
            lblEducation.Text = lblEducation.Text + name.ToString() + Node.InnerText + "<br />"
        Next

So the XML loaded into the XmlDocument at

FullBio.LoadXml(bio.Item(11))

is the XML I provided at the top. I am getting this error message:

'SN' is an unexpected token. The expected token is '"' or '''. Line 1, position 49.

I know that the error occurs because the attribute values are not quoted. Is there any way to make XmlDocument accept the attributes anyway, or an easy way to use a regular expression to add quotes to the attributes before loading the string into the XmlDocument?


Comments (3)

诠释孤独 2024-12-10 09:50:45

What you have is invalid XML. An XmlDocument expects its input to be well-formed XML. I would recommend using an HTML parser such as Html Agility Pack to parse the HTML (which is what your input actually is). For example, if you wanted to list the name attribute values of all anchors, it's as simple as this:

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            Console.WriteLine("Name: {0}", a.Attributes["name"].Value);
        }
    }
}

So要识趣 2024-12-10 09:50:45

I would write some logic to insert quotes around the attribute values. Loading the document will raise errors if the XML isn't well-formed.

You can use the Html2Xhtml library for this. Here is a link:

http://corsis.sourceforge.net/index.php/Html2Xhtml

And you should be able to use the library to put the contents into an XDocument, like this:

string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";

var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);

Console.WriteLine(xdoc);

I believe Html2Xhtml supports .NET Framework 2.0 and above; if not, I'm fairly sure one of the previous versions does. Failing that, you can use this:

http://www.codeproject.com/KB/XML/HTML2XHTML.aspx

This article uses HTML Tidy, and the source code from this article should work in 2.0.
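
For completeness, the quote-insertion logic mentioned at the start of this answer could be sketched roughly as below. This is only an illustration under stated assumptions, not a general HTML fixer: the AttributeQuoter class, its QuoteAttributes method and the shortened sample string are made up for this example, and the regular expressions assume attribute values contain no '<' or '>' characters.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of any library.
static class AttributeQuoter
{
    // An opening tag such as <A name=SN> (closing tags start with "</" and are skipped).
    static readonly Regex Tag = new Regex(@"<[A-Za-z][^<>]*>");

    // One attribute: name=value, where value is double-quoted, single-quoted, or bare.
    static readonly Regex Attr = new Regex(@"(\s[\w:.-]+)=(""[^""]*""|'[^']*'|[^\s""'<>=]+)");

    public static string QuoteAttributes(string markup)
    {
        return Tag.Replace(markup, tag =>
            Attr.Replace(tag.Value, attr =>
            {
                string value = attr.Groups[2].Value;
                // Leave values that are already quoted untouched.
                bool quoted = value[0] == '"' || value[0] == '\'';
                return attr.Groups[1].Value + "=" + (quoted ? value : "\"" + value + "\"");
            }));
    }
}

class Program
{
    static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string raw = "<HTML><BODY><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A></BODY></HTML>";

        var doc = new XmlDocument();
        doc.LoadXml(AttributeQuoter.QuoteAttributes(raw));

        // XPath is case-sensitive, and "a" on its own only matches direct children
        // of the context node, so search all descendants for upper-case A elements.
        foreach (XmlNode a in doc.SelectNodes("//A[@name]"))
            Console.WriteLine("{0}: {1}", a.Attributes["name"].Value, a.InnerText);
        // prints "SN: AARTS" then "GN: Michelle Marie"
    }
}
```

Note that quoting alone may not be enough for the full sample: XML predefines only five entities, so HTML entities such as &nbsp;, or a bare & as in "Bd. of Dirs. & Fundraising Chair", would still make LoadXml throw. That is one reason a real HTML parser is usually the safer route.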

无人问我粥可暖 2024-12-10 09:50:45

You can also try SgmlReader, which is great for this kind of problem.

using (var strReader = new StringReader(html))
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = strReader;

        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
    }
}
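
Once the document is loaded this way, the name attributes from the question could be read along these lines. This is a sketch that assumes the SgmlReader package is referenced and that the resulting elements carry no XML namespace; the shortened html string is a stand-in for the database value. Because CaseFolding.ToLower lower-cases element names, the XPath uses a lower-case a.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml; // third-party SgmlReader library, assumed to be referenced

class Program
{
    static void Main()
    {
        // Shortened stand-in for the markup in the question.
        string html = "<HTML><BODY><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A></BODY></HTML>";

        var doc = new XmlDocument();
        using (var strReader = new StringReader(html))
        using (var sgmlReader = new SgmlReader())
        {
            sgmlReader.DocType = "HTML";
            sgmlReader.CaseFolding = CaseFolding.ToLower;
            sgmlReader.InputStream = strReader;
            doc.Load(sgmlReader);
        }

        // "//a[@name]" finds every anchor with a name attribute at any depth.
        foreach (XmlNode a in doc.SelectNodes("//a[@name]"))
            Console.WriteLine("{0}: {1}", a.Attributes["name"].Value, a.InnerText);
    }
}
```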