How do I read HTML as XML?

Posted on 2024-10-27 01:54:31

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good fit for my case.
My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML to a string using:

public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
    // Dispose the response and the reader even if reading throws.
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(res.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}

When I try to load that string using LoadXml(string xml), I get the exception:

'--' is an unexpected token. The expected token is '>'

How should I read the HTML file into something I can parse as XML?

Comments (5)

混浊又暗下来 2024-11-03 01:54:31

HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML, or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform the result to LINQ to XML, or process it directly.
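
To illustrate the distinction, here is a minimal sketch (not the asker's code) showing that LINQ to XML works directly when the markup is well-formed XHTML, and rejects typical tag soup. The sample markup and the example.com URLs are made up for illustration; only standard System.Xml.Linq calls are used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical sample input: a small, well-formed XHTML document.
string xhtml =
    "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
    "<a href=\"https://example.com/a\">A</a>" +
    "<p><a href=\"https://example.com/b\">B</a></p>" +
    "</body></html>";

// XHTML elements live in the XHTML namespace, so the query must use it.
XNamespace ns = "http://www.w3.org/1999/xhtml";

// Collect the href of every <a> element that has one.
List<string> links = XDocument.Parse(xhtml)
    .Descendants(ns + "a")
    .Select(a => (string)a.Attribute("href"))
    .Where(href => href != null)
    .ToList();

foreach (string href in links)
    Console.WriteLine(href);

// Ordinary HTML (unclosed <br>, unquoted attribute value) is not
// well-formed XML, so the XML parser throws instead of guessing.
bool rejected = false;
try
{
    XDocument.Parse("<html><body><br><a href=page.html>x</a></body></html>");
}
catch (XmlException)
{
    rejected = true;
}
Console.WriteLine("Tag soup rejected: " + rejected);
```

For pages you don't control, the second case is the common one, which is why an HTML parser is the safer first step.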

盛夏尉蓝 2024-11-03 01:54:31

I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:

// setup SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};

// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;

时光磨忆 2024-11-03 01:54:31

If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.

This code gets a page from the web and extracts all links:

HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
// ToArray() is the LINQ extension method, so this needs "using System.Linq;".
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();

This code opens an HTML file from disk and gets the URL of a specific link:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);

污味仙女 2024-11-03 01:54:31

HTML is not XML. HTML is based on SGML, and as such does not guarantee that the markup is well-formed XML (XML is itself a subset of SGML). You can only parse XHTML, i.e. XML-compatible HTML, as XML. But of course that is not the case for most websites.

To work with HTML, you need to use an HTML parser.

浅黛梨妆こ 2024-11-03 01:54:31

If you know which nodes you're interested in, I would use a regex to extract the links from the string.
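
A rough sketch of that idea, assuming the links are plain `<a>` tags with quoted href values. The pattern and the sample markup are illustrative only; a regex like this will miss unquoted attributes and can be fooled by markup inside comments or scripts, so treat it as a quick hack rather than a general solution.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input standing in for the downloaded page.
string html =
    "<p><a href=\"https://example.com/one\">One</a> " +
    "<A class='x' HREF='/two'>Two</A> <a name=\"anchor\">no link</a></p>";

// Matches href="..." or href='...' inside an <a ...> tag, case-insensitively.
// Both alternatives capture into the same named group "url".
Regex hrefPattern = new Regex(
    @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
    RegexOptions.IgnoreCase);

List<string> urls = new List<string>();
foreach (Match m in hrefPattern.Matches(html))
    urls.Add(m.Groups["url"].Value);

foreach (string url in urls)
    Console.WriteLine(url);
```

The `<a name="anchor">` tag is skipped because it has no href before its closing `>`.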
