How do I read HTML as XML?

Posted on 2024-10-27 01:54:31

I want to extract a couple of links from an HTML page downloaded from the internet, and I think LINQ to XML would be a good fit for that.
My problem is that I can't create an XmlDocument from the HTML. Load(string url) didn't work, so I downloaded the HTML into a string using:

// Requires: using System.IO; using System.Net;
public static string readHTML(string url)
{
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);

    // The using blocks dispose the response and the reader even if reading throws.
    using (HttpWebResponse res = (HttpWebResponse)req.GetResponse())
    using (StreamReader sr = new StreamReader(res.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}

When I try to load that string using LoadXml(string xml), I get this exception:

'--' is an unexpected token. The expected token is '>'

How should I read the HTML so that it can be parsed as XML?

混浊又暗下来 2024-11-03 01:54:31

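HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML, or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform the result to LINQ to XML, or process it directly.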

盛夏尉蓝 2024-11-03 01:54:31

I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:

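// Set up SgmlReader. `reader` is not defined in this sample; it is assumed to be
// a TextReader over the HTML (for example, a StringReader around the downloaded string).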
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
    DocType = "HTML",
    WhitespaceHandling = WhitespaceHandling.All,
    CaseFolding = Sgml.CaseFolding.ToLower,
    InputStream = reader
};

// create document
XmlDocument doc = new XmlDocument()
{
    PreserveWhitespace = true,
    XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
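
Since the question asks about LINQ to XML, the same reader can probably be handed to XDocument.Load instead of XmlDocument.Load. The following is a minimal sketch rather than something taken from the SgmlReader home page: the class HtmlToXml and its two methods are hypothetical names, and it assumes the SgmlReader package is referenced. Elements are matched by local name because the namespace SgmlReader assigns to HTML elements is not confirmed here.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of SgmlReader.
public static class HtmlToXml
{
    // Parses an HTML string into an XDocument by reading it through SgmlReader.
    public static XDocument Parse(string html)
    {
        using (StringReader reader = new StringReader(html))
        using (Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader())
        {
            // Same settings as the sample above.
            sgmlReader.DocType = "HTML";
            sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
            sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
            sgmlReader.InputStream = reader;

            return XDocument.Load(sgmlReader);
        }
    }

    // Returns the non-empty href values of all <a> elements, in document order.
    public static List<string> GetLinkTargets(XDocument doc)
    {
        return doc.Descendants()
            .Where(e => e.Name.LocalName == "a")
            .Select(e => (string)e.Attribute("href"))
            .Where(href => !string.IsNullOrEmpty(href))
            .ToList();
    }
}
```

With the readHTML method from the question, usage would look like HtmlToXml.GetLinkTargets(HtmlToXml.Parse(readHTML(url))).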

时光磨忆 2024-11-03 01:54:31

If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.

This code gets a page from the web and extracts all links:

// ToArray() needs "using System.Linq;". Note that SelectNodes returns null when nothing matches.
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();

This code opens an HTML file from disk and gets the URL of a specific link:

HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
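
HtmlNode can also be queried with LINQ directly, which is close to what the question set out to do with LINQ to XML. Here is a minimal sketch, assuming the HtmlAgilityPack package is referenced; LinkExtractor and ExtractHrefs are hypothetical names. It uses Descendants rather than SelectNodes so that a page without links yields an empty list instead of null.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
public static class LinkExtractor
{
    // Returns the non-empty href values of all <a> elements, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        return document.DocumentNode
            .Descendants("a")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrEmpty(href))
            .ToList();
    }
}
```

The string returned by the readHTML method from the question could be passed straight to ExtractHrefs.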
