HTML C# html-content-extraction html-agility-pack

鉴于我将网页源存储在字符串变量中，如何在 C# 中读取 HTML 文档？

发布于 2024-12-18 00:04:46 字数 2687 浏览 1 评论 0原文

我曾尝试自己做这件事但做不到。

我有一个 html 文档，我正在尝试将其中所有图片的地址提取到 ac# 集合中，但我不确定语法。我正在使用 HTMLAgilityPack...这是我到目前为止所拥有的。请指教。

HTML 代码如下：

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

C# 代码如下：

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

我尝试使用 HtmlAgilityPack 文档，但该包缺乏示例，并且有关其方法和类的信息对于我来说真的很难理解，如果没有例子。

原文

I have tried to do this on my own but couldn't.

I have an html document, and I'm trying to extract the addresses for all the pictures in it into a c# collection and I'm not sure of the syntax. I'm using HTMLAgilityPack... Here is what I have so far. Please advise.

The HTML Code is the following:

<div style='padding-left:12px;' id='myWeb123'>
<b>MyWebSite Pics</b>
<br /><br />
<img src="http://myWebSite.com/pics/HHTR_01.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_02.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_03.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_04.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_05.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_06.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_07.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_08.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_09.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<img src="http://myWebSite.com/pics/HHTR_10.jpg" alt='myWebSitePics' title='myWebSitePics' /><br /><br />
<a href="http://www.myWebSite.com/" target="_blank" rel="nofollow">Source</a>
</div>

And the c# code is the following:

HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();

document.Load("FileName.html");

// Targets a specific node
HtmlNode someNode = document.GetElementbyId("myWeb123");

//HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//a[@href]");

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

if (linkNodes != null)
{
    int count = 0;
    foreach(HtmlNode linkNode in linkNodes)
    {

        string linkTitle = linkNode.GetAttributeValue("src", string.Empty);

        Debug.Print("linkTitle = " + linkTitle);

        if (linkTitle == string.Empty)
        {
            HtmlNode imageNode = linkNode.SelectSingleNode("img[@alt]");
            if (imageNode != null)
            {
                Debug.Print("imageNode = " + imageNode.Attributes.ToString());
            }
        }
        count++;
        Debug.Print("count = " + count);
    }
}

I tried to use the HtmlAgilityPack Documentation but this pack lacks examples and the information about its methods and classes are really hard for me to understand without examples.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

离笑几人歌 2024-12-25 00:04:46

试试这个，抱歉，如果它无法构建，我已经根据您的情况覆盖了我们的代码

List<string> result = new List<string>();
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[@src]"))
{
    HtmlAttribute att = link.Attributes["src"];

    string temp = att.Value;
    string urlValue;
    do
    {
        urlValue = temp;
        temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
    } while (temp != urlValue);

    result.Add(temp);
}

try this, sorry if it will not be buildable, I have overwritten our code to your situation

List<string> result = new List<string>();
foreach (HtmlNode link in document.DocumentNode.SelectNodes("//img[@src]"))
{
    HtmlAttribute att = link.Attributes["src"];

    string temp = att.Value;
    string urlValue;
    do
    {
        urlValue = temp;
        temp = HttpUtility.UrlDecode(HttpUtility.HtmlDecode(urlValue));
    } while (temp != urlValue);

    result.Add(temp);
}

回复收藏 0 原文

东风软 2024-12-25 00:04:46

您可以使用 Load 的重载，它需要一个 TextReader：（

document.Load(new StringReader(text));

我没有查看代码的其余部分，但这解决了“如果我已经将 HTML 放入字符串中了吗？”部分。）

You can use the overload of Load which takes a TextReader:

document.Load(new StringReader(text));

(I haven't looked over the rest of the code, but that addresses the "what do I do if I've already got the HTML in a string?" part.)

回复收藏 0 原文

梦回旧景 2024-12-25 00:04:46

在此行中：

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

您选择的是

节点，而不是其下的节点。试试这个来选择那些 img 节点：

HtmlNodeCollection linkNodes = document.DocumentNode
     .SelectNodes("//div[@id='myWeb123']/img");

至于选择语法，它与 XML 中使用的 XPath 相同。因此，如果您想要选择的示例，请搜索 XPath。

在这种情况下：

前导 / 从文档的根开始搜索（而不是从某个“当前节点”），
// 意味着下一个匹配可以位于任何深度而不是直接在根
div[@id='myWeb123'] 下搜索属性“id”且值为“myWeb123”的
节点'
这/img 搜索匹配的 div 节点正下方的 img 节点。

In this line:

HtmlNodeCollection linkNodes = document.DocumentNode.SelectNodes("//div[@id='myWeb123']");

you are selecting the <div> node, not the <img> nodes under it. Try this to select those img nodes:

HtmlNodeCollection linkNodes = document.DocumentNode
     .SelectNodes("//div[@id='myWeb123']/img");

As for the selection syntax, it's identical to XPath as used in XML. So search for XPath if you want examples of the selection.

In this case:

the leading / starts searching from the root of the document (instead of from some "currect node")
the // means that the next match can be at any depth instead of directly under the root
div[@id='myWeb123'] searches for a <div> node with an attribute 'id' that has value 'myWeb123'
the /img searches for an img node directly under the matched div node.

回复收藏 0 原文

粉红×色少女 2024-12-25 00:04:46

如果页面大小增长，像这样使用 Xpath 的成本将会很高。
最好的方法是将 html 反序列化为对象。您也不需要使用您正在使用的 Htmlagility 参考。使用streamreader加载HTML并使用Xmlserializer
使用XSD工具，首先转换为xsd，然后从xsd工具生成一个类

1)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c /language:CS c:\xtest.xml

Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.xsd'.

2)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c  xtest.xsd
Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.cs'.

，将该类导入到您的解决方案中，

html col = new html();
StreamReader reader = new StreamReader("c:\\test.html"); 
XmlSerializer ser = new XmlSerializer(typeof(html));
col = (html)ser.Deserialize(reader);

然后col对象将一次性包含所有img标签的src。

Using Xpath like this will be expensive if the page size grows.
Best is to deserialize the html to an object. You also dont need to use the Htmlagility reference that you are using. Load the HTML using streamreader and the use Xmlserializer
Use XSD tool , first to convert to xsd and then generate a class from the xsd tool

1)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c /language:CS c:\xtest.xml

Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.xsd'.

2)
C:\Program Files\Microsoft Visual Studio 9.0\VC>xsd /c  xtest.xsd
Microsoft (R) Xml Schemas/DataTypes support utility
[Microsoft (R) .NET Framework, Version 2.0.50727.3038]
Copyright (C) Microsoft Corporation. All rights reserved.
Writing file 'C:\Program Files\Microsoft Visual Studio 9.0\VC\xtest.cs'.

Import this class to your solution

html col = new html();
StreamReader reader = new StreamReader("c:\\test.html"); 
XmlSerializer ser = new XmlSerializer(typeof(html));
col = (html)ser.Deserialize(reader);

The col object then will contain all the src of the img tags in one shot.

回复收藏 0 原文

~没有更多了~