Parsing HTML with HTMLAgilityPack

Posted on 2024-12-18 13:43:48

I have the following HTML that I'm trying to parse using the HTML Agility Pack.

This is a snippet of the whole file that is returned by the code:

<div class="story-body fnt-13 p20-b user-gen">
    <p>text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <div  class="gallery clr bdr aln-c js-no-shadow mod  cld">
        <div>
            <ol>
                <li class="fader-item aln-c ">
                    <div class="imageWrap m10-b">
                       &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                    </div>
                    <p class="caption">caption text</p>
                </li>
            </ol>
        </div>
    </div >
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>

I get this snippet using the following code (which is messy, I know):

string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen"))
        .SelectMany(div => div.Descendants("p"))
        .ToList();
int cn = links.Count;

HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}

The code loops through each p and (for now) appends it to a textbox. Everything works correctly except for the div with the class gallery clr bdr aln-c js-no-shadow mod cld: for that bit of HTML, I get the &#8203; and the caption text in the output.

What's the best way to omit that from the results?

天荒地未老 2024-12-25 13:43:48

XPath is your friend. Try this and forget about that crappy XLinq-style syntax :-)

// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
if (tl != null)
{
    foreach (HtmlAgilityPack.HtmlNode node in tl)
    {
        Console.WriteLine(node.InnerText.Trim());
    }
}

This expression will select all P nodes that don't have any attributes set. See here for other samples: XPath Syntax
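
Note that `//p[not(@*)]` matches attribute-less p elements anywhere in the document, not only inside the story div. If that turns out to be too broad, one option is to scope the XPath to the direct p children of the story container. The following is only a sketch under assumptions: the inline `SampleHtml` is a cut-down stand-in for the real page, `ExtractStoryParagraphs` is a hypothetical helper name, and matching on `contains(@class, 'story-body')` assumes no unrelated element carries a class containing that substring.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ScopedXPathSketch
{
    // Cut-down stand-in for the downloaded page (assumption, not the real site).
    public const string SampleHtml = @"
<div class=""story-body fnt-13 p20-b user-gen"">
  <p>first paragraph</p>
  <div class=""gallery clr bdr aln-c js-no-shadow mod  cld"">
    <p class=""caption"">caption text</p>
  </div>
  <p>second paragraph</p>
</div>
<p>paragraph outside the story</p>";

    // Hypothetical helper: text of each <p> that is a direct child of the story-body div.
    public static List<string> ExtractStoryParagraphs(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var result = new List<string>();
        // The single "/" before p restricts the match to direct children,
        // so the caption <p> nested inside the gallery div is not selected.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(
            "//div[contains(@class, 'story-body')]/p");
        if (nodes == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode p in nodes)
        {
            result.Add(p.InnerText.Trim());
        }
        return result;
    }

    static void Main()
    {
        foreach (string text in ExtractStoryParagraphs(SampleHtml))
        {
            Console.WriteLine(text);
        }
    }
}
```

With the sample markup, the intent is that only the two story paragraphs come back, while the caption and the paragraph outside the story are left out.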

緦唸λ蓇 2024-12-25 13:43:48

It's not quite clear what you're asking. I think you're asking how to get just the direct children of a particular div. If that's the case, then use ChildNodes rather than Descendants. That is:

.SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants does a fully recursive walk of the document tree.
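
For reference, here is how the direct-children idea might look when dropped into the LINQ pipeline from the question. It is a sketch, not a tested fix for the real page: `StoryParagraphs` is a hypothetical helper name, the inline HTML is a shortened stand-in for the downloaded document, and matching the single class token `story-body` (rather than the full class string) is an assumption about what identifies the container. `ChildNodes` is a property in HtmlAgilityPack, so it is used without parentheses. `HtmlEntity.DeEntitize` is added so that entities in the text are decoded before trimming.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class DirectChildrenSketch
{
    // Hypothetical helper: trimmed text of each <p> that is a direct child
    // of a div whose class list includes the token "story-body".
    public static List<string> StoryParagraphs(HtmlDocument document)
    {
        return document.DocumentNode
            .Descendants("div")
            .Where(div => div.GetAttributeValue("class", "")
                             .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                             .Contains("story-body"))
            // ChildNodes only looks one level down, so the <p class="caption">
            // nested inside the gallery div is never visited.
            .SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))
            .Select(p => HtmlEntity.DeEntitize(p.InnerText).Trim())
            .ToList();
    }

    static void Main()
    {
        // Shortened stand-in for the page from the question (assumption).
        var document = new HtmlDocument();
        document.LoadHtml(@"
<div class=""story-body fnt-13 p20-b user-gen"">
  <p>text before the gallery</p>
  <div class=""gallery clr bdr aln-c js-no-shadow mod  cld"">
    <div class=""imageWrap m10-b"">&#8203;<img src=""http://www.domain.com/picture.png"" alt=""alt text"" /></div>
    <p class=""caption"">caption text</p>
  </div>
  <p>text after the gallery</p>
</div>");

        foreach (string text in StoryParagraphs(document))
        {
            Console.WriteLine(text);
        }
    }
}
```

Because the gallery div is skipped as a whole, neither the `&#8203;` in front of the image nor the caption should reach the output.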
