Can't figure out how to read a specific part of a web page with Html Agility Pack

Posted 2024-12-29 15:49:09 · 5634 characters · 0 views · 0 comments

I am trying to read a specific part of a website (www.joindota.com) that has the same values all over. I'll explain what I want to do with examples from the site:

The following HTML is a portion of what I want to read from the site:

<div id="matchticker_coverage_content_1761" style="display:none;">
    <a href="http://www.joindota.com/en/matches/16102-team-dignitas-dota-vs-sk-gaming-dota" class="item">
        <div class="sub" style="width: 18px; text-align: left;"><img src="http://www.gs-media.de/img/themes/joindota/ticker_9.png" border="0" alt="" /></div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/ro.gif" border="0" alt="ro" title="Romania" /> Digni</div>
        <div class="sub" style="width: 20px;">vs.</div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/dk.gif" border="0" alt="dk" title="Denmark" /> SK</div>
        <div class="sub" style="float: right; text-align: right;">
            <span title="Sun, 29.01.2012, 16:00 CET">tomorrow</span>
        </div>
        <div class="cl"></div>
    </a>
    <a href="http://www.joindota.com/en/matches/16101-world-elite-vs-mineski" class="item">
        <div class="sub" style="width: 18px; text-align: left;"><img src="http://www.gs-media.de/img/themes/joindota/ticker_9.png" border="0" alt="" /></div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/cn.gif" border="0" alt="cn" title="China" /> WE</div>
        <div class="sub" style="width: 20px;">vs.</div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/ph.gif" border="0" alt="ph" title="Philippines" /> Mski</div>
        <div class="sub" style="float: right; text-align: right;">
            <span title="Sun, 29.01.2012, 14:00 CET">tomorrow</span>
        </div>
        <div class="cl"></div>
    </a>
    ....
</div>

I want to read everything from <div id="matchticker_coverage_content_1761" >

I just need to read all the values within the <div> tags that I supplied there. For example, it would output:

  • Digni vs. SK
  • WE vs. Mski
  • EG vs. Fnatic
  • etc.

All the inner <div> elements look the same within that HTML. I just need to know how to "select" <div id="matchticker_coverage_content_1761" > specifically in the page and read the other divs inside it, namely these:

<div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/ro.gif" border="0" alt="ro" title="Romania" /> Digni</div>
<div class="sub" style="width: 20px;">vs.</div>
<div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/dk.gif" border="0" alt="dk" title="Denmark" /> SK</div>

All the <div> values are the same; all I am interested in is the text within them, such as Digni, vs., and SK.

I just need to read all of those values within the <div id="matchticker_coverage_content_1761" > </div>

The reason is that the site has many of these blocks, but I only need to read a specific one. Here is another part of the same page that is identical, except that the outer div containing all the other divs is different.

Example:

<div id="matchticker_coverage_content_1596" style="display:none;">
    <a href="http://www.joindota.com/en/matches/16564-westernwolves-vs-panzer" class="item">
        <div class="sub" style="width: 18px; text-align: left;"><img src="http://www.gs-media.de/img/themes/joindota/ticker_9.png" border="0" alt="" /></div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/fr.gif" border="0" alt="fr" title="France" /> Wolves</div>
        <div class="sub" style="width: 20px;">vs.</div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/de.gif" border="0" alt="de" title="Germany" /> PANZER</div>
        <div class="sub" style="float: right; text-align: right;">
            <span title="Tue, 31.01.2012, 21:00 CET">31.01.</span>
        </div>
        <div class="cl"></div>
    </a>
    <a href="http://www.joindota.com/en/matches/16626-panzer-vs-just-4-the-tournament" class="item">
        <div class="sub" style="width: 18px; text-align: left;"><img src="http://www.gs-media.de/img/themes/joindota/ticker_9.png" border="0" alt="" /></div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/de.gif" border="0" alt="de" title="Germany" /> PANZER</div>
        <div class="sub" style="width: 20px;">vs.</div>
        <div class="sub" style="width: 103px;"><img src="http://www.gs-media.de/img/flags/de.gif" border="0" alt="de" title="Germany" /> J4T</div>
        <div class="sub" style="float: right; text-align: right;">
            <span title="Sun, 29.01.2012, 19:00 CET">tomorrow</span>
        </div>
        <div class="cl"></div>
    </a>
    ....
</div>

Notice how all the inner <div> elements are exactly the same as before? Only the outer <div> that contains them differs: here it is <div id="matchticker_coverage_content_1596" style="display:none;">, whereas in the other part of the page it is <div id="matchticker_coverage_content_1761" style="display:none;">.

My ultimate question is, how do I select that beginning <div> that holds the other <div>, and read those specific ones I mentioned earlier?

Comments (1)

海的爱人是光 2025-01-05 15:49:10

Web-crawling/spidering, whether it's of semantic HTML or not, for specific data points (as opposed to a general relevance search like Google), is more art than science.

More often than not, you have to tailor your crawler to each site you want to crawl, because you want the same data points from every site but each site represents them differently.

With that in mind, it's usually an exercise in spotting the patterns that will allow you to identify the data points consistently within a certain site.

I've taken the liberty of chopping down your HTML samples and formatting them to show the hierarchy of tags; this doesn't affect how the page is parsed or displayed, as it's about what's in the tags, not outside of them.

With that rearrangement, the patterns should appear.

Let's take identifying the container <div> elements first. The thing that uniquely identifies these <div> elements is the id attribute; they are all of the form:

<div id="matchticker_coverage_content_**some number**" style="display:none;">

(Note: you could look for any <div> element with a style="display:none;" attribute, but that's very brittle and doesn't uniquely identify the container; that attribute could be applied anywhere else and has no semantic meaning.)

Unfortunately, the id attribute is a problem, because it seems that the end of it is a number that is an id of some sort, and not consistent throughout the pages. If you knew the id of the container, you could just use the following expression with Html Agility Pack:

//div[@id='matchticker_coverage_content_1596']

But I imagine that you don't know it.
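
A minimal sketch of that known-id lookup, assuming the HtmlAgilityPack NuGet package is referenced. The names `KnownIdLookup` and `FindContainer` and the trimmed sample markup are illustrative only, not taken from the site.

```csharp
using System;
using HtmlAgilityPack;

public static class KnownIdLookup
{
    // Returns the <div> with the given id, or null when the page has no such element.
    public static HtmlNode FindContainer(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath attribute tests use the @ prefix.
        // Assumption: the id contains no single quote, so simple concatenation is safe.
        return doc.DocumentNode.SelectSingleNode("//div[@id='" + id + "']");
    }

    public static void Main()
    {
        const string html =
            "<div id=\"matchticker_coverage_content_1761\"><a class=\"item\">first</a></div>" +
            "<div id=\"matchticker_coverage_content_1596\"><a class=\"item\">second</a></div>";

        HtmlNode container = FindContainer(html, "matchticker_coverage_content_1596");
        Console.WriteLine(container == null ? "not found" : container.InnerText);
    }
}
```

Checking for null matters here: `SelectSingleNode` returns null rather than throwing when nothing matches.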

What you really want is the ability to look for all <div> elements where the id attribute starts with matchticker_coverage_content_.

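Html Agility Pack doesn't support this CSS-style selector syntax on its own; its SelectNodes method takes XPath, where the equivalent test is starts-with(@id, 'matchticker_coverage_content_'). jQuery, however, does support it with the following syntax: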

div[id^='matchticker_coverage_content_']

What's even better is that the fizzler project does support this selector. So in that case, I'd use fizzler to get that container.
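
As a rough sketch, the prefix match could look like this with Fizzler's `QuerySelectorAll` extension method. The same match is also expressible in XPath 1.0 with `starts-with()`, which `SelectNodes` accepts, so both are shown side by side. This assumes the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack packages are referenced; `PrefixLookup` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack; // adds QuerySelectorAll to HtmlNode

public static class PrefixLookup
{
    private const string Prefix = "matchticker_coverage_content_";

    // CSS attribute-prefix match through Fizzler.
    public static List<string> IdsViaCss(HtmlDocument doc)
    {
        return doc.DocumentNode
            .QuerySelectorAll("div[id^='" + Prefix + "']")
            .Select(div => div.GetAttributeValue("id", ""))
            .ToList();
    }

    // The same match through XPath; SelectNodes returns null when nothing matches.
    public static List<string> IdsViaXPath(HtmlDocument doc)
    {
        var nodes = doc.DocumentNode.SelectNodes(
            "//div[starts-with(@id, '" + Prefix + "')]");
        return nodes == null
            ? new List<string>()
            : nodes.Select(div => div.GetAttributeValue("id", "")).ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"matchticker_coverage_content_1761\"></div>" +
            "<div id=\"other\"></div>" +
            "<div id=\"matchticker_coverage_content_1596\"></div>");

        Console.WriteLine(string.Join(", ", IdsViaCss(doc)));
        Console.WriteLine(string.Join(", ", IdsViaXPath(doc)));
    }
}
```

Either route returns every container on the page, so you still have to decide which one you want, for example by position or by the content it holds.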

Once you have the container, it's a matter of looking through its child elements. Again, after the edit, it should be obvious that each matchup you are looking for (with the relevant <div> elements) is contained inside an anchor (i.e. <a>) element. So once you have the container <div>, you can simply select all of its child elements that are anchor elements with the following syntax:

./a

(or just the selector a on the container node if using fizzler)
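
A small sketch of pulling the match anchors out of a container node, assuming the HtmlAgilityPack package; `MatchLinks` and `Hrefs` are illustrative names, and the markup is cut down from the question. The XPath is written relative to the container (`./a`) because an expression that starts with `/` is evaluated from the document root.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class MatchLinks
{
    // Returns the href of every <a> that is a direct child of the container.
    public static List<string> Hrefs(HtmlNode container)
    {
        // "./a" is relative to the container node.
        var anchors = container.SelectNodes("./a");
        return anchors == null
            ? new List<string>()
            : anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div id=\"matchticker_coverage_content_1761\">" +
            "<a href=\"http://www.joindota.com/en/matches/16102-team-dignitas-dota-vs-sk-gaming-dota\" class=\"item\"></a>" +
            "<a href=\"http://www.joindota.com/en/matches/16101-world-elite-vs-mineski\" class=\"item\"></a>" +
            "</div>");

        var container = doc.DocumentNode.SelectSingleNode(
            "//div[@id='matchticker_coverage_content_1761']");

        foreach (var href in Hrefs(container))
            Console.WriteLine(href);
    }
}
```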

Once you have that, you don't really need to detect the vs.; you can assume it's there. What you really want to detect is the players.

These are harder, because there's nothing semantic about the tags, the classes, or the ids. However, there is a discriminator. Looking at the player tags (I've chopped some of this down to make it more clear):

<div class="sub">
    <img src="http://www.gs-media.de/img/themes/joindota/ticker_9.png" 
        border="0" alt="" /></div>

<div class="sub">
    <img src="http://www.gs-media.de/img/flags/ro.gif" 
        border="0" alt="ro" title="Romania" /> Digni</div>

<div class="sub" style="width: 20px;">vs.</div>

<div class="sub" style="width: 103px;">
    <img src="http://www.gs-media.de/img/flags/dk.gif" 
        border="0" alt="dk" title="Denmark" /> SK</div>

You can see that the players are in <div> tags which have a child <img> tag where the alt attribute is not empty (this is important, as you don't want to process the first <div> element).

Once you identify those <img> tags, you can simply get the parent node (the <div>) and take the text from the node to get your player. The first one is the first player side, the second one you process is the second player side.
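
The non-empty-alt rule can be sketched as a single relative XPath on each match anchor. This assumes the HtmlAgilityPack package; `PlayerExtractor` and `Players` are hypothetical names, and the sample is one trimmed match from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class PlayerExtractor
{
    // Returns the text of each <div> whose child <img> has a non-empty alt attribute.
    public static List<string> Players(HtmlNode matchAnchor)
    {
        // The ticker icon has alt="", so its <div> is filtered out by the predicate.
        var divs = matchAnchor.SelectNodes("./div[img[@alt!='']]");
        return divs == null
            ? new List<string>()
            : divs.Select(d => HtmlEntity.DeEntitize(d.InnerText).Trim()).ToList();
    }

    public static void Main()
    {
        const string html = @"
<a href=""#"" class=""item"">
  <div class=""sub""><img src=""ticker_9.png"" border=""0"" alt="""" /></div>
  <div class=""sub""><img src=""ro.gif"" border=""0"" alt=""ro"" title=""Romania"" /> Digni</div>
  <div class=""sub"">vs.</div>
  <div class=""sub""><img src=""dk.gif"" border=""0"" alt=""dk"" title=""Denmark"" /> SK</div>
  <div class=""cl""></div>
</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var anchor = doc.DocumentNode.SelectSingleNode("//a");
        Console.WriteLine(string.Join(" vs. ", Players(anchor)));
    }
}
```

Selecting the parent `<div>` directly with a predicate on its `<img>` child avoids a separate step of walking back up from each image.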

An alternate approach would be to identify the <div> element that contains the "vs." text and then look at its siblings: the one before it is the first player, and the one after it is the second player.
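
A sketch of this sibling-based variant, again assuming the HtmlAgilityPack package and using hypothetical names (`VsSiblingExtractor`, `Matchup`). It walks the anchor's child `<div>` elements only, so the whitespace text nodes between them are ignored.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class VsSiblingExtractor
{
    private static string Clean(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Returns "first vs. second", or null when the expected layout is not found.
    public static string Matchup(HtmlNode matchAnchor)
    {
        // Elements("div") yields only the direct <div> children, in document order.
        var divs = matchAnchor.Elements("div").ToList();
        int vsIndex = divs.FindIndex(d => Clean(d) == "vs.");

        // The "vs." div needs a div on each side of it.
        if (vsIndex <= 0 || vsIndex >= divs.Count - 1)
            return null;

        return Clean(divs[vsIndex - 1]) + " vs. " + Clean(divs[vsIndex + 1]);
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<a class=\"item\">" +
            "<div class=\"sub\"><img alt=\"\" /></div>" +
            "<div class=\"sub\"><img alt=\"fr\" /> Wolves</div>" +
            "<div class=\"sub\">vs.</div>" +
            "<div class=\"sub\"><img alt=\"de\" /> PANZER</div>" +
            "</a>");

        Console.WriteLine(Matchup(doc.DocumentNode.SelectSingleNode("//a")) ?? "layout not recognized");
    }
}
```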

Note that the last step is very brittle, and it will always be brittle because there are no semantic indicators in the tags. You're essentially depending on an implementation detail (because you have no other choice).

I strongly recommend that you have test cases around certain pages where you parse the content and verify the data; this way, if the structure of the page changes, you will know immediately and can change your scraping logic accordingly.
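
One possible shape for such a check, sketched without any test framework: parse a saved copy of the page (here a tiny inline fixture) and compare the result with what you expect. The names `ScraperCheck` and `ExtractMatchups` and the fixture are assumptions for illustration; the XPath expressions are XPath 1.0 passed to Html Agility Pack's `SelectNodes`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ScraperCheck
{
    // Hypothetical fixture; in practice this would be a saved copy of the real page.
    private const string Fixture = @"
<div id=""matchticker_coverage_content_1761"" style=""display:none;"">
  <a href=""#"" class=""item"">
    <div class=""sub""><img alt="""" /></div>
    <div class=""sub""><img alt=""ro"" /> Digni</div>
    <div class=""sub"">vs.</div>
    <div class=""sub""><img alt=""dk"" /> SK</div>
  </a>
</div>";

    public static List<string> ExtractMatchups(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        var anchors = doc.DocumentNode.SelectNodes(
            "//div[starts-with(@id, 'matchticker_coverage_content_')]/a");
        if (anchors == null) return result;

        foreach (var anchor in anchors)
        {
            var players = anchor.SelectNodes("./div[img[@alt!='']]");
            if (players == null || players.Count != 2) continue; // unexpected layout, skip
            result.Add(players[0].InnerText.Trim() + " vs. " + players[1].InnerText.Trim());
        }
        return result;
    }

    public static void Main()
    {
        var actual = ExtractMatchups(Fixture);
        var expected = new[] { "Digni vs. SK" };

        if (!actual.SequenceEqual(expected))
            throw new Exception("Page structure may have changed; got: " + string.Join(", ", actual));

        Console.WriteLine("ok");
    }
}
```

Running the same extraction against a freshly downloaded page on a schedule would surface structural changes early, at the cost of depending on the live site.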
