Unable to separate cells correctly with simplehtmldom

Posted 2024-07-27 20:02:08

I am trying to write a web scraper. I want to get all the cells in a row. The row before the one I want has THOROUGHBRED MEETINGS as its plain text value, and I can find that row successfully. But I can't figure out how to get the next row's children, which are the cells (<td> tags).

if ($foundTag = FindTagByText("THOROUGHBRED MEETINGS", $html))
{
    $cell = $foundTag->parent();
    $row = $cell->parent();
    $nextRow = $row->next_sibling();
    echo "Row: ".$row->plaintext."<br />\n";
    echo "Next Row: ".$nextRow->plaintext."<br />\n";
    $cells = $nextRow->children();

    foreach ($cells as $cell)
    {
        echo "Cell: ".$cell->plaintext."<br />\n";
    }
}

function FindTagByText($text, $html)
{
    // Use Simple_HTML_DOM special selector 'text'
    // to retrieve all text nodes from the document
    $textNodes = $html->find('text');
    $foundTag = null;

    foreach($textNodes as $textNode) 
    {
        if($textNode->plaintext == $text) 
        {
            // Get the parent of the text node
            // (A text node is always a child of
            //  its container)
            $foundTag = $textNode->parent();
            break;
        }
    }

    return $foundTag;
}

Here is the html I am trying to parse:

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</b></font>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</a></font></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</b></font></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</b></font></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</b></font></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</b></font></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</b></font></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</b></font></a></td>
<td bgcolor="#ffffff" colspan=4> </td>
</tr>

Here is my output. Note that the fourth cell swallows the rest of the table instead of containing just "1":

Row: THOROUGHBRED MEETINGS
Next Row: BR SUNSHINE COAST FINE/DEAD R1@12:30pm 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12
Cell: BR SUNSHINE COAST
Cell: FINE/DEAD
Cell: R1@12:30pm
Cell: 1 2 3 4 5 6 7 8   CR NEW ZEALAND FINE/DEAD R3@11:10am 1 2 3 4 5 6 7 8 9   DR HOBART OCAST/HVY R1@12:15pm 1 2 3 4 5 6 7   MR CRANBOURNE OCAST/SLOW R1@12:20pm 1 2 3 4 5 6 7 8   NR COFFS HARBOUR OCAST/SLOW R1@12:45pm 1 2 3 4 5 6 7 8   SR MORUYA FINE/GOOD R1@12:25pm 1 2 3 4 5 6 7 8   VR BENALLA OCAST/SLOW R1@12:35pm 1 2 3 4 5 6 7 8   XR KALGOORLIE FINE/GOOD R1@ 3:00pm 1 2 3 4 5 6 7     HARNESS MEETINGS DT LAUNCESTON SHWRY/GOOD R1@ 4:57pm 1 2 3 4 5 6 7 8 9 10   MT CRANBOURNE OCAST/GOOD R1@ 5:05pm 1 2 3 4 5 6 7 8     GREYHOUND MEETINGS AD GAWLER OCAST/GOOD R1@ 5:10pm 1 2 3 4 5 6 7 8 9 10 11   CD CANBERRA OCAST/GOOD R1@ 5:02pm 1 2 3 4 5 6 7 8 9 10 11   MD SALE FINE/GOOD R1@ 4:54pm 1 2 3 4 5 6 7 8 9 10 11 12 

Comments (3)

2024-08-03 20:02:08

You will not like my answer.

Unfortunately, it seems that mismatched closing tags in the HTML you are parsing are confusing Simple_HTML_DOM. Take a look at this snippet:

<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>

Follow the order in which the tags in this snippet are opened:

  • <td> is opened
  • <a> is opened
  • <b> is opened
  • <font> is opened

Tags should be closed in the reverse of the order in which they were opened (</font>, </b>, </a>, </td>), but this is how they are actually closed:

  • </b> is closed
  • </font> is closed
  • </a> is closed
  • </td> is closed

The HTML you are trying to scrape is full of those mistakes, as well as closing tags for tags which were never opened. Simple_HTML_DOM doesn't parse such files properly.
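
To see both kinds of mistake concretely, here is a minimal sketch of a nesting check built on a tag stack. The function name findMisnestedTags is made up for this illustration and is not part of Simple_HTML_DOM. The regex tokenizer is a deliberate simplification: it ignores comments, scripts, and any ">" inside attribute values.

```php
<?php
// Hypothetical helper (not part of any library): lists closing tags that
// break proper nesting in an HTML fragment.
function findMisnestedTags(string $html): array
{
    // Tags that never take a closing tag; a short, incomplete list.
    $voidTags = ['br', 'hr', 'img', 'input', 'meta', 'link'];
    $stack = [];     // names of the tags that are currently open
    $problems = [];  // human-readable findings

    // Capture every tag: group 1 is "/" for a closing tag, group 2 the name.
    preg_match_all('~<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>~', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as [, $slash, $name]) {
        $name = strtolower($name);
        if (in_array($name, $voidTags, true)) {
            continue;
        }
        if ($slash === '') {
            $stack[] = $name; // opening tag: remember it
            continue;
        }

        // Closing tag: look for the most recent open tag with the same name.
        $index = array_search($name, array_reverse($stack, true), true);
        if ($index === false) {
            $problems[] = "</$name> has no open <$name> to close";
            continue;
        }
        if ($index !== count($stack) - 1) {
            $problems[] = "</$name> closes before </" . end($stack) . ">";
        }
        // Drop the matched tag and everything opened after it.
        $stack = array_slice($stack, 0, $index);
    }

    return $problems;
}

$snippet = '<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>';
foreach (findMisnestedTags($snippet) as $problem) {
    echo $problem, "\n";
}
```

Run against the snippet above, it flags </b> as closing while <font> is still open, and then </font> as having nothing left to close.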

I'm afraid that if you cannot modify the HTML at its source, you'll have to parse the file manually and correct the errors yourself.


As a note, I've tested your code against the following corrected HTML, and Simple_HTML_DOM parsed it successfully, and your code worked just fine.

<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>

</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td><b>FINE/DEAD</b></td>
<td><font color=#cc0000><b>R1</font></b>@<b>12:30pm</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</a></b></font></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</font></b></a></td>
<td align=center><a href="odds?mting=BR03000"><b><font color=black>3</font></b></a></td>

<td align=center><a href="odds?mting=BR04000"><b><font color=black>4</font></b></a></td>
<td align=center><a href="odds?mting=BR05000"><b><font color=black>5</font></b></a></td>
<td align=center><a href="odds?mting=BR06000"><b><font color=black>6</font></b></a></td>
<td align=center><a href="odds?mting=BR07000"><b><font color=black>7</font></b></a></td>
<td align=center><a href="odds?mting=BR08000"><b><font color=black>8</font></b></a></td>
<td bgcolor="#ffffff" colspan=4> </td>
</tr>

Edit: As an alternative, you might want to check whether DOMDocument::loadHTML gives better results. It is available in PHP 5 without external libraries. Check the official documentation.
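
A minimal sketch of that alternative, assuming the PHP DOM extension is enabled. The XPath expression and the shortened sample markup (wrapped in a <table> for the example) are illustrative assumptions, not code from the question.

```php
<?php
// Shortened copy of the markup from the question, mis-nested tags included.
$markup = <<<'HTML'
<table>
<tr valign=top>
<td colspan=16 bgcolor=#999999><b>THOROUGHBRED MEETINGS</b></td>
</tr>
<tr valign=top bgcolor="#ffffff">
<td><b>BR</b> <a href="meeting?mtg=br&day=today&curtype=0">SUNSHINE COAST</a></td>
<td>FINE/DEAD</b></td>
<td align=center bgcolor=#cc0000><a href="odds?mting=BR01000"><b><font color=#ffffff>1</a></font></td>
<td align=center><a href="odds?mting=BR02000"><b><font color=black>2</b></font></a></td>
</tr>
</table>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

// Find the row whose text is the heading, then take the cells of the next row.
$xpath = new DOMXPath($doc);
$query = '//tr[normalize-space(.) = "THOROUGHBRED MEETINGS"]/following-sibling::tr[1]/td';

$cells = [];
foreach ($xpath->query($query) as $td) {
    $cells[] = trim($td->textContent);
}

foreach ($cells as $cell) {
    echo "Cell: ", $cell, "\n";
}
```

On this sample each <td> should come back as its own cell. The call to libxml_use_internal_errors(true) only keeps the parser's warnings about the bad markup out of the output; it does not change how the document is repaired.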

三人与歌 2024-08-03 20:02:08

You'll get the first td like this:

$firstTD = $row->first_child();

After that you can get the subsequent ones with:

$firstTD->next_sibling()

記憶穿過時間隧道 2024-08-03 20:02:08

I got it to work by first loading the page into a DOMDocument, which corrects the malformed HTML.

$url = "http://www.acttab.com.au/interbet/venues?day=today";

$doc = new DOMDocument();
$doc->loadHTMLFile($url);

// Serialize the repaired document and hand it to Simple_HTML_DOM
$html = str_get_html($doc->saveHTML());