Extracting the contents of the first TD in each table row

Posted 2024-09-27 17:05:42

I've got some HTML that looks like this:

<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-even">
    <td align="center">ijkl</td>
    <td align="center"><a href="deluserconfirm.html?user=ijkl"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>

And I need to retrieve the values abcde, efgh, and ijkl.

This is the regex I'm currently using:

preg_match_all('/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs', $html, $matches);

Yes, I'm not very good at them. As with most of my regex attempts, this is not working. Can anyone tell me why?

Also, I know about html/xml parsers, but it would require a significant code revisit to make that happen. So that's for later. We need to stick with regex for now.

EDIT: To clarify, I need the values between the first <td align="center"></td> tag after either <tr class="row-even"> or <tr class="row-odd">

Comments (6)

醉城メ夜风 2024-10-04 17:05:42
~<tr class="row-(even|odd)">\s*<td align="center">(.*?)</td>~m

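Notice the use of \s*, which absorbs the newline and indentation between the two tags. (The m modifier only changes where ^ and $ match, so it has no effect on this particular pattern.)

Also, you can make the first group non-capturing via ?:, i.e. (?:even|odd), since you're probably not interested in the class attribute :)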
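
For illustration, here is a minimal sketch of how that pattern could be dropped into preg_match_all. The $html sample is the question's markup cut down to two rows, and $names is just an illustrative variable name. The m modifier is left out because the pattern has no ^ or $ anchors for it to affect.

```php
<?php
// Sample rows copied from the question (shortened to two rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// (?:even|odd) is non-capturing, so the cell text is the only captured group.
$pattern = '~<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)</td>~';
preg_match_all($pattern, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured cell text.
$names = $matches[1];
print_r($names);
```

With the non-capturing group the wanted values sit in $matches[1]; with the capturing (even|odd) version they would be in $matches[2].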

泪眸﹌ 2024-10-04 17:05:42

Try this:

preg_match_all('/(?:<tr class="row-even">|<tr class="row-odd">)\s*<td align="center">(.*?)<\/td>/s', $html, $matches);

Changes made:

  • You hadn't accounted for the whitespace (a newline plus
    indentation) between the tags; \s* absorbs it.
  • You don't need the x modifier: it makes the regex ignore the
    literal spaces in the pattern, so <tr class=...> can never match.
  • The match is made non-greedy by using
    .*? in place of .*.
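
To make those three points concrete, here is a rough sketch that counts matches for the question's pattern, a greedy variant, and a non-greedy variant. The sample $html is the question's markup cut down to two rows, the variable names are illustrative, and both variants use \s* between the tags as an assumption about how the whitespace is handled.

```php
<?php
// Sample rows copied from the question (shortened to two rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// The question's pattern. Under /x the literal spaces are ignored, so PCRE
// effectively looks for "<trclass=..." and finds nothing.
$original = '/(<tr class="row-even">|<tr class="row-odd">)<td align="center">(.*)<\/td><\/tr>/xs';
$originalCount = preg_match_all($original, $html, $m0);

// Greedy .* with /s runs to the LAST </td> in the input: one giant match.
$greedy = '/<tr class="row-(?:even|odd)">\s*<td align="center">(.*)<\/td>/s';
$greedyCount = preg_match_all($greedy, $html, $m1);

// Non-greedy .*? stops at the first </td>: one match per row.
$lazy = '/<tr class="row-(?:even|odd)">\s*<td align="center">(.*?)<\/td>/s';
$lazyCount = preg_match_all($lazy, $html, $m2);

printf("original: %d, greedy: %d, lazy: %d\n", $originalCount, $greedyCount, $lazyCount);
print_r($m2[1]);
```

Only the non-greedy variant yields one capture per row.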


唔猫 2024-10-04 17:05:42

Actually, you don't need to change much in your codebase. Fetching text nodes with DOM and XPath always works the same way; the only thing that changes is the XPath expression. So you could wrap the DOM code in a function that replaces your preg_match_all call. That would be just a tiny change, e.g.

include_once "dom.php";
$matches = dom_match_all('//tr/td[1]', $html);

where dom.php just contains:

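// dom.php
// Runs an XPath query against an HTML string and returns the text of each matching node.
function dom_match_all($query, $html, array $matches = array()) {
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of emitting them for sloppy markup.
    libxml_use_internal_errors(TRUE);
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xPath = new DOMXPath($dom);
    foreach( $xPath->query($query) as $node ) {
        $matches[] = $node->nodeValue;
    }
    return $matches;
}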

and would return

Array
(
    [0] => abcde
    [1] => efgh
    [2] => ijkl
)

But if you want a Regex, use a Regex. I am just giving ideas.
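
If the page also contains rows you don't want, the XPath can be narrowed to the two row classes from the question. The following is only a sketch: first_cells is a hypothetical helper name, and wrapping the fragment in a <table> element is an assumption made so the parser sees the rows in a valid context.

```php
<?php
// Sample rows copied from the question (shortened to two rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// Hypothetical helper: text of the first <td> of every row-even / row-odd row.
function first_cells(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    // Assumption: the input is a bare run of <tr> rows, so give it a table parent.
    $dom->loadHTML('<table>' . $html . '</table>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//tr[@class="row-even" or @class="row-odd"]/td[1]';

    $cells = [];
    foreach ($xpath->query($query) as $node) {
        $cells[] = trim($node->textContent);
    }
    return $cells;
}

print_r(first_cells($html));
```

The td[1] step picks the first cell of each row, which is what the regex answers approximate with \s* between the two opening tags.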

穿越时光隧道 2024-10-04 17:05:42

This is what I came up with:

<td align="center">([^<]+)</td>

I'll explain. One of the challenges here is that what's between the tags could be either the text you're looking for or an <a> tag. In the regex, [^<]+ says to match one or more characters that are not the < character. That's great, because it means the cell holding the <a> tag won't match, and the group will only match up to the point where the closing </td> tag is found.
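
A quick sketch of that behaviour, assuming the question's markup cut down to two rows ($html and $names are illustrative names):

```php
<?php
// Sample rows copied from the question (shortened to two rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// [^<]+ needs at least one non-"<" character right after the opening tag,
// so the cells that start with <a ...> are skipped.
preg_match_all('~<td align="center">([^<]+)</td>~', $html, $matches);

$names = $matches[1];
print_r($names);
```

One thing to keep in mind: this relies on the link cells starting with a tag, and it would also pick up any other plain-text cell with the same attributes.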

我不是你的备胎 2024-10-04 17:05:42

Disclaimer: Using regexps to parse HTML is dangerous.

To get the inner HTML of the first TD in each TR, use this regexp:

/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si
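
As a sketch of why the quantifier after [^>] and the i modifier both matter, the snippet below runs the pattern with and without the * on markup where one row uses upper-case tags. The upper-case row and the variable names are assumptions made up for the illustration.

```php
<?php
// Hypothetical input: one row in upper case, one in lower case.
$html = <<<'HTML'
<TR CLASS="row-even">
    <TD ALIGN="center">abcde</TD>
    <TD ALIGN="center"><A HREF="deluserconfirm.html?user=abcde">x</A></TD>
</TR>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh">x</a></td>
</tr>
HTML;

// <td[^>]*> accepts any run of attributes; /i makes TR/TD match as well.
$withStar = preg_match_all('/<tr[^>]*>\s*<td[^>]*>(.+?)<\/td>/si', $html, $matches);

// <td[^>]> allows exactly ONE character between "<td" and ">", so a cell
// written as <td align="center"> can never match.
$withoutStar = preg_match_all('/<tr[^>]*>\s*<td[^>]>(.+?)<\/td>/si', $html, $unused);

printf("with *: %d, without *: %d\n", $withStar, $withoutStar);
print_r($matches[1]);
```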

花落人断肠 2024-10-04 17:05:42

This is just a quick and dirty regex to meet your needs. It could easily be cleaned up and optimized, but it's a start.

<tr[^>]+>[^\n]*\n               #Match the opening <tr> tag
  \s*<td[^>]+>([^<]+)[^\n]+\n   #Group the wanted data
  [^\n]+\n                      #Match next line
</tr>                           #Match closing tag

Here is an alternative way, which may be more robust:

deluserconfirm\.html\?user=([^"]+)
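
Both ideas can be tried from PHP roughly as follows. This is a sketch under a few assumptions: the commented pattern is run with the x modifier and ~ as the delimiter, the input uses one tag per line exactly as in the question, and the variable names are illustrative.

```php
<?php
// Sample rows copied from the question (shortened to two rows).
$html = <<<'HTML'
<tr class="row-even">
    <td align="center">abcde</td>
    <td align="center"><a href="deluserconfirm.html?user=abcde"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
<tr class="row-odd">
    <td align="center">efgh</td>
    <td align="center"><a href="deluserconfirm.html?user=efgh"><img src="../images/delete_x.gif" alt="Delete User" border="none" /></a></td>
</tr>
HTML;

// Line-oriented pattern. With /x, whitespace and "#" comments inside the
// pattern are ignored, so it can be laid out over several lines.
$rowPattern = '~
    <tr[^>]+>[^\n]*\n               # opening tr tag and the rest of its line
    \s*<td[^>]+>([^<]+)[^\n]+\n     # first cell: capture its text
    [^\n]+\n                        # second cell line
    </tr>                           # closing tag
~x';
preg_match_all($rowPattern, $html, $byRow);

// Alternative: read the user name out of the delete link instead.
preg_match_all('~deluserconfirm\.html\?user=([^"]+)~', $html, $byLink);

print_r($byRow[1]);
print_r($byLink[1]);
```

The second pattern does not depend on the row layout at all, only on the link format, which is why it may hold up better if the table markup changes.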