XPath: getting the text and anchors between br tags

Posted 2025-01-10 01:44:44 · 2013 words · 0 views · 0 comments

I have a question that is, for me, very hard.

This is the HTML I'm dealing with:

<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a> 
2011 - 
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a> 
2019) (his death)
<br/>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)
<br/>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)
</td>

How do I use XPath to get everything, anchors included, between the start of the td and the first br/, and then repeat that for everything between the first br/ and the next one, and so on?
I hope I'm making myself clear. I'm not a professional, just a hobby programmer.

nodeValue just gives all of the text, but not the href of a possible person anchor, which is not always present, as you can see.

So, to make it clearer, this is what I want:

Margaret (Parky) DeVogelaere (<a href="/date/06-19?ref_=nmbio_sp_1">19
June</a>  2011 -  <a href="/date/08-16?ref_=nmbio_sp_1">16
August</a>  2019) (his death)

Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)

<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)

This comes from IMDb, https://www.imdb.com/name/nm0001228/bio, in the spouse part of the Family section.

I can get the td in question, but I don't understand how to get the data that I want:

$cells = $xp->query("//table[contains(@id, 'tableFamily')]/tr[1]/td[2]");

Or does anybody know a different approach?

Comments (1)

雨后咖啡店 2025-01-17 01:44:44

Well, your HTML is a little funky, so one way to get what you want (or close to it) is to perform some surgery on the original HTML (replacing the <br/>s with </td><td>s so as to create parsable elements), and then use an HTML parser and XPath:

$orig = '
[your html above]
';
// Turn each <br/> into a cell boundary so every entry becomes its own <td>.
$newstring = str_replace("<br/>", "</td><td>", $orig);
$doc = new DOMDocument();
$doc->loadHTML($newstring);
$xpath = new DOMXPath($doc);
$sources = $xpath->query('//td');
foreach ($sources as $source) {
    // saveHTML($node) serializes the cell with its markup, so the hrefs survive.
    echo $source->ownerDocument->saveHTML($source)."\r\n---------------\r\n";
}

Output:

<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a> 
2011 - 
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a> 
2019) (his death)
</td>
---------------
<td>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)
</td>
---------------
<td>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)
</td>
---------------
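
If you would rather not edit the HTML string, the same split can probably be done with XPath alone, by grouping the td's child nodes according to how many br siblings come before them. This is only a sketch under some assumptions: the sample cell below is shortened from the question and wrapped in a table by me, `//td` stands in for the real `tableFamily` query, and `$segments` is just a name I picked.

```php
<?php
// Sample cell taken from the question; the <table><tr> wrapper is an
// assumption added here so the fragment parses as a normal table cell.
$html = '<table><tr><td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a>&nbsp;
2011 - 
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a>&nbsp;
2019)&nbsp;(his death)
<br/>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a>&nbsp;
1975 - 
2011)&nbsp;(divorced)
<br/>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a>&nbsp;
1961 - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a>&nbsp;
1974)&nbsp;(divorced)
</td></tr></table>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// On the real page this would be the tableFamily query from the question.
$td = $xp->query('//td')->item(0);

// N <br/> elements delimit N + 1 groups of sibling nodes.
$brCount = $xp->query('br', $td)->length;

$segments = [];
for ($i = 0; $i <= $brCount; $i++) {
    // All child nodes (text and elements) that have exactly $i <br/>
    // siblings before them, leaving out the <br/> elements themselves.
    $nodes = $xp->query(
        "node()[not(self::br)][count(preceding-sibling::br) = $i]",
        $td
    );
    $chunk = '';
    foreach ($nodes as $node) {
        // Serializing node by node keeps the <a href="..."> markup.
        $chunk .= $doc->saveHTML($node);
    }
    $segments[] = trim($chunk);
}

foreach ($segments as $segment) {
    echo $segment, "\n---------------\n";
}
```

Each entry of `$segments` is then one spouse as an HTML string, with the anchors still in it, so you can feed it to another parser or pull the hrefs out with a further query if you need them separately.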