XPath: get the text and anchors between br tags

Posted on 2025-01-10 01:44:44

I have a question that is, for me, very hard.

This is the HTML I'm dealing with:

<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a> 
2011 - 
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a> 
2019) (his death)
<br/>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)
<br/>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)
</td>

How do I use XPath to get everything, including the anchors, between the start of the td and the first <br/>, and then repeat that for everything between the first <br/> and the next one, and so on?
I hope I'm making myself clear; I'm not a professional, just a hobby programmer.

nodeValue just gives all of the text, but not the href from a possible person anchor, which is not always present, as you can see.

So, to make it clearer, this is what I want:

Margaret (Parky) DeVogelaere (<a href="/date/06-19?ref_=nmbio_sp_1">19
June</a>  2011 -  <a href="/date/08-16?ref_=nmbio_sp_1">16
August</a>  2019) (his death)

Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)

<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)

This comes from IMDb, https://www.imdb.com/name/nm0001228/bio, in the Family section, Spouse part.

I can get the td in question, but I don't understand how to get the data I want:

$cells = $xp->query("//table[contains(@id, 'tableFamily')]/tr[1]/td[2]");

Or does anybody know a different approach?

雨后咖啡店 2025-01-17 01:44:44

Well, your HTML is a little funky, so one way to get what you want (or close to it) is to perform some surgery on the original HTML (replacing each <br/> with </td><td> so as to create parsable elements), and then use an HTML parser and XPath:

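// Paste the <td> markup from the question between the quotes.
$orig = '
[your html above]
';
// Turn each <br/> into a cell boundary so every entry gets its own <td>.
$newstring = str_replace("<br/>", "</td><td>", $orig);
$doc = new DOMDocument();
$doc->loadHTML($newstring);
$xpath = new DOMXPath($doc);
$sources = $xpath->query('//td');
foreach ($sources as $source) {
    // saveHTML() on a node keeps the markup, so the <a href> attributes survive.
    echo $source->ownerDocument->saveHTML($source)."\r\n---------------\r\n";
}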

Output:

<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a> 
2011 - 
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a> 
2019) (his death)
</td>
---------------
<td>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a> 
1975                            - 
2011) (divorced)
</td>
---------------
<td>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a> 
1961                            - 
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a> 
1974) (divorced)
 (2 children)
</td>
---------------
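
If you would rather not edit the raw HTML string, a different approach is to walk the child nodes of the td and start a new segment every time a br element turns up. The following is only a minimal sketch under some assumptions: `splitOnBr` is a hypothetical helper name, the sample markup is a shortened stand-in for the IMDb cell, and `//td` stands in for the query from the question.

```php
<?php
// Hypothetical helper (not part of the DOM extension): collects the serialized
// children of $cell into segments, starting a new segment at every <br>.
function splitOnBr(DOMNode $cell): array
{
    $segments = [''];
    foreach ($cell->childNodes as $child) {
        if ($child->nodeName === 'br') {
            // A <br> closes the current segment and opens the next one.
            $segments[] = '';
            continue;
        }
        // saveHTML() on a single node keeps the tags, so href attributes survive.
        $segments[count($segments) - 1] .= $cell->ownerDocument->saveHTML($child);
    }
    // Strip the leading and trailing whitespace left over from the source layout.
    return array_map('trim', $segments);
}

// Shortened stand-in for the real cell (assumed sample data).
$html = '<table><tr><td>A (<a href="/x">1</a>) one<br/>B two<br/>'
      . '<a href="/n">C</a> three</td></tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

// In the real page this would be the td selected by the query in the question.
$segments = splitOnBr($xp->query('//td')->item(0));

foreach ($segments as $segment) {
    echo $segment, "\n---------------\n";
}
```

Each entry of `$segments` is an HTML string, so it can be loaded into another DOMDocument if the individual href values are needed. Real pages usually trigger libxml warnings; calling `libxml_use_internal_errors(true)` before `loadHTML()` keeps them quiet.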
