XPath: get the text and anchors between br tags
I have what is, for me, a very hard question.
This is the HTML I'm dealing with:
<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a>
2011 -
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a>
2019) (his death)
<br/>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a>
1975 -
2011) (divorced)
<br/>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a>
1961 -
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a>
1974) (divorced)
(2 children)
</td>
How do I use XPath to get everything, anchors included, between the start of the td and the first <br/>, and then repeat that for everything between the first <br/> and the next one, and so on?
I hope I'm making myself clear; I'm not a professional, just a hobby programmer.
nodeValue just gives all of the text but not the href from a possible person anchor, which is not always present, as you can see.
So, to make it clearer, this is what I want:
Margaret (Parky) DeVogelaere (<a href="/date/06-19?ref_=nmbio_sp_1">19
June</a> 2011 - <a href="/date/08-16?ref_=nmbio_sp_1">16
August</a> 2019) (his death)
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a>
1975 -
2011) (divorced)
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a>
1961 -
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a>
1974) (divorced)
(2 children)
This comes from IMDb (https://www.imdb.com/name/nm0001228/bio), Family section, Spouse part.
I can get the td in question, but I don't understand how to get the data that I want:
$cells = $xp->query("//table[contains(@id, 'tableFamily')]/tr[1]/td[2]");
Or does anybody know a different approach?
Comments (1)
Well, your HTML is a little funky, so one way to get what you want (or close to it) is to perform some surgery on the original HTML (replacing the <br/>s with </td><td>s so as to create parsable elements), and then use an HTML parser and XPath.
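A minimal sketch of that approach in PHP (chosen to match the `$xp->query` call in the question), not the answerer's original code. It assumes the markup of the target td is already available as a string, for example from `$doc->saveHTML($cells->item(0))`. The variable names, the regex that also accepts `<br>` and `<br />`, and the `<table><tr>` wrapper are illustrative assumptions.

```php
<?php
// Sample markup copied from the question. In practice this string would be
// the serialized <td> selected by the XPath query in the question.
$html = <<<'HTML'
<td>
Margaret (Parky) DeVogelaere
(<a href="/date/06-19?ref_=nmbio_sp_1">19 June</a>
2011 -
<a href="/date/08-16?ref_=nmbio_sp_1">16 August</a>
2019) (his death)
<br/>
Portia Rebecca "Becky" Crockett
(<a href="/date/11-11?ref_=nmbio_sp_2">11 November</a>
1975 -
2011) (divorced)
<br/>
<a href="/name/nm0108232?ref_=nmbio_sp_3">Susan Brewer </a>(<a href="/date/10-08?ref_=nmbio_sp_3">8 October</a>
1961 -
<a href="/date/04-15?ref_=nmbio_sp_3">15 April</a>
1974) (divorced)
(2 children)
</td>
HTML;

// The "surgery": turn every <br/> into a cell boundary so that each spouse
// ends up in its own <td>. Accepting <br> and <br /> too is an assumption.
$split = preg_replace('~<br\s*/?>~i', '</td><td>', $html);

// Wrap the cells in a table row so the parser sees ordinary table markup.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML('<table><tr>' . $split . '</tr></table>');
$xp = new DOMXPath($doc);

$spouses = [];
foreach ($xp->query('//td') as $cell) {
    // Inner markup of the cell: serialize each child node, anchors included.
    $markup = '';
    foreach ($cell->childNodes as $child) {
        $markup .= $doc->saveHTML($child);
    }

    // href values of all anchors inside this cell, in document order.
    $hrefs = [];
    foreach ($xp->query('.//a/@href', $cell) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }

    $spouses[] = [
        'html'  => trim($markup),
        'text'  => trim(preg_replace('/\s+/', ' ', $cell->textContent)),
        'hrefs' => $hrefs,
    ];
}

print_r($spouses);
```

Each entry of `$spouses` then holds the inner markup, the whitespace-collapsed text, and the href values for one spouse, so the person link is available whenever one exists and the `hrefs` list simply contains only date links when it does not.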