Select all text nodes in an element, excluding text in child elements
While scraping a site, I have HTML like this:
<div class="classA classB classC">
<div class="classD classE">
<h1 class="classF classD">Text I don't want</h1>
<ul>....</ul> <!-- containing more text in nested children, don't want -->
</div>
Text I want to grab.
<br>
More text I want to grab
</div>
Here, how can I select only the text I want to grab, i.e. ["Text I want to grab", "More text I want to grab"], and avoid selecting "Text I don't want"? I am trying to select it with a CSS selector like this:
text = response.css('.classA:not(.classD) *::text').getall()
Does anyone know what to do in this case? I am not familiar with XPath, but please do suggest an XPath solution if you have one.
Comments (1)
You are close to your goal. You want to exclude
<h1 class="classF classD">Text I don't want</h1>
Using :not is correct, but you first have to select the entire portion of the HTML that contains your desired output, meaning you have to select <div class="classA classB classC">,
and then exclude whatever you don't want. So the CSS expression should be like: OR
Proven by scrapy shell: