Select all text nodes of an element, without the text inside its child elements

Posted on 2025-02-11 01:07:28

On scraping a site, I have an HTML like this:

<div class="classA classB classC">
  <div class="classD classE">
    <h1 class="classF classD">Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>

Here, how can I select only the text I want to grab, i.e. ["Text I want to grab.", "More text I want to grab"], and avoid selecting "Text I don't want"? I am trying a CSS selector like this:

text = response.css('.classA:not(.classD) *::text').getall()

Does anyone know what to do in this case? I am not familiar with XPath, but please do suggest an XPath solution if you have one.


昔日梦未散 2025-02-18 01:07:28

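You are almost there. Using :not to keep out <h1 class="classF classD">Text I don't want</h1> is a reasonable idea, but you first have to select the element that directly contains your desired output, i.e. <div class="classA classB classC">, and only then exclude whatever you don't want. Note that ::text applied directly to that div (without the " *" descendant step) returns only the div's own text nodes, so the text inside its child elements is skipped; the :not(.classF) test applies to the div itself, which has no classF, so the div still matches. So the CSS expression should be like: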

response.css('div.classA.classB.classC:not(.classF)::text').getall()

OR

' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])

Proven by scrapy shell:

In [1]: from scrapy.selector import Selector

In [2]: %paste

html='''
<div class="classA classB classC">
  <div class="classD classE">
    <h1 class="classF classD">Text I don't want</h1>
    <ul>....</ul> <!-- containing more text in nested children, don't want -->
  </div>
  Text I want to grab.
  <br>
  More text I want to grab
</div>
'''

## -- End pasted text --

In [3]: resp=Selector(text=html)

In [4]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip()
Out[4]: 'Text I want to grab.\n  \n  More text I want to grab'

In [5]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).replace('\n','').strip()
Out[5]: 'Text I want to grab.    More text I want to grab'

In [6]: ''.join(resp.css('div.classA.classB.classC:not(.classF)::text').getall()).strip().replace('\n','').strip()
Out[6]: 'Text I want to grab.    More text I want to grab'

In [7]: [x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()]
Out[7]: ['', 'Text I want to grab.', 'More text I want to grab']

In [8]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[8]: 'Text I want to grab.More text I want to grab'

In [9]: ''.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])
Out[9]: 'Text I want to grab.More text I want to grab'

In [10]: ' '.join([x.strip() for x in resp.css('div.classA.classB.classC:not(.classF)::text').getall()])        
Out[10]: ' Text I want to grab. More text I want to grab'