Scrapy with newline characters and nested tags

Posted on 2024-12-29 08:36:45

Disclaimer: new to scrapy.

I have a table with pretty irregular rows. The basic structure is:

<tr>
 <td> some text </td>
 <td> some other text </td>
 <td> yet some text </td>
</tr>

but occasionally (a few hundred times) some rows are

<tr>
 <td> <p> some text <p> </td>
 <td> <div class="class-whateva"> <p> some other text </p></div> </td>
 <td> <span id="strange-id"> 
  <a href="somelink"> yet some text </a> 
    <span> </td>
</tr>

or other permutations of one or two nested "p", "div", and "span" tags, with or without newline characters.

I've taken care of the nested "span span" or "p div" or "div span" with conditional statements of the form:

for row in allrows:
    if row.select('td[2]/text()'):
        item['seconditem'] = row.select('td[2]/text()').extract()
    elif row.select('td[2]/*/text()'):
        item['seconditem'] = row.select('td[2]/*/text()').extract()
    elif row.select('td[2]/*/*/text()'):
        item['seconditem'] = row.select('td[2]/*/*/text()').extract()
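
For reference, here is a self-contained version of that cascade, sketched with the modern scrapy.Selector API (which I believe is the current equivalent of the .select() calls above):

from scrapy.selector import Selector

# Two sample rows: one plain, one with the text nested two levels deep.
html = '''
<table>
  <tr><td>a</td><td> some other text </td><td>c</td></tr>
  <tr><td>a</td><td><div class="class-whateva"><p> some other text </p></div></td><td>c</td></tr>
</table>
'''

item = {}
for row in Selector(text=html).xpath('//tr'):
    # Probe one nesting level at a time and take the first non-empty match.
    if row.xpath('td[2]/text()'):
        item['seconditem'] = row.xpath('td[2]/text()').extract()
    elif row.xpath('td[2]/*/text()'):
        item['seconditem'] = row.xpath('td[2]/*/text()').extract()
    elif row.xpath('td[2]/*/*/text()'):
        item['seconditem'] = row.xpath('td[2]/*/*/text()').extract()
    print(item['seconditem'])  # [' some other text '] for both rows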

Now I have two questions:

(1) Is a conditional like

td[2]/*/*/text()

the right way to handle irregularly nested rows?

(2) I am still missing all the cases where there is a return (or newline) before the tag.
So if the row is of the form:

   <td><div>
      <p>text </p>
   </div></td>

All my XPath will return is ['\n ']. Any trick to catch what's after the newline character?
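
To make the failure concrete, here is a minimal sketch (using the same modern scrapy.Selector stand-in as above) showing that the */text() branch already matches the whitespace-only text nodes inside the div, so the deeper branch holding the real text is never tried:

from scrapy.selector import Selector

html = '<table><tr><td><div>\n      <p>text </p>\n   </div></td></tr></table>'
row = Selector(text=html).xpath('//tr')[0]

print(row.xpath('td[1]/text()').extract())       # [] - no text node directly under <td>
print(row.xpath('td[1]/*/text()').extract())     # ['\n      ', '\n   '] - whitespace only,
                                                 #   but non-empty, so this branch wins
print(row.xpath('td[1]/*/*/text()').extract())   # ['text '] - the real text, never reached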

Any tips appreciated. Thanks.

Comments (1)

枫林﹌晚霞¤ 2025-01-05 08:36:45

You can use the string() function in an XPath expression to get all the inner text nodes as one string:

# nested.html - your second HTML snippet
# $ scrapy shell "nested.html"

In [1]: row = hxs.select('//tr')

In [2]: row.select('td[2]').select('string()').extract()
Out[2]: [u'   some other text  ']

In [3]: row.select('td[2]').select('string()').extract()[0]
Out[3]: u'   some other text  '

In [4]: row.select('td[3]').select('string()').extract()[0]
Out[4]: u'  \r\n   yet some text  \r\n     '
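
As a possible refinement (a sketch of mine, assuming the .select() API accepts string-valued XPath expressions the same way it does string() above): normalize-space() joins the inner text and collapses the \r\n runs in one step:

# normalize-space() strips leading/trailing whitespace and squeezes
# internal whitespace runs (including \r\n) down to single spaces.
text = row.select('td[3]').select('normalize-space()').extract()[0]
# expected: u'yet some text'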

Or //text() to get all inner text nodes:

In [5]: row.select('td[3]//text()').extract()
Out[5]: [u' \r\n  ', u' yet some text ', u' \r\n    ', u' ']

And ''.join(...) to get a single string:

In [6]: ''.join(row.select('td[3]//text()').extract())
Out[6]: u' \r\n   yet some text  \r\n     '
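
And if the \r\n padding in that joined string is unwanted, a split()/join round trip is a common cleanup (a small sketch of my own on top of the answer):

# Collapse every whitespace run (spaces, \r\n) into a single space.
parts = row.select('td[3]//text()').extract()
text = ' '.join(''.join(parts).split())
# expected: u'yet some text'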