Scrapy with newline characters and nested tags

Posted on 2024-12-29 08:36:45

Disclaimer: new to scrapy.

I have a table with pretty irregular rows. The basic structure is:

<tr>
 <td> some text </td>
 <td> some other text </td>
 <td> yet some text </td>
</tr>

but occasionally (a few hundred times) some rows are

<tr>
 <td> <p> some text <p> </td>
 <td> <div class="class-whateva"> <p> some other text </p></div> </td>
 <td> <span id="strange-id"> 
  <a href="somelink"> yet some text </a> 
    <span> </td>
</tr>

or other permutations of one or two nested "p", "div", and "span" tags, with or without newline characters.

I've taken care of the nested "span span" or "p div" or "div span" with conditional statements of the form:

for row in allrows:
    if row.select('td[2]/text()'):
        item['seconditem'] = row.select('td[2]/text()').extract()
    elif row.select('td[2]/*/text()'):
        item['seconditem'] = row.select('td[2]/*/text()').extract()
    elif row.select('td[2]/*/*/text()'):
        item['seconditem'] = row.select('td[2]/*/*/text()').extract()
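
For reference, here is a self-contained version of that cascade, sketched with the modern scrapy.Selector API (which I believe is the current equivalent of the .select() calls above):

from scrapy.selector import Selector

# Two sample rows: one plain, one with the text nested two levels deep.
html = '''
<table>
  <tr><td>a</td><td> some other text </td><td>c</td></tr>
  <tr><td>a</td><td><div class="class-whateva"><p> some other text </p></div></td><td>c</td></tr>
</table>
'''

item = {}
for row in Selector(text=html).xpath('//tr'):
    # Probe one nesting level at a time and take the first non-empty match.
    if row.xpath('td[2]/text()'):
        item['seconditem'] = row.xpath('td[2]/text()').extract()
    elif row.xpath('td[2]/*/text()'):
        item['seconditem'] = row.xpath('td[2]/*/text()').extract()
    elif row.xpath('td[2]/*/*/text()'):
        item['seconditem'] = row.xpath('td[2]/*/*/text()').extract()
    print(item['seconditem'])  # [' some other text '] for both rows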

Now I have two questions:

(1) Is a conditional like

td[2]/*/*/text()

the right way to handle irregularly nested rows?

(2) I am still missing all the cases where there is a return (or newline) before the tag.
So if the row is of the form:

   <td><div>
      <p>text </p>
   </div></td>

All my XPath will return is ['\n ']. Any trick to catch what's after the newline character?
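
To make the failure concrete, here is a minimal sketch (using the same modern scrapy.Selector stand-in as above) showing that the */text() branch already matches the whitespace-only text nodes inside the div, so the deeper branch holding the real text is never tried:

from scrapy.selector import Selector

html = '<table><tr><td><div>\n      <p>text </p>\n   </div></td></tr></table>'
row = Selector(text=html).xpath('//tr')[0]

print(row.xpath('td[1]/text()').extract())       # [] - no text node directly under <td>
print(row.xpath('td[1]/*/text()').extract())     # ['\n      ', '\n   '] - whitespace only,
                                                 #   but non-empty, so this branch wins
print(row.xpath('td[1]/*/*/text()').extract())   # ['text '] - the real text, never reached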

Any tips appreciated. Thanks.

Comments (1)

枫林﹌晚霞¤ 2025-01-05 08:36:45

You can use the string() function in an XPath expression to get all the inner text nodes as one string:

# nested.html - your second HTML snippet
# $ scrapy shell "nested.html"

In [1]: row = hxs.select('//tr')

In [2]: row.select('td[2]').select('string()').extract()
Out[2]: [u'   some other text  ']

In [3]: row.select('td[2]').select('string()').extract()[0]
Out[3]: u'   some other text  '

In [4]: row.select('td[3]').select('string()').extract()[0]
Out[4]: u'  \r\n   yet some text  \r\n     '
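
As a possible refinement (a sketch of mine, assuming the .select() API accepts string-valued XPath expressions the same way it does string() above): normalize-space() joins the inner text and collapses the \r\n runs in one step:

# normalize-space() strips leading/trailing whitespace and squeezes
# internal whitespace runs (including \r\n) down to single spaces.
text = row.select('td[3]').select('normalize-space()').extract()[0]
# expected: u'yet some text'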

Or //text() to get all inner text nodes:

In [5]: row.select('td[3]//text()').extract()
Out[5]: [u' \r\n  ', u' yet some text ', u' \r\n    ', u' ']

And ''.join(...) to get a single string:

In [6]: ''.join(row.select('td[3]//text()').extract())
Out[6]: u' \r\n   yet some text  \r\n     '
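
And if the \r\n padding in that joined string is unwanted, a split()/join round trip is a common cleanup (a small sketch of my own on top of the answer):

# Collapse every whitespace run (spaces, \r\n) into a single space.
parts = row.select('td[3]//text()').extract()
text = ' '.join(''.join(parts).split())
# expected: u'yet some text'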