Scrapy with newlines and nested tags
Disclaimer: new to Scrapy.
I have a table with pretty irregular rows. The basic structure is:
<tr>
<td> some text </td>
<td> some other text </td>
<td> yet some text </td>
</tr>
but occasionally (a few hundred times) some rows are
<tr>
<td> <p> some text <p> </td>
<td> <div class="class-whateva"> <p> some other text </p></div> </td>
<td> <span id="strange-id">
<a href="somelink"> yet some text </a>
<span> </td>
</tr>
or other permutations of one or two nested "p", "div", and "span" tags, with or without newline characters.
I've taken care of the nested "span span" or "p div" or "div span" with conditional statements of the form:
for row in allrows:
    if row.select('td[2]/text()'):
        item['seconditem'] = row.select('td[2]/text()').extract()
    elif row.select('td[2]/*/text()'):
        item['seconditem'] = row.select('td[2]/*/text()').extract()
    elif row.select('td[2]/*/*/text()'):
        item['seconditem'] = row.select('td[2]/*/*/text()').extract()
Now I have two questions:
(1) Is a conditional on td[2]/*/*/text() the right way to handle irregularly nested rows?
(2) I am still missing all the cases where there is a return (or newline) before the tag.
So if the row is of the form:
<td><div>
<p>text </p>
</div></td>
All my XPath expressions return is ['\n ']. Is there a trick to catch what comes after the newline character?
Any tips appreciated. Thanks.
Comments (1)
You can use the string() function in an XPath expression to get all the inner text nodes of an element as one string.
Or use //text() to get all the inner text nodes as a list, and ''.join(...) to combine them into a single string.
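Here is a minimal sketch of both approaches. It uses lxml directly rather than a Scrapy selector so that it runs without a spider; the same XPath strings should work when passed to the selector call used in the question. The sample HTML, the helper names cell_text_string and cell_text_join, and the use of .strip() to drop the surrounding whitespace are assumptions made for illustration, not part of the original answer.

```python
from lxml import html

# Sample rows modelled on the question: one flat row and one nested row
# with newlines before the inner tags (tags closed properly here).
SAMPLE = """
<table>
<tr>
<td> some text </td>
<td> some other text </td>
<td> yet some text </td>
</tr>
<tr>
<td> <p> some text </p> </td>
<td><div>
    <p>text </p>
</div></td>
<td> <span id="strange-id">
<a href="somelink"> yet some text </a>
</span> </td>
</tr>
</table>
"""


def cell_text_string(row, index):
    """Text of the index-th cell via XPath string(), with outer whitespace stripped."""
    # string() concatenates every descendant text node of the first matched node.
    return row.xpath('string(td[%d])' % index).strip()


def cell_text_join(row, index):
    """Text of the index-th cell via //text() and ''.join(...)."""
    # //text() returns all descendant text nodes at any depth, as a list.
    nodes = row.xpath('td[%d]//text()' % index)
    return ''.join(nodes).strip()


rows = html.fromstring(SAMPLE).xpath('//tr')
for row in rows:
    print(cell_text_string(row, 2), '|', cell_text_join(row, 2))
```

Either helper replaces the whole if/elif chain, because neither depends on how deeply the text is nested. Note that string() only looks at the first node it is given, so it fits a single cell such as td[2], while //text() followed by ''.join(...) also lets you filter or clean the individual pieces before joining.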