Scrapy: body text only
I am trying to scrape only the text from the body using Python Scrapy, but haven't had any luck yet.
I'm hoping someone here can help me scrape all the text from the <body> tag.
Comments (2)
Scrapy uses XPath notation to extract parts of an HTML document. So, have you tried just using the /html/body path to extract <body> (assuming it's nested in <html>)? It might be even simpler to use the //body selector. You can find more information about the selectors Scrapy provides in the Scrapy selectors documentation.
It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I've gotten close by extracting the text of all descendants of paragraph elements.
I started with //body//text(), which pulled all the text nodes inside the body, but this included the contents of script elements. //body//p gets all of the paragraph elements inside the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text inside subtags (like bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.
In XPath, / implies a direct child, while // includes all descendants.
Join the strings together with a space and you have pretty good output.