Scrapy body text only

Posted 2024-10-26 06:37:00 | Words: 94 | Views: 4 | Comments: 0

I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.

Wishing some scholars might be able to help me here scraping all the text from the <body> tag.

Comments (2)

情独悲 2024-11-02 06:37:00

Scrapy uses XPath notation to extract parts of an HTML document. Have you tried using the /html/body path to extract <body> (assuming it is nested in <html>)? The //body selector may be even simpler:

x.select("//body").extract()    # extract body

You can find more information about the selectors Scrapy provides in the Selectors section of the Scrapy documentation.

千仐 2024-11-02 06:37:00

It would be nice to get output like that produced by lynx -nolist -dump, which renders the page and then dumps the visible text. I've gotten close by extracting the text of all children of paragraph elements.

I started with //body//text(), which pulls every text node inside the body, but that includes the contents of script elements. //body//p gets all of the paragraph elements inside the body, including the implied paragraph tag around untagged text. Extracting the text with //body//p/text() misses text inside nested tags (like bold, italic, span, div). //body//p//text() seems to get most of the desired content, as long as the page doesn't have script tags embedded in paragraphs.

In XPath, / selects direct children, while // selects all descendants.

% scrapy shell
In[1]: fetch('http://stackoverflow.com/questions/5390133/scrapy-body-text-only')
In[2]: hxs.select('//body//p//text()').extract()

Out[2]:
[u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet.",
u'Wishing some scholars might be able to help me here scraping all the text from the ',
u'<body>',
u' tag.',
u'Thank you in advance for your time.',
u'Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the ',
u'/html/body',
u' path to extract ',
u'<body>',
u"? (assuming it's nested in ",
u'<html>',
u'). It might be even simpler to use the ',
u'//body',
u' selector:',
u'You can find more information about the selectors Scrapy provides ',
u'here',
 ...]

Join the strings together with a space and you have a pretty good output:

In [43]: ' '.join(hxs.select("//body//p//text()").extract())
Out[43]: u"I am trying to scrape the text only from body using python Scrapy, but haven't had any luck yet. Wishing some scholars might be able to help me here scraping all the text from the  <body>  tag. Thank you in advance for your time. Scrapy uses XPath notation to extract parts of a HTML document. So, have you tried just using the  /html/body  path to extract  <body> ? (assuming it's nested in  <html> ). It might be even simpler to use the  //body  selector: You can find more information about the selectors Scrapy provides  here . This is a collaboratively edited question and answer site for  professional and enthusiast programmers . It's 100% free, no registration required. about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged asked 1 year ago viewed 280 times active 1 year ago"
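The joined string above still contains doubled spaces, non-breaking spaces (\xa0) and a \r\n followed by indentation. One possible cleanup, sketched here with a hypothetical `fragments` list standing in for the extracted text nodes, is to split on whitespace after joining and join again with single spaces. In Python 3, `str.split()` with no arguments also treats \xa0 as whitespace.

```python
# Hypothetical stand-in for the list returned by the selector.
fragments = [
    "text from the ",
    "<body>",
    " tag.",
    "about \xbb \xa0\xa0\xa0 faq \xbb \r\n             tagged",
]

# Join with spaces as in the answer, then collapse every run of whitespace
# (spaces, \r\n, non-breaking spaces) into a single space.
cleaned = " ".join(" ".join(fragments).split())

print(cleaned)
```

This trades fidelity for readability: line breaks and indentation from the page are lost, and punctuation that followed an inline tag ends up separated from it by a space.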