How can I scrape URLs that don't appear in the downloaded HTML? JavaScript may be the problem
I am trying to scrape some URLs from this homepage (www.globo.com). I can get the headline and other URLs, but some of them aren't in the HTML and can't be scraped with requests and lxml. I don't want to use selenium/bs4/BeautifulSoup because the code will be running on a Heroku server, which would make everything more difficult.
The URLs that I want to scrape come after a div with these two classes: container and false. This is mandatory. Other URLs, whose div doesn't have the class "false", I can scrape easily.
Does anyone know how to scrape the URLs despite this problem? Or can someone recommend another library for this task (not bs4 or selenium)?
import requests
import lxml.html
url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[@class="container false"]//a/@href')
print(urls)
This also doesn't work:
import requests
import lxml.html
url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
# tolerant match on both classes - still empty, because the anchors
# are injected by JavaScript and aren't present in the raw HTML
urls = doc.xpath('//div[contains(@class, "container") and contains(@class, "false")]//a/@href')
print(urls)
Thank you
1 Answer
Turns out that the "missing" URLs are actually in the source, but you need to do a bit of digging.
Basically, these are loaded by JS from an embedded JSON. You can target the divs the JSON sits in and extract all the data for a given column.
Here's how to do that:
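Below is a minimal sketch of that idea. The column selector and the JSON layout (an "items" list whose entries carry a "url" key) are assumptions on my part, so inspect the live page source and adjust both to the real structure:

import json
import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)

# Assumption: each column div embeds its data as JSON inside a <script> tag
scripts = doc.xpath('//div[contains(@class, "container")]//script/text()')

urls = []
for script in scripts:
    # Slice out the first {...} object in case the JSON is assigned
    # to a JS variable instead of standing alone
    start, end = script.find('{'), script.rfind('}') + 1
    if start == -1 or end == 0:
        continue
    try:
        config = json.loads(script[start:end])
    except json.JSONDecodeError:
        continue
    # "items"/"url" are hypothetical key names - adapt them to the real JSON
    for item in config.get('items', []):
        try:
            urls.append(item['url'])
        except KeyError:
            continue  # some items have an ID but no URL (widgets)

print(urls)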
This should produce a list of the article URLs that don't show up in the raw HTML.
NOTE: Some items have an ID but no URL; these are usually widgets. Hence the try-except.
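As a side note, the same skip could be written without the exception handler, using item.get('url') and filtering out the Nones; the try-except just makes the widgets-without-URLs case explicit.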