Can't scrape information from a static webpage using the requests module
I'm trying to fetch the product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, the attempt below fails to grab them: the script currently throws an AttributeError.
import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title, product_desc)
How can I scrape the title and description from the page above using the requests module?
Comments (2)
The page is dynamic. Go after the data from the API source:
Output:
When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response including its headers. Once everything works, just convert it to Python code. Postman even has some 'export to code' functions for common languages.
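As a minimal sketch of "output the response" in Python: the helper below, `describe_response`, is a hypothetical name of my own and not part of requests. It assumes only that the object passed in exposes `status_code`, `headers`, and `text`, which a `requests.Response` does.

```python
def describe_response(resp, body_preview=300):
    """Return a readable summary of a response-like object.

    Assumes `resp` exposes .status_code, .headers and .text, as
    requests.Response does. This helper is hypothetical, for illustration.
    """
    lines = [f"status: {resp.status_code}"]
    # Headers often reveal redirects, bot checks, or an unexpected content type.
    for name, value in resp.headers.items():
        lines.append(f"{name}: {value}")
    # A short preview is usually enough to tell a product page from an error page.
    lines.append("")
    lines.append(resp.text[:body_preview])
    return "\n".join(lines)
```

With the session from the question, this could be called as `print(describe_response(s.get(link)))` before any parsing is attempted.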
Anyway...
I tried your request in Postman and got this response:
[Screenshot: simple response body]
A request made from Python is the same thing as one made by a browser: if the headers, URL, and parameters are identical, both should receive identical responses. So the next step is to compare your request with the one the browser makes:
[Screenshot: browser request]
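One way to make that comparison concrete is to diff the two sets of headers. The sketch below uses a hypothetical helper, `diff_headers`, and placeholder header values; the real values would have to be copied from the browser's own request.

```python
def diff_headers(mine, browser):
    """Compare two header dicts case-insensitively.

    Returns (missing, different): header names the browser sent that are
    absent from `mine`, and names present in both with different values.
    This helper is hypothetical, not part of requests.
    """
    # HTTP header names are case-insensitive, so normalise before comparing.
    mine_lc = {k.lower(): v for k, v in mine.items()}
    browser_lc = {k.lower(): v for k, v in browser.items()}

    missing = sorted(k for k in browser_lc if k not in mine_lc)
    different = sorted(
        k for k in browser_lc if k in mine_lc and mine_lc[k] != browser_lc[k]
    )
    return missing, different


# Placeholder values for illustration only; copy the real ones from the
# browser's developer tools for the request you want to reproduce.
my_headers = {"User-Agent": "Mozilla/5.0"}
browser_headers = {
    "user-agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}

missing, different = diff_headers(my_headers, browser_headers)
print("missing:", missing)
print("different:", different)
```

The headers listed as missing are the candidates to add to the session one by one.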
So one or more of the headers sent by the browser gets a good response from the server, but User-Agent alone is not enough. I would try to identify which headers are needed, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
[Screenshot: blocked]
Probably due to sending an obviously handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.
So double-check that the same hasn't happened to you while working with your scraper.
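A rough way to run that check from the scraper itself is sketched below. The helper `looks_blocked` is hypothetical, the status codes 403 and 429 are my assumption about how a block might be signalled, and the phrase is taken from the message quoted above; the site's actual status code and wording may differ.

```python
# Assumed signals of a block; adjust after inspecting a real blocked response.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_PHRASES = ("unusual activity",)


def looks_blocked(status_code, body):
    """Heuristically decide whether a response looks like a block page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Fall back to scanning the body, since a block page can also be served
    # with a 200 status.
    lowered = body.lower()
    return any(phrase in lowered for phrase in BLOCK_PHRASES)
```

With the question's code this would be called as `looks_blocked(res.status_code, res.text)` before passing the body to BeautifulSoup.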
Best of luck!