Unable to get the correct HTML response from the "myntra" website using requests.get()
I have run into a problem while scraping the Myntra website. I am trying to scrape prices and availability. With requests.get(), I get the page content when running on localhost, but not in Google Colab. The response status is <200> both times, but in Colab the body is a "Site Maintenance" HTML page, whereas on localhost everything works fine. I can't work out what is going on and would be grateful for any help. Everything works locally but not on the server.
Myntra product link
My Code:
import requests

s = requests.session()
url = "https://www.myntra.com/jeans/levis/levis-512-men-black-slim-tapered-fit-mid-rise-clean-look-light-fade-stretchable-jeans/16612780/buy?utm_campaign=_3_&utm_medium=affiliate&utm_source=grabon"
# No custom headers are set, so requests sends its default User-Agent.
page = s.get(url)
page.content
Output:
b'<!doctype html> <html> <head> <title>Site Maintenance</title> <style type="text/css">body { text-align: center; padding: 150px; }h1 { font-size: 40px; }body { font: 16px Helvetica, sans-serif; color: #333; }#error { display: block; text-align: left; width: 650px; margin: 0 auto; }</style> </head> <body> <div id="error"> <h1>Oops! Something went wrong</h1> <div> <hr> <p>Please contact your administrator</p> </div> </div> </body> </html>'
Locally, the same request returns the correct content.
Comments (1)
This site uses anti-bot protection: Akamai Bot Manager.
How to bypass Akamai Bot Manager?
In this case, just adding a User-Agent header works. The User-Agent header identifies the client making the request; a browser-like value makes the request look like it comes from a real browser.
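As a rough illustration of the User-Agent point, here is a minimal sketch, not the answer's own code. The User-Agent string and the helper names `make_session` and `is_maintenance_page` are assumptions made for this example; the maintenance check simply looks for the `<title>` seen in the question's output.

```python
import requests

# Assumption: any current desktop-browser User-Agent string should do;
# this particular value is only illustrative.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def make_session(user_agent: str = BROWSER_UA) -> requests.Session:
    """Return a session that sends a browser-like User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def is_maintenance_page(html: str) -> bool:
    """Heuristic: the block page shown in the question carries this <title>."""
    return "<title>Site Maintenance</title>" in html
```

One would then call `page = make_session().get(url, timeout=10)` and pass `page.text` to `is_maintenance_page` before parsing. A 200 status alone is not enough here, because the maintenance page is also served with status 200.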
This site also needs JavaScript to render all of its content, but you can find a script tag that contains JSON with all the info you need.
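The idea of reading JSON out of a script tag could be sketched as below, using only the standard library. This is an assumption-laden illustration, not the answer's code: the exact script tag and the JSON field names on the real page are not given here, so `ScriptCollector` and `extract_json_objects` are hypothetical helpers that return every JSON object found at the start of a script body's first `{`.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies may arrive in several chunks, so concatenate them.
        if self._in_script:
            self.scripts[-1] += data


def extract_json_objects(html):
    """Return the JSON objects that can be decoded from <script> bodies."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    decoder = json.JSONDecoder()
    found = []
    for text in collector.scripts:
        start = text.find("{")  # skip a prefix such as "window.x = "
        if start == -1:
            continue
        try:
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript, not JSON
        found.append(obj)
    return found
```

Here `html` would be the `page.text` of a successful response. Which of the returned objects holds price and availability, and under which keys, has to be checked against the actual page.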
Here is some working code:
Output: