Python scraper not returning full HTML code on some subdomains
I am throwing together a Walmart review scraper. It currently scrapes HTML from most Walmart pages without a problem, but as soon as I try scraping a page of reviews, it only comes back with a small portion of the page's code, mainly just text from reviews and a few errant tags. Does anyone know what the problem could be?
import requests

headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Accept-Language': 'en-us',
    'Referer': 'https://www.walmart.com/',
    'sec-ch-ua-platform': 'Windows',
}
cookie_jar = {
    '_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = 'https://www.walmart.com/reviews/product/' + str(product_num)
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)
Comments (1)
As larsks already commented, some content is loaded dynamically, for example only once you scroll down far enough.
requests only fetches the HTML the server sends back and does not run the page's JavaScript, and BeautifulSoup only parses what it is given, so neither of them loads the whole page. You can solve this with Selenium.
Selenium opens your URL in a script-controlled web browser, which lets you fill out forms and scroll down, and the HTML it renders can then be parsed with BS4.
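A minimal sketch of that approach, assuming Selenium 4 and BeautifulSoup are installed and that Chrome with a matching driver is available. The helper names `scroll_to_bottom` and `fetch_rendered_soup`, the 2-second pause and the 20-round cap are illustrative choices made for this example, not values the site requires:

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls after which the page had grown.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    grown = 0
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # crude wait for lazily loaded content to arrive
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # nothing new appeared, assume the end of the page
        last_height = new_height
        grown += 1
    return grown


def fetch_rendered_soup(url):
    """Open url in Chrome, scroll to the bottom, and parse the rendered HTML."""
    # Imported here so scroll_to_bottom can be tried without a browser installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are set up
    try:
        driver.get(url)
        scroll_to_bottom(driver)
        html = driver.page_source  # HTML after the browser has run the scripts
    finally:
        driver.quit()
    return BeautifulSoup(html, "html.parser")
```

Calling `fetch_rendered_soup('https://www.walmart.com/reviews/product/' + product_num)` should then give you a soup built from the rendered page instead of the bare response. A fixed `time.sleep` is the simplest way to wait and may need tuning, and a site can still block or challenge an automated browser.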
This solution assumes that the reviews are loaded by scrolling down the page. Of course you don't have to use BeautifulSoup to scrape the site; that is a matter of personal preference. Let me know if it helped.