Python scraper not returning the full HTML code on some subdomains

Posted 2025-02-02 02:17:53

I am putting together a Walmart review scraper. It currently scrapes the HTML from most Walmart pages without a problem, but as soon as I try to scrape a page of reviews, the response contains only a small portion of the page's code: mainly the text of the reviews and a few errant tags. Does anyone know what the problem could be?

import requests

# Browser-like headers so the request resembles a normal page visit
headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36',
    'Accept-Language': 'en-us',
    'Referer': 'https://www.walmart.com/',
    'sec-ch-ua-platform': 'Windows',
}
# Cookie sent along with the request
cookie_jar = {
    '_pxvid': '35ed81e0-cb1a-11ec-aad0-504d5a625548',
}
product_num = input('Enter Product Number: ')
url2 = f'https://www.walmart.com/reviews/product/{product_num}'
r = requests.get(url2, headers=headers, cookies=cookie_jar, timeout=5)
print(r.text)

上课铃就是安魂曲 2025-02-09 02:17:54

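As larsks already commented, some content is loaded dynamically by JavaScript, for example only once you scroll down far enough.
requests only downloads the HTML the server sends and never runs that JavaScript, and BeautifulSoup only parses whatever it is given, so neither of them sees the whole page. You can solve this with Selenium.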
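
A quick way to check whether this is what is happening is to look at how much visible text the raw response actually contains compared with how many scripts it pulls in. The snippet below is only a rough sketch using the standard library; `PageSummary`, `summarize` and the sample markup are made-up names and data for illustration, not anything taken from Walmart's pages.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Counts <script> tags and collects text that appears outside them."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_text = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style source
        if not self._skip_depth and data.strip():
            self.visible_text.append(data.strip())


def summarize(html):
    """Return (number of script tags, visible text) for an HTML string."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return parser.script_count, " ".join(parser.visible_text)


# Hypothetical response body: a shell page whose reviews are filled in by JavaScript
sample = """<html><body><div id="reviews"></div>
<script>loadReviews();</script><script src="app.js"></script></body></html>"""

scripts, text = summarize(sample)
print(scripts, repr(text))
```

If you run `summarize(r.text)` on the response from the question and get many scripts but very little visible text, that suggests the rest of the page is built in the browser, which is the case Selenium handles.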

Selenium opens your URL in a script-controlled web browser, which lets you fill out forms and scroll down the page. Below is a code example showing how to use Selenium together with BS4.

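import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Download geckodriver and save it at the path below (the raw string keeps the backslashes literal).
# Selenium 4 takes the driver path through a Service object instead of executable_path.
driver = webdriver.Firefox(service=Service(r"C:\Program Files (x86)\geckodriver.exe"))
# For Chrome: import Service from selenium.webdriver.chrome.service instead and use
# driver = webdriver.Chrome(service=Service(r"C:\Program Files (x86)\chromedriver.exe"))

# Open the URL with the reviews
driver.get("https://www.example.com")
driver.maximize_window()

# Scroll down to the bottom of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(2)  # give lazily loaded content a moment to arrive

# Now you can scrape the rendered page from your Selenium browser:
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
driver.quit()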

This solution assumes that the reviews are loaded as you scroll down the page. Of course, you don't have to use BeautifulSoup to parse the result; that is personal preference. Let me know if it helped.
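
If a single scroll is not enough because more reviews keep loading each time you reach the bottom, one option is to keep scrolling until the page height stops changing. This is a minimal sketch under that assumption; `scroll_until_stable`, `pause` and `max_rounds` are names I made up, and the fixed sleep is a crude stand-in for a proper wait.

```python
import time


def scroll_until_stable(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is any object with Selenium's `execute_script` method.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # crude wait for lazily loaded content to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was added, so we reached the real bottom
        last_height = new_height
    return max_rounds  # gave up; the page may still be growing
```

You would call `scroll_until_stable(driver)` after `driver.get(...)` and before reading `driver.page_source`. The `max_rounds` cap is there so that a page that never stops growing cannot keep the loop running forever.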
