requests does not return the complete HTML

Posted on 2025-02-05 18:54:33

I have the following code:

    import requests
    from bs4 import BeautifulSoup

    # item_url and headers are assumed to be defined earlier in the script
    response = requests.get(item_url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')
    print(soup)

    product = soup.find_all('a', class_='shelfProductTile-descriptionLink')
    print(product)
    price_per_weight = soup.find_all('div', class_='shelfProductTile-cupPrice ng-star-inserted')
    print(price_per_weight)

It requests this URL: https://www.woolworths.com.au/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance

I have tried both the lxml and html.parser parsers, but the classes used in the find_all calls above do not appear in the HTML that requests returns. I have also tried cloudscraper, as suggested in "Beautiful Soup find_all return None", but I still get an empty list for both product and price_per_weight.

Can this information be scraped with Beautiful Soup, or do I need another tool such as Scrapy? (I would prefer not to use Selenium if possible.)

Comments (1)

慈悲佛祖 2025-02-12 18:54:33

The data you see is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it. To load the data, you can use the following example:

import json
import requests

url = "https://www.woolworths.com.au/apis/ui/Search/products"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
}

payload = {
    "Filters": [],
    "IsSpecial": False,
    "Location": "/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance",
    "PageNumber": 1,
    "PageSize": 36,
    "SearchTerm": "uncle tobys oats 500g",
    "SortType": "TraderRelevance",
}


with requests.session() as s:
    # load cookies
    s.get(
        "https://www.woolworths.com.au/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance",
        headers=headers,
    )

    # load actual data
    data = s.post(url, json=payload, headers=headers).json()

    # uncomment to print all data:
    # print(json.dumps(data, indent=4))

    for p in data["Products"]:
        print("{:<60} {}$".format(p["DisplayName"], p["Products"][0]["Price"]))

Prints:

Uncle Tobys Oats Traditional Rolled Oats Porridge 500g       4.5$
Uncle Tobys Oats Quick Oats Porridge Porridge 500g           4.5$
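
The loop above indexes p["Products"][0]["Price"] directly, so it would raise an exception if an entry had an empty inner list or no price. A more defensive extraction could look roughly like the sketch below. The helper name extract_prices and the sample dictionary are hypothetical; only the keys "Products", "DisplayName", and "Price" come from the example above, and the real response may contain shapes not handled here.

```python
from typing import Any, Dict, List, Optional, Tuple


def extract_prices(data: Dict[str, Any]) -> List[Tuple[str, Optional[float]]]:
    """Return (display name, price) pairs from a search response.

    Hypothetical helper: assumes data["Products"] is a list of groups,
    each with a "DisplayName" and an inner "Products" list whose first
    item carries a "Price", as in the example above.
    """
    rows = []
    for group in data.get("Products") or []:
        inner = group.get("Products") or []
        first = inner[0] if inner else {}
        # A missing inner product or price yields None instead of an exception.
        rows.append((group.get("DisplayName", ""), first.get("Price")))
    return rows


# Hand-written sample shaped like the response above; not real API output.
sample = {
    "Products": [
        {
            "DisplayName": "Uncle Tobys Oats Traditional Rolled Oats Porridge 500g",
            "Products": [{"Price": 4.5}],
        },
        {"DisplayName": "Entry without a price", "Products": []},
    ]
}

for name, price in extract_prices(sample):
    print("{:<60} {}$".format(name, price))
```

With the session code above, the same helper could be called as extract_prices(data) in place of the final loop.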

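To confirm that the product markup is missing from the response itself, rather than being dropped by the parser, one option is to search the raw response text for the class names before parsing. If they are absent, switching between lxml and html.parser cannot help. The sketch below is only an illustration: missing_classes is a hypothetical helper, and both HTML strings are hand-written stand-ins (an unrendered page shell and a browser-rendered fragment), not actual Woolworths responses.

```python
def missing_classes(html, class_names):
    """Return the class names that never appear in the raw HTML text."""
    return [name for name in class_names if name not in html]


wanted = ["shelfProductTile-descriptionLink", "shelfProductTile-cupPrice"]

# Stand-in for what requests receives: an empty shell plus a script tag.
shell_html = (
    "<html><body><app-root></app-root>"
    '<script src="main.js"></script></body></html>'
)

# Stand-in for the DOM after JavaScript has run in a browser.
rendered_html = (
    '<a class="shelfProductTile-descriptionLink">Oats</a>'
    '<div class="shelfProductTile-cupPrice ng-star-inserted">price per 100g</div>'
)

print(missing_classes(shell_html, wanted))     # both names are missing
print(missing_classes(rendered_html, wanted))  # nothing is missing
```

Passing the response text from the question's code as html would be expected to report both names as missing, which is consistent with the empty lists returned by find_all.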