
Python beautifulsoup web-scraping

Requests HTML not returning the full HTML

Posted 2025-02-05 18:54:33

I have the following code:

    import requests
    from bs4 import BeautifulSoup

    # item_url and headers are assumed to be defined earlier in the script
    response = requests.get(item_url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')
    print(soup)

    product = soup.find_all('a', class_='shelfProductTile-descriptionLink')
    print(product)
    price_per_weight = soup.find_all('div', class_='shelfProductTile-cupPrice ng-star-inserted')
    print(price_per_weight)

from the URL: https://www.woolworths.com.au/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance

I have tried both the lxml and html.parser parsers, but the classes used in the variables above do not appear in the HTML that requests returns. I have also tried cloudscraper, as suggested in "Beautiful Soup find_all return None", but I still get an empty list for both product and price_per_weight.

Can this information be scraped using Beautiful Soup, or do I need another tool such as Scrapy? (I would prefer not to use Selenium if possible.)
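
One way to confirm whether the target classes exist in the raw response at all, before suspecting the parser, is to collect every class name in the fetched markup. The sketch below uses only the standard library. `ClassCollector` and `classes_in` are hypothetical helper names, and the two HTML strings are made-up stand-ins for a server-rendered page and an empty JavaScript application shell, not real output from the site.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a piece of HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in(html):
    """Return the set of class names found in the given HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Made-up examples: one page with the tile already in the markup, and one
# empty application shell that a script would fill in later.
server_rendered = '<a class="shelfProductTile-descriptionLink" href="#">Oats</a>'
js_shell = '<html><body><app-root></app-root><script src="main.js"></script></body></html>'

target = "shelfProductTile-descriptionLink"
print(target in classes_in(server_rendered))
print(target in classes_in(js_shell))
```

Running `classes_in(response)` on the text fetched in the question would show which classes the server actually sent. If the target class is missing there, no choice of parser can recover it.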


慈悲佛祖 2025-02-12 18:54:33

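The data you see is loaded from an external URL via JavaScript, so BeautifulSoup never sees it in the HTML that requests returns. To load the data, you can call that URL directly, as in the following example: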

import json
import requests

url = "https://www.woolworths.com.au/apis/ui/Search/products"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0",
}

payload = {
    "Filters": [],
    "IsSpecial": False,
    "Location": "/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance",
    "PageNumber": 1,
    "PageSize": 36,
    "SearchTerm": "uncle tobys oats 500g",
    "SortType": "TraderRelevance",
}


with requests.Session() as s:
    # load cookies
    s.get(
        "https://www.woolworths.com.au/shop/search/products?searchTerm=uncle%20tobys%20oats%20500g&sortBy=TraderRelevance",
        headers=headers,
    )

    # load actual data
    data = s.post(url, json=payload, headers=headers).json()

    # uncomment to print all data:
    # print(json.dumps(data, indent=4))

    for p in data["Products"]:
        print("{:<60} {}$".format(p["DisplayName"], p["Products"][0]["Price"]))

Prints:

Uncle Tobys Oats Traditional Rolled Oats Porridge 500g       4.5$
Uncle Tobys Oats Quick Oats Porridge Porridge 500g           4.5$
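
The loop above indexes straight into the JSON, so it raises an error if a tile has no inner `Products` list or no `Price`. A minimal sketch of a more defensive extraction follows, assuming only the structure that loop already relies on (an outer `Products` list whose items carry `DisplayName` and an inner `Products` list with `Price`). `extract_prices` is a hypothetical helper, and `sample` is made-up data shaped like that response, not real API output.

```python
def extract_prices(data):
    """Return (display name, price) pairs from a search response dict.

    Tiles without a name or an inner product are skipped; a missing
    price is reported as None.
    """
    results = []
    for tile in data.get("Products") or []:
        inner = tile.get("Products") or []
        name = tile.get("DisplayName")
        if not name or not inner:
            continue
        results.append((name, inner[0].get("Price")))
    return results


# Made-up sample mirroring the keys used in the loop above.
sample = {
    "Products": [
        {"DisplayName": "Rolled Oats 500g", "Products": [{"Price": 4.5}]},
        {"DisplayName": "Sold Out Oats 500g", "Products": [{}]},
        {"DisplayName": "Broken Tile", "Products": []},
    ]
}

for name, price in extract_prices(sample):
    print("{:<60} {}$".format(name, price))
```

Other fields, such as a per-weight price, may be present in the response, but their names are not confirmed here. Printing the full JSON, as the commented-out line in the example suggests, is the way to find them.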
