BeautifulSoup - html parser returns dots instead of the string visible on the web page

Posted on 2025-01-12 04:05:56

I'm trying to get the number of actors listed at https://apify.com/store. The number sits in the following HTML:

<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>

When I send a GET request and parse the response with BeautifulSoup:

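import requests
from bs4 import BeautifulSoup
def get_actor_count(base_url):  # wrapper name added for illustration only
    r = requests.get(base_url)
    soup = BeautifulSoup(r.text, "html.parser")
    return soup.find("span", class_="ActorStore-statusNbHitsNumber").text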

I get three dots (...) instead of the number 895. The parsed element is <span class="ActorStore-statusNbHitsNumber">...</span>

How can I get the number?

夏日浅笑〃 2025-01-19 04:05:56

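If you inspect the network calls in your browser (press F12) and filter by XHR, you'll see that the count is not part of the initial HTML. The page loads it dynamically with a POST request after it renders, which is why requests, which does not run JavaScript, only ever sees the ... placeholder.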
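The effect of dynamic loading can be reproduced offline. The sketch below is only an illustration: it assumes the server sends the span exactly as quoted in the question, with ... as its text, and it swaps in the standard library's html.parser for BeautifulSoup so that it runs without any installs. The class name SpanText and the constant STATIC_HTML are made up for this example.

```python
from html.parser import HTMLParser

# Assumed server response: the span holds a "..." placeholder until
# JavaScript running in the browser replaces it with the real count.
STATIC_HTML = (
    '<div class="ActorStore-statusNbHits">'
    '<span class="ActorStore-statusNbHitsNumber">...</span>results</div>'
)


class SpanText(HTMLParser):
    """Collect the text inside spans that carry a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.css_class:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text += data


parser = SpanText("ActorStore-statusNbHitsNumber")
parser.feed(STATIC_HTML)
parser.close()
print(parser.text)  # prints "..."
```

Any parser that works on the raw response behaves the same way, because the number is simply not in the markup it receives.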

You can mimic that request by sending the same JSON payload yourself. BeautifulSoup isn't needed; the requests module alone is enough.

Here is a complete working example:

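import requests

# JSON body copied from the request the store page sends (the XHR entry
# in the network tab). An empty "query" matches every record.
data = {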
    "query": "",
    "page": 0,
    "hitsPerPage": 24,
    "restrictSearchableAttributes": [],
    "attributesToHighlight": [],
    "attributesToRetrieve": [
        "title",
        "name",
        "username",
        "userFullName",
        "stats",
        "description",
        "pictureUrl",
        "userPictureUrl",
        "notice",
        "currentPricingInfo",
    ],
}
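# Algolia endpoint, application ID, and API key exactly as they appear
# in the browser's request.
url = "https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7"
response = requests.post(url, json=data)
response.raise_for_status()  # stop on an HTTP error instead of parsing an error body

# nbHits is the total number of records matching the query.
print(response.json()["nbHits"])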

This prints the total hit count, which is the number the store page displays.

To see the whole JSON response and the other keys it contains, pretty-print it:

from pprint import pprint
pprint(response.json(), indent=4)

Partial output:

{   'exhaustiveNbHits': True,
    'exhaustiveTypo': True,
    'hits': [   {   'currentPricingInfo': None,
                    'description': 'Crawls arbitrary websites using the Chrome '
                                   'browser and extracts data from pages using '
                                   'a provided JavaScript code. The actor '
                                   'supports both recursive crawling and lists '
                                   'of URLs and automatically manages '
                                   'concurrency for maximum performance. This '
                                   "is Apify's basic tool for web crawling and "
                                   'scraping.',
                    'name': 'web-scraper',
                    'objectID': 'moJRLRc85AitArpNN',
                    'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
                    'stats': {   'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
                                 'totalBuilds': 104,
                                 'totalMetamorphs': 102660,
                                 'totalRuns': 68036112,
                                 'totalUsers': 23492,
                                 'totalUsers30Days': 1726,
                                 'totalUsers7Days': 964,
                                 'totalUsers90Days': 3205},
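Once the response shape is known, pulling out individual fields is plain dictionary access. The following is a rough sketch rather than part of the original answer: the helper names summarize_hits and page_count are hypothetical, the sample dictionary is a trimmed copy of the partial output above, and 895 is just the count quoted in the question, not a live result.

```python
import math

# Trimmed sample shaped like the partial output above. The nbHits value
# is the figure quoted in the question, used here only for illustration.
sample = {
    "nbHits": 895,
    "hits": [
        {
            "name": "web-scraper",
            "stats": {"totalUsers": 23492, "totalRuns": 68036112},
        },
    ],
}


def summarize_hits(payload):
    """Return (name, totalUsers) pairs; a hit without stats yields None."""
    return [
        (hit.get("name"), (hit.get("stats") or {}).get("totalUsers"))
        for hit in payload.get("hits", [])
    ]


def page_count(nb_hits, hits_per_page):
    """Number of requests needed to cover nb_hits at hits_per_page each."""
    return math.ceil(nb_hits / hits_per_page)


print(summarize_hits(sample))
# 24 is the hitsPerPage value sent in the request body above.
print(page_count(sample["nbHits"], 24))
```

To walk through every actor rather than just the first batch, the same request could be repeated with "page" set to 0, 1, 2 and so on up to the computed page count, assuming the endpoint allows paging that far.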
