Web scraping with BeautifulSoup in Python

Posted on 2025-01-17 09:33:19, 887 characters, 2 views, 0 comments

How can I use the json module to extract the price from a page that provides its data in JSON format in an inline script?

I tried to extract the price from https://glomark.lk/top-crust-bread/p/13676, but I couldn't get the price value.

Please help me solve this.

import requests
import json

import sys
sys.path.insert(0,'bs4.zip')
from bs4 import BeautifulSoup

user_agent = {
                 'User-agent': 'Mozilla/5.0 Chrome/35.0.1916.47'
                 }
headers = user_agent

url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url, headers = headers)
soup = BeautifulSoup(req.content, 'html.parser')

products = soup.find_all("div", class_ = "details col-12 col-sm-12 col-md-6 col-lg-5 col-xl-5")
for product in products:
    product_name = product.h1.text
    product_price = product.find(id = 'product-promotion-price').text
    print(product_name)
    print(product_price)


Comments (2)

自由如风 2025-01-24 09:33:19

You can fetch the price as JSON from the site's hidden API using only the requests module. The product name, however, is not loaded dynamically, so it can be read directly from the page HTML.

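import requests

# Headers marking the request as an XHR call with a JSON content type
headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest',
}

api_url = "https://glomark.lk/product-page/variation-detail/13676"

# The endpoint is called with POST and responds with JSON
json_data = requests.post(api_url, headers=headers).json()

price = json_data['price']
print(price)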

Output:

95

Full working code:

import requests
from bs4 import BeautifulSoup

# Headers marking the request as an XHR call with a JSON content type
headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest',
}

api_url = "https://glomark.lk/product-page/variation-detail/13676"

# The price comes from the API as JSON
json_data = requests.post(api_url, headers=headers).json()
price = json_data['price']

# The product name is not loaded dynamically, so read it from the page HTML
url = 'https://glomark.lk/top-crust-bread/p/13676'
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

title = soup.select_one('.product-title h1').text
print(title)
print(price)


 

Output:

Top Crust Bread
95
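
A slightly more defensive variant of the API call above, offered only as a sketch: it adds a timeout, raises on HTTP errors, and returns None when the 'price' key is missing. The names build_api_url, price_from_payload and fetch_price are hypothetical, and the idea that other product IDs can be substituted into the same URL is an assumption generalized from the single URL shown in this answer.

```python
import requests

# Assumption: other product IDs follow the same pattern as the URL above.
API_TEMPLATE = "https://glomark.lk/product-page/variation-detail/{product_id}"
HEADERS = {
    "content-type": "application/json",
    "x-requested-with": "XMLHttpRequest",
}


def build_api_url(product_id):
    """Build the variation-detail URL for a product ID (assumed pattern)."""
    return API_TEMPLATE.format(product_id=product_id)


def price_from_payload(payload):
    """Return payload['price'], or None if the payload has no such key."""
    if not isinstance(payload, dict):
        return None
    return payload.get("price")


def fetch_price(product_id, timeout=10):
    """POST to the endpoint and return the price, raising on HTTP errors."""
    response = requests.post(
        build_api_url(product_id), headers=HEADERS, timeout=timeout
    )
    # Fail loudly on 4xx/5xx instead of trying to parse an error page.
    response.raise_for_status()
    return price_from_payload(response.json())
```

Calling fetch_price(13676) performs the same request as the snippet above; the helper functions can be checked without touching the network.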
     
木森分化 2025-01-24 09:33:19

As mentioned, the content is rendered dynamically by JavaScript, so one approach is to take the data directly from the script tag, which you had already identified in your question.

data = json.loads(soup.select_one('[type="application/ld+json"]').text)

This will give you a dict with the product information:

{'@context': 'https://schema.org', '@type': 'Product', 'productID': '13676', 'name': 'Top Crust Bread', 'description': 'Top Crust Bread', 'url': '/top-crust-bread/p/13676', 'image': 'https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg', 'brand': 'GLOMARK', 'offers': [{'@type': 'Offer', 'price': '95', 'priceCurrency': 'LKR', 'itemCondition': 'https://schema.org/NewCondition', 'availability': 'https://schema.org/InStock'}]}

Then simply pick the information you need, such as the price:

data['offers'][0]['price']
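
The lookup above assumes that offers is a non-empty list, as it is in the dict shown. schema.org also allows offers to be a single object, and the price arrives as a string, so a more tolerant version might look like the sketch below. The function name offer_price and the sample dict are made up for illustration.

```python
from decimal import Decimal, InvalidOperation


def offer_price(data):
    """Return the first usable offer price as a Decimal, or None if absent."""
    offers = data.get("offers")
    # "offers" may be a single object or a list of objects.
    if isinstance(offers, dict):
        offers = [offers]
    for offer in offers or []:
        raw = offer.get("price")
        if raw is None:
            continue
        try:
            return Decimal(str(raw))
        except InvalidOperation:
            continue  # skip values that are not plain numbers
    return None


# Trimmed-down sample in the shape of the dict shown above.
sample = {
    "name": "Top Crust Bread",
    "offers": [{"@type": "Offer", "price": "95", "priceCurrency": "LKR"}],
}
print(offer_price(sample))  # 95
```

Decimal is used here rather than float so that prices keep their exact decimal value; int or float would also work if exactness does not matter.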

Example

import json

import requests
from bs4 import BeautifulSoup

url = 'https://glomark.lk/top-crust-bread/p/13676'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

data = json.loads(soup.select_one('[type="application/ld+json"]').text)

product_price = data['offers'][0]['price']
product_name = data['name']
product_image = data['image']

print(product_name)
print(product_price)
print(product_image)

Output

Top Crust Bread 
95 
https://objectstorage.ap-mumbai-1.oraclecloud.com/n/softlogicbicloud/b/cdn/o/products/350001--01--1555692328.jpeg
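
Note that select_one returns only the first matching script tag. If a page carried several application/ld+json blocks, something like the following sketch could pick out the one describing the product. The function name find_product_jsonld and the HTML fragment are hypothetical and only illustrate the idea; they are not taken from the actual page.

```python
import json

from bs4 import BeautifulSoup


def find_product_jsonld(html):
    """Return the first JSON-LD object whose @type is Product, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # ignore blocks that are not valid JSON
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None


# Hypothetical page fragment, made up for illustration.
sample_html = """
<html><head>
<script type="application/ld+json">{"@type": "Organization", "name": "Example Shop"}</script>
<script type="application/ld+json">{"@type": "Product", "name": "Top Crust Bread",
 "offers": [{"@type": "Offer", "price": "95", "priceCurrency": "LKR"}]}</script>
</head><body></body></html>
"""

product = find_product_jsonld(sample_html)
print(product["name"], product["offers"][0]["price"])  # Top Crust Bread 95
```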