Scraping product information from a table using Python

Posted on 2025-02-12 20:46:47

I'm unable to scrape the ingredients from a table with my code. Please help me fix it. I only want the ingredient names as output. I've also provided an image of the ingredients table; I only want the ingredient names marked with a red circle.

url = "https://mamaearth.in/product/mamaearth-me-deo-for-a-scent-that-s-unique-to-you-120-ml"
table1 = soup.find('div', class_='CmsItemRevamp-sc-1moss4z-0 eQqUUy CMSContent').text.strip()
table1
mydata = pd.DataFrame(columns = headers)
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row


回忆那么伤 2025-02-19 20:46:48

The information is in the HTML, but the table is rendered using JavaScript. As such, you need to extract it yourself from the JSON contained inside a <script> element of the page.

This could be done as follows:

from bs4 import BeautifulSoup
import requests
import json

url = "https://mamaearth.in/product/mamaearth-me-deo-for-a-scent-that-s-unique-to-you-120-ml"
req = requests.get(url)
soup = BeautifulSoup(req.content, "html.parser")

# The page data is embedded as JSON in the <script id="__NEXT_DATA__"> element
script_text = soup.find('script', id="__NEXT_DATA__").string
data = json.loads(script_text)

# This JSON value is itself an HTML fragment containing the ingredients table
soup_product = BeautifulSoup(data['props']['initialProps']['pageProps']['cmsContent'][5]['content'], "html.parser")

for tr in soup_product.find_all('tr'):
    print(tr.td.get_text(strip=True))   # display just the first td element

I suggest you print(data) to see all of the information that is returned. The hardest part is finding where in the JSON structure the data you need is located.
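Rather than reading the printed JSON by eye, one option is to search it for string values that look like table HTML. The following is only a sketch: find_paths is a hypothetical helper, and sample is an invented stand-in for the page's JSON, not the real payload.

```python
import json


def find_paths(node, marker, path=()):
    """Yield the key/index path to every string value that contains `marker`."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, marker, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, marker, path + (index,))
    elif isinstance(node, str) and marker in node:
        yield path


# Invented stand-in for the __NEXT_DATA__ JSON (assumption, not the real data).
sample = json.loads("""
{"props": {"initialProps": {"pageProps": {"cmsContent": [
    {"content": "<p>About the product</p>"},
    {"content": "<table><tr><td>Ingredient</td></tr><tr><td>Perfume</td></tr></table>"}
]}}}}
""")

# Look for any string value that contains a table row.
paths = list(find_paths(sample, "<tr"))
for path in paths:
    print(path)
```

Each printed path lists the keys and list indexes to follow from the top of the JSON, so it can be turned directly into a chain of subscripts such as the one used above. Running find_paths(data, "<tr") on the real data should point at the entry that holds the table, assuming the table is stored as an HTML string.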

This would give you the following output:

Ingredient
Ethyl Alcohol (95%)
Aqua (D.M.Water)
Propylene Glycol
Perfume

Note: Some of the JSON values contain HTML, which is why a second call to BeautifulSoup is used to parse this embedded HTML.
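The two parsing stages can be seen in isolation with a small offline sketch. It makes two assumptions: the JSON document is invented sample data, and the standard library's html.parser stands in for BeautifulSoup purely so the example runs without third-party packages. FirstCellParser is a hypothetical helper name.

```python
import json
from html.parser import HTMLParser


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> in every <tr>."""

    def __init__(self):
        super().__init__()
        self.cells = []          # first-cell text, one entry per row
        self._cells_in_row = 0   # <td> elements seen so far in the current row
        self._capturing = False  # True while inside the first <td> of a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells_in_row = 0
        elif tag == "td":
            self._cells_in_row += 1
            if self._cells_in_row == 1:
                self._capturing = True
                self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing:
            self.cells[-1] += data


# Invented sample: a JSON document whose "content" value is an HTML string.
raw = (
    '{"content": "<table>'
    '<tr><td>Ingredient</td><td>Type</td></tr>'
    '<tr><td> Perfume </td><td>Fragrance</td></tr>'
    '</table>"}'
)

data = json.loads(raw)          # first parse: JSON text -> Python dict
parser = FirstCellParser()
parser.feed(data["content"])    # second parse: the embedded HTML string
parser.close()

first_cells = [cell.strip() for cell in parser.cells]
print(first_cells)
```

json.loads only decodes the outer document, so the table arrives as an ordinary string. It takes a separate HTML parser to turn that string into rows and cells.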

An alternative approach would be to use something like Selenium to control a browser. This gives you the HTML after the JavaScript has run, as shown in the browser's element inspector (view source shows only the original, unrendered HTML). The downside is that it is much slower and more resource intensive.

To output the ingredients on one line:

ingredients = [tr.td.get_text(strip=True) for tr in soup_product.find_all('tr')][1:]        # [1:] to skip header
print(','.join(ingredients))
