How to loop with Beautiful Soup to put table text into a DataFrame (Python)

Posted on 2025-02-06 06:18:22


Here is the link to the page I am trying to scrape from: https://churchdwight.com/ingredient-disclosure/antiperspirant-deodorant/40002569-ultramax-clear-gel-cool-blast.aspx

Here is my code:

'''
#Scraping a data table from the CHD website
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_ingredients(current_url):
    #Load the CHD website HTML code
    result = requests.get(current_url, verify=False,
                          headers={'User-Agent': "Magic Browser"})

    #Check and see if the page successfully loaded
    if result.status_code == 200:
        #Extract the HTML code and pass it through Beautiful Soup
        document = BeautifulSoup(result.content, 'lxml')

        #Each product page has one ingredient table, so walk the tables until
        #the first header span reads 'INGREDIENT NAME'
        table = document.find("table")
        while table.find("span").get_text() != "INGREDIENT NAME":
            table = table.find_next("table")

        #The cell text lives in styled spans, so find all of them
        rows = table.find_all('span', style='font-size:13px;font-family:"Arial",sans-serif;')

        #Skip the three header spans, then collect each cell's text
        cells_names = []
        for row in rows[3:]:
            cells_names.append(row.get_text(strip=True))

        return pd.DataFrame(cells_names, columns=['Ingredients'])
    else:
        #Print an error if the result status is not 200
        print("Status error " + str(result.status_code) + " has occurred!")
'''

My DataFrame is missing the Lubricant/emulsifier row, and I think it is because that span's style attribute has an extra bit saying color:black;background:white, so it no longer matches the exact style string I search for.

Any help would be much appreciated!
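One likely fix, sketched below with a made-up two-row table standing in for the real page: Beautiful Soup accepts a callable as an attribute filter, so the search can match spans whose style attribute merely *contains* the shared font-size declaration instead of requiring an exact string match. That way spans carrying extra declarations like color:black;background:white are still caught.

```python
from bs4 import BeautifulSoup

# Minimal sketch: the second span carries the extra
# "color:black;background:white" declarations, as on the real page.
html = """
<table>
  <tr><td><span style="font-size:13px;font-family:&quot;Arial&quot;,sans-serif;">Water</span></td></tr>
  <tr><td><span style="font-size:13px;font-family:&quot;Arial&quot;,sans-serif;color:black;background:white">Cyclopentasiloxane</span></td></tr>
</table>
"""

document = BeautifulSoup(html, 'html.parser')

# A callable style filter matches on a substring instead of the exact
# attribute value, so spans with extra declarations are still found.
spans = document.find_all('span', style=lambda s: s and 'font-size:13px' in s)
names = [s.get_text(strip=True) for s in spans]
print(names)  # ['Water', 'Cyclopentasiloxane']
```

An exact-match search with the original style string would find only the first span here; the substring filter finds both.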


Comments (1)

来世叙缘 2025-02-13 06:18:22

You can use pandas alone to grab the table data:

import pandas as pd

df = pd.read_html('https://churchdwight.com/ingredient-disclosure/antiperspirant-deodorant/40002569-ultramax-clear-gel-cool-blast.aspx')[2]
print(df)

Output:

0                            INGREDIENT NAME                            FUNCTION
1                                      Water                             Solvent
2                         Cyclopentasiloxane                Lubricant/emulsifier
3                              SD Alcohol 40                        Drying agent
4                           Propylene glycol                           Humectant
5                                Dimethicone                     Skin protectant
6                  PEG/PPG-18/18 dimethicone                          Emulsifier
7           Sodium bicarbonate (baking soda)                          Deodorizer
8                                  Fragrance                           Fragrance
9  Aluminium zirconium tetrachlorohydrex gly  Active ingredient - antiperspirant
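A note on the hard-coded [2]: read_html also accepts a match argument (a regex) that keeps only tables whose text matches it, which is more robust than guessing the table's position on the page. A minimal offline sketch with a toy HTML string standing in for the page:

```python
import pandas as pd
from io import StringIO

html = """
<table><tr><td>Unrelated</td></tr></table>
<table>
  <tr><td>INGREDIENT NAME</td><td>FUNCTION</td></tr>
  <tr><td>Water</td><td>Solvent</td></tr>
</table>
"""

# match= filters the returned list to tables containing this text,
# so the ingredient table is found without hard-coding its index.
tables = pd.read_html(StringIO(html), match="INGREDIENT NAME")
df = tables[0]
print(df)
```

Against the live page, pd.read_html(url, match="INGREDIENT NAME")[0] should select the ingredient table regardless of how many other tables precede it.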