Pandas DataFrame data cleanup?

Posted on 2025-01-12 14:40:35

I'm trying to clean up some data I've scraped into an Excel sheet, but I'm getting extra information that I'd like to remove. Can someone tell me how to determine which level I need to drop using pandas?

My code so far:

soup1 = BeautifulSoup(driver.page_source,'html.parser')  
df1 = pd.read_html(str(soup1))[0]
print(df1)

This pulls out the data below.

The info I need is highlighted in red; everything else is useless data I don't need.

I'm not sure if it's needed, but the data is being pulled from this table.


话少心凉 2025-01-19 14:40:35

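You can try the following, which keeps only the rows where the '案件编号' column is not null and is not a repeated header row:
df = df.loc[df['案件编号'].notna() & (df['案件编号'] != '案件编号')]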
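
To see what that filter does, here is a minimal sketch on made-up data. The sample values and the `status` column are hypothetical and only stand in for a scraped table where a header row is repeated in the body and an empty row is mixed in. The real table may look different.

```python
import pandas as pd

# Hypothetical scraped table: a repeated header row and an empty row
# are mixed in with the real records.
df = pd.DataFrame({
    '案件编号': ['A-001', '案件编号', None, 'A-002'],
    'status': ['open', 'status', None, 'closed'],
})

# Keep rows whose key column is present and is not a header repeat.
mask = df['案件编号'].notna() & (df['案件编号'] != '案件编号')
clean = df.loc[mask].reset_index(drop=True)

print(clean)
```

Building the boolean mask separately makes it easy to print it and check which rows will be dropped before applying it.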


葬花如无物 2025-01-19 14:40:35

First, you need to understand how a standard HTML table is structured, for example:

<table>
  <tr>
    <th></th>
  </tr>
  <tr>
    <td></td>
  </tr>
  <tr>
    <td></td>
  </tr>
</table>

Now you can use the find_all method to find everything related to the table, but I think it is best to read the BeautifulSoup documentation and look up the correct way to find the data in your table.

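import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_table(url):
    # Download the page and parse the HTML.
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Take the fifth <table> on the page (index 4), as in this example.
    table = soup.find_all('table')[4]

    rows = []
    for tr in table.find_all('tr'):
        # Collect the text of every header (<th>) and data (<td>) cell.
        row = [cell.get_text().replace('\n', '') for cell in tr.find_all(['th', 'td'])]
        if row:
            rows.append(row)

    # The first row becomes the header; the remaining rows are the data.
    df = pd.DataFrame(rows[1:], columns=rows[0])
    return df

# Replace 'url' with the address of the page that contains the table.
data = get_table('url')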
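
On the original question of which level to drop: the screenshot is not available here, so this is only a sketch under the assumption that the extra information appears as stacked header rows, which `pd.read_html` turns into multi-level column labels. The frame and its labels (`Report`, `Case Number`, `Status`) are hypothetical. The idea is to inspect the column levels first and then drop the one you do not need.

```python
import pandas as pd

# Hypothetical frame with two header levels, similar to what pd.read_html
# can return for a table with stacked header rows.
columns = pd.MultiIndex.from_tuples([
    ('Report', 'Case Number'),
    ('Report', 'Status'),
])
df = pd.DataFrame([['A-001', 'open'], ['A-002', 'closed']], columns=columns)

# Inspect the header levels before deciding which one to drop.
print(df.columns.nlevels)              # number of header levels
print(df.columns.get_level_values(0))  # labels on the outer level

# Drop the outer level (level 0) and keep the inner labels.
flat = df.copy()
flat.columns = flat.columns.droplevel(0)
print(flat.columns.tolist())
```

If `df.columns.nlevels` is 1, the headers are already flat and the extra data is in the rows instead, in which case a row filter is the better tool.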
