drop_duplicates in pandas on a large dataset



I am new to pandas, so apologies for the naive question.

I have two dataframes. One is out.hdf:

999999  2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074
999999  2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292
999999  2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030
999999  2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340

and the other is out.res (the first column is the station name):

061Z    56.72   0.0 P   603879074
061Z    29.92   0.0 P   603879074
0614    46.24   0.0 P   603879292
109C    87.51   0.0 P   603947030
113A    66.93   0.0 P   603947030
113A    26.93   0.0 P   603947030
121A    31.49   0.0 P   603947340

The last column in both dataframes is an ID.
I want to create a new dataframe that puts the rows with matching IDs from the two dataframes together, like this: first read a line from hdf, then put the lines from res with the same ID beneath it, but without keeping the ID column from res.

The new dataframe:

"999999 2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074"
061Z    56.72   0.0 P
061Z    29.92   0.0 P
"999999 2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292"
0614    46.24   0.0 P
"999999 2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030"
109C    87.51   0.0 P
113A    66.93   0.0 P
113A    26.93   0.0 P
"999999 2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340"
121A    31.49   0.0 P

My code to do this is:

import csv
import pandas as pd

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')

    with open('./out.hdf', 'r') as hdf_file:
        for i, line in enumerate(hdf_file):
            # write the raw event line from out.hdf as a single field
            writer.writerow([line.strip()])
            # then write every res row whose ID (column 4) matches
            # this event's ID (column 14), dropping the ID column
            for j in range(len(res)):
                if res.iloc[j, 4] == hdf.iloc[i, 14]:
                    writer.writerow(res.iloc[j, [0, 1, 2, 3]])
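(Editor's note: the nested loop above rescans all of res for every line of hdf, which gets slow for large files. A sketch of a faster variant, under the same assumptions about the file layout as the code above, groups res by its ID column once and then looks each event up directly:)

import csv
import pandas as pd

hdf = pd.read_csv('./out.hdf', delimiter='\t', header=None)
res = pd.read_csv('./out.res', delimiter='\t', header=None)

# group the res rows by their ID column (index 4) once up front, so each
# event lookup is a dict access instead of a full scan of res
res_by_id = {event_id: group for event_id, group in res.groupby(4)}

with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    with open('./out.hdf', 'r') as hdf_file:
        for i, line in enumerate(hdf_file):
            writer.writerow([line.strip()])
            matches = res_by_id.get(hdf.iloc[i, 14])
            if matches is not None:
                # write the matching station rows without the ID column
                for _, row in matches.iterrows():
                    writer.writerow(row.iloc[[0, 1, 2, 3]])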

The goal is to keep only unique stations in the third dataframe. Before creating it, I used these commands on res to keep unique stations:

res.drop_duplicates([0], keep='last', inplace=True)

and

res.groupby([0], as_index=False).last()
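(Note that, unlike the inplace drop_duplicates call, groupby(...).last() returns a new DataFrame rather than modifying res, so its result needs to be assigned back:)

res = res.groupby([0], as_index=False).last()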

and it works fine. The problem is that for a large dataset with thousands of lines, using these commands causes some lines of the res file to be omitted from the third dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I am going crazy; thanks in advance for your time and help.


Answer (伏妖词, 2025-01-20 22:03:56):


I found the problem and hope this helps others in the future.
In the large dataset, the duplicated stations repeat many times, but not consecutively, and drop_duplicates() keeps only one of them.
However, I wanted to drop only the consecutive duplicate stations, not all of them. I did this using shift:

unique_stations = res.loc[res[0].shift() != res[0]]
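(As a minimal, self-contained check of the shift trick, using the station column from the sample out.res above and assuming column 0 holds the station names:)

import pandas as pd

# station column from the sample out.res above
res = pd.DataFrame({0: ['061Z', '061Z', '0614', '109C', '113A', '113A', '121A']})

# keep a row only when its station differs from the previous row:
# consecutive repeats are dropped, non-adjacent repeats are kept
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations[0].tolist())  # ['061Z', '0614', '109C', '113A', '121A']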