drop_duplicates in pandas on a large dataset



I am new to pandas, so apologies for the naive question.

I have two dataframes. One is out.hdf:

999999  2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074
999999  2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292
999999  2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030
999999  2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340

and the other is out.res (the first column is the station name):

061Z    56.72   0.0 P   603879074
061Z    29.92   0.0 P   603879074
0614    46.24   0.0 P   603879292
109C    87.51   0.0 P   603947030
113A    66.93   0.0 P   603947030
113A    26.93   0.0 P   603947030
121A    31.49   0.0 P   603947340

The last column in both dataframes is an ID.
I want to create a new dataframe that puts the rows with matching IDs from the two dataframes together, like this: first read a line from hdf, then put the lines from res with the same ID beneath it, but without keeping the ID column from res.

The new dataframe:

"999999 2014    1   2   15  19  45.19   14.095  -91.528 69.7    4.5 0.0 0.0 0.0 603879074"
061Z    56.72   0.0 P
061Z    29.92   0.0 P
"999999 2014    1   2   23  53  57.58   16.128  -97.815 23.2    4.8 0.0 0.0 0.0 603879292"
0614    46.24   0.0 P
"999999 2014    1   9   12  27  10.98   13.265  -89.835 55.0    4.5 0.0 0.0 0.0 603947030"
109C    87.51   0.0 P
113A    66.93   0.0 P
113A    26.93   0.0 P
"999999 2014    1   9   20  57  44.88   23.273  -80.778 15.0    5.1 0.0 0.0 0.0 603947340"
121A    31.49   0.0 P

My code to do this is:

import csv
import pandas as pd

path = './'
hdf = pd.read_csv(path + 'out.hdf', delimiter='\t', header=None)
res = pd.read_csv(path + 'out.res', delimiter='\t', header=None)

### creating input in the format of ph2dt-jp/ph
with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')

    with open('./out.hdf', 'r') as hdf_file:
        for i, line in enumerate(hdf_file):
            # write the raw event line from out.hdf as a single field
            writer.writerow([line.strip()])
            # then write every res row whose ID (column 4) matches
            # this event's ID (column 14), dropping the ID column
            for j in range(len(res)):
                if res.iloc[j, 4] == hdf.iloc[i, 14]:
                    writer.writerow(res.iloc[j, [0, 1, 2, 3]])
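(Editor's note: the nested loop above rescans all of res for every line of hdf, which gets slow for large files. A sketch of a faster variant, under the same assumptions about the file layout as the code above, groups res by its ID column once and then looks each event up directly:)

import csv
import pandas as pd

hdf = pd.read_csv('./out.hdf', delimiter='\t', header=None)
res = pd.read_csv('./out.res', delimiter='\t', header=None)

# group the res rows by their ID column (index 4) once up front, so each
# event lookup is a dict access instead of a full scan of res
res_by_id = {event_id: group for event_id, group in res.groupby(4)}

with open('./new_df', 'w', encoding='UTF8') as f:
    writer = csv.writer(f, delimiter='\t')
    with open('./out.hdf', 'r') as hdf_file:
        for i, line in enumerate(hdf_file):
            writer.writerow([line.strip()])
            matches = res_by_id.get(hdf.iloc[i, 14])
            if matches is not None:
                # write the matching station rows without the ID column
                for _, row in matches.iterrows():
                    writer.writerow(row.iloc[[0, 1, 2, 3]])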

The goal is to keep only unique stations in the third dataframe. Before creating it, I used these commands on res to keep unique stations:

res.drop_duplicates([0], keep='last', inplace=True)

and

res.groupby([0], as_index=False).last()
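(Note that, unlike the inplace drop_duplicates call, groupby(...).last() returns a new DataFrame rather than modifying res, so its result needs to be assigned back:)

res = res.groupby([0], as_index=False).last()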

and it works fine. The problem is that for a large dataset with thousands of lines, using these commands causes some lines of the res file to be omitted from the third dataframe.
Could you please let me know what I should do to get the same result for a large dataset?
I am going crazy; thanks in advance for your time and help.


Answer (伏妖词, 2025-01-20 22:03:56):


I found the problem and hope this helps others in the future.
In the large dataset, the duplicated stations repeat many times, but not consecutively, and drop_duplicates() keeps only one of them.
However, I wanted to drop only the consecutive duplicate stations, not all of them. I did this using shift:

unique_stations = res.loc[res[0].shift() != res[0]]
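(As a minimal, self-contained check of the shift trick, using the station column from the sample out.res above and assuming column 0 holds the station names:)

import pandas as pd

# station column from the sample out.res above
res = pd.DataFrame({0: ['061Z', '061Z', '0614', '109C', '113A', '113A', '121A']})

# keep a row only when its station differs from the previous row:
# consecutive repeats are dropped, non-adjacent repeats are kept
unique_stations = res.loc[res[0].shift() != res[0]]
print(unique_stations[0].tolist())  # ['061Z', '0614', '109C', '113A', '121A']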