您如何从dask掉下行,值计数不符合某个阈值?
我正在使用一个相当大的数据集。未压缩的CSV约为20 GB。我正在尝试使用dask,但对此并不熟悉。我通常使用熊猫。我试图将列中特定值的实例数量的数量小于一定阈值的情况下丢弃行。以下是一个示例:
原始数据集:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|123456 | IBE6827 | EC-LUK | A333 |
|123456 | IBE6827 | EC-LUK | A333 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
如果阈值为3,则结果框架将是:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
我找到了一种方法,但似乎非常复杂。我觉得应该有一种简单的方法。在熊猫中,这就是这样:
threshold = 3
inputFrame = inputFrame.groupby('icao').filter(lambda x: len(x) >= threshold)
但是,dask中没有过滤器()。这是我要工作的综合代码:
threshold = 3
a = inputFrame.groupby('icao').count().reg
a = a.to_frame()
a = a.rename(columns={'reg':'count'})
inputFrame = inputFrame.merge(a, how='left', on='icao')
inputFrame = inputFrame[(inputFrame['count'] >= threshold )]
是否有一种更简单的方法可以做到这一点?
I'm working with a fairly large dataset. The uncompressed CSV is about 20 GB. I'm trying to use Dask, but am not very familiar with it. I usually use Pandas. I'm trying to drop rows where the number of instances of a particular value in a column are less than a certain threshold. Here is an example:
Original dataset:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|123456 | IBE6827 | EC-LUK | A333 |
|123456 | IBE6827 | EC-LUK | A333 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
If the threshold is 3, the resulting dataframe would be:
|icao | callsign | reg | acftType |
|-------| -------- |------- | -------- |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|abcdef | ETH720 | ET-ASJ | B738 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
|789ghi | FRH571 | OO-ACE | B744 |
I found a way to do it, but it seems very convoluted. I feel like there should be a simpler way. In Pandas it would be this:
threshold = 3
inputFrame = inputFrame.groupby('icao').filter(lambda x: len(x) >= threshold)
However, there is no filter() in Dask. Here is the convoluted code that I got to work:
threshold = 3
a = inputFrame.groupby('icao').count().reg
a = a.to_frame()
a = a.rename(columns={'reg':'count'})
inputFrame = inputFrame.merge(a, how='left', on='icao')
inputFrame = inputFrame[(inputFrame['count'] >= threshold )]
Is there an easier way to do this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
这个问题的答案可能取决于“更轻松”的定义,但这是做到这一点的另一种方法:
策略#1 :构建
iCao
用虚拟列系列对于Groupby和Count,请与初始DF一起加入,然后删除虚拟列。策略#2 :进行groupby并计数,然后从初始DF删除所有剩余的列,并与初始DF一起加入。
测试输入:
输出:
The answer to this question may depend on the definition of 'easier', but here are two alternative ways to do it:
Strategy #1: Build
icao
series up with dummy column for groupby and count, join with initial df, then drop dummy column.Strategy #2: Do groupby and count, then drop all remaining columns from initial df and join with initial df.
Test input:
Output: