Pandas Merging 101


  • How can I perform a (INNER| (LEFT|RIGHT|FULL) OUTER) JOIN with pandas?
  • How do I add NaNs for missing rows after a merge?
  • How do I get rid of NaNs after merging?
  • Can I merge on the index?
  • How do I merge multiple DataFrames?
  • Cross join with pandas
  • merge? join? concat? update? Who? What? Why?!

... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.

This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting and this post on concatenation, which I will be touching on later).

Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.


Table of Contents

For ease of access.


秋日私语 2025-01-28 08:19:15

This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.

In particular, here's what this post will go through:

  • The basics - types of joins (LEFT, RIGHT, OUTER, INNER)

    • merging with different column names
    • merging with multiple columns
    • avoiding duplicate merge key column in output

What this post (and other posts by me on this thread) will not go through:

  • Performance-related discussions and timings (for now). Notable mentions of better alternatives are made wherever appropriate.
  • Handling suffixes, removing extra columns, renaming outputs, and other specific use cases. There are other (read: better) posts that deal with that, so figure it out!

Note
Most examples default to INNER JOIN operations while demonstrating various features, unless otherwise specified.

Furthermore, all the DataFrames here can be copied and replicated so you can play with them. Also, see this post on how to read DataFrames from your clipboard.
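For reference, a minimal sketch of the clipboard route (pd.read_clipboard parses clipboard text much like read_csv):

import pandas as pd

# Copy a plain-text table (e.g. one of the frames printed below), then:
df = pd.read_clipboard()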

Lastly, all visual representations of JOIN operations have been hand-drawn using Google Drawings. Inspiration from here.



Enough talk - just show me how to use merge!

Setup & Basics

import pandas as pd
import numpy as np

np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

left

  key     value
0   A  1.764052
1   B  0.400157
2   C  0.978738
3   D  2.240893

right

  key     value
0   B  1.867558
1   D -0.977278
2   E  0.950088
3   F -0.151357

For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by: (figure omitted)

Note
This, along with the forthcoming figures, follows this convention:

  • blue indicates rows that are present in the merge result
  • red indicates rows that are excluded from the result (i.e., removed)
  • green indicates missing values that are replaced with NaNs in the result

To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments.

left.merge(right, on='key')
# Or, if you want to be explicit
# left.merge(right, on='key', how='inner')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278

This returns only rows from left and right which share a common key (in this example, "B" and "D").

A LEFT OUTER JOIN, or LEFT JOIN, is represented by: (figure omitted)

This can be performed by specifying how='left'.

left.merge(right, on='key', how='left')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278

Carefully note the placement of NaNs here. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.

And similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, which is: (figure omitted)

Specify how='right':

left.merge(right, on='key', how='right')

  key   value_x   value_y
0   B  0.400157  1.867558
1   D  2.240893 -0.977278
2   E       NaN  0.950088
3   F       NaN -0.151357

Here, keys from right are used, and missing data from left is replaced by NaN.

Finally, for the FULL OUTER JOIN, given by: (figure omitted)

specify how='outer'.

left.merge(right, on='key', how='outer')

  key   value_x   value_y
0   A  1.764052       NaN
1   B  0.400157  1.867558
2   C  0.978738       NaN
3   D  2.240893 -0.977278
4   E       NaN  0.950088
5   F       NaN -0.151357

This uses the keys from both frames, and NaNs are inserted for missing rows in both.

The documentation summarizes these various merges nicely:

(figure: join-type summary from the pandas documentation)
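In code, a quick way to eyeball all four joins side by side is a small loop over the how argument (a sketch reusing the frames above):

# Print the result of each join type on the same key
for how in ['inner', 'left', 'right', 'outer']:
    print(f"how={how!r}")
    print(left.merge(right, on='key', how=how), end='\n\n')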


Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs

If you need LEFT-Excluding JOINs or RIGHT-Excluding JOINs, they can be done in two steps.

For a LEFT-Excluding JOIN, represented as (figure omitted),

start by performing a LEFT OUTER JOIN and then filtering to rows coming from left only (excluding everything from the right):

(left.merge(right, on='key', how='left', indicator=True)
     .query('_merge == "left_only"')
     .drop('_merge', axis=1))

  key   value_x  value_y
0   A  1.764052      NaN
2   C  0.978738      NaN

Where,

left.merge(right, on='key', how='left', indicator=True)

  key   value_x   value_y     _merge
0   A  1.764052       NaN  left_only
1   B  0.400157  1.867558       both
2   C  0.978738       NaN  left_only
3   D  2.240893 -0.977278       both

And similarly, for a RIGHT-Excluding JOIN,

(left.merge(right, on='key', how='right', indicator=True)
     .query('_merge == "right_only"')
     .drop('_merge', axis=1))

  key  value_x   value_y
2   E      NaN  0.950088
3   F      NaN -0.151357

Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (IOW, performing an ANTI-JOIN), you can do this in similar fashion:

(left.merge(right, on='key', how='outer', indicator=True)
     .query('_merge != "both"')
     .drop('_merge', axis=1))

  key   value_x   value_y
0   A  1.764052       NaN
2   C  0.978738       NaN
4   E       NaN  0.950088
5   F       NaN -0.151357
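As an aside, the single-key LEFT-Excluding case can also be written without merge, by filtering on key membership; a minimal sketch using the frames above:

# Left anti-join: rows of `left` whose key never appears in `right`
left[~left['key'].isin(right['key'])]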

Different names for key columns

If the key columns are named differently—for example, left has keyLeft, and right has keyRight instead of key—then you will have to specify left_on and right_on as arguments instead of on:

left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)

left2

  keyLeft     value
0       A  1.764052
1       B  0.400157
2       C  0.978738
3       D  2.240893

right2

  keyRight     value
0        B  1.867558
1        D -0.977278
2        E  0.950088
3        F -0.151357
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')

  keyLeft   value_x keyRight   value_y
0       B  0.400157        B  1.867558
1       D  2.240893        D -0.977278

Avoiding duplicate key column in output

When merging on keyLeft from left and keyRight from right, if you only want either of the keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.

left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')

    value_x keyRight   value_y
0  0.400157        B  1.867558
1  2.240893        D -0.977278

Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')) and you'll notice keyLeft is missing. You can determine which column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
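Alternatively, since both key columns hold identical values in the inner-join result, a hedged sketch of a simpler route is to merge as usual and drop the redundant column afterwards:

# Keep keyLeft and drop the duplicate keyRight from the result
(left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
      .drop('keyRight', axis=1))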


Merging only a single column from one of the DataFrames

For example, consider

right3 = right.assign(newcol=np.arange(len(right)))
right3
  key     value  newcol
0   B  1.867558       0
1   D -0.977278       1
2   E  0.950088       2
3   F -0.151357       3

If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging:

left.merge(right3[['key', 'newcol']], on='key')

  key     value  newcol
0   B  0.400157       0
1   D  2.240893       1

If you're doing a LEFT OUTER JOIN, a more performant solution would involve map:

# left['newcol'] = left['key'].map(right3.set_index('key')['newcol'])
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

As mentioned, this is similar to, but faster than

left.merge(right3[['key', 'newcol']], on='key', how='left')

  key     value  newcol
0   A  1.764052     NaN
1   B  0.400157     0.0
2   C  0.978738     NaN
3   D  2.240893     1.0

Merging on multiple columns

To join on more than one column, specify a list for on (or left_on and right_on, as appropriate).

left.merge(right, on=['key1', 'key2'] ...)

Or, in the event the names are different,

left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])
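As a self-contained illustration (the frames and the key1/key2 column names here are hypothetical, not the left/right used above):

import pandas as pd

left_mc = pd.DataFrame({'key1': ['A', 'A', 'B'], 'key2': [1, 2, 1], 'lval': [10, 20, 30]})
right_mc = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 1, 2], 'rval': [100, 200, 300]})

# Rows match only when BOTH key1 and key2 agree
left_mc.merge(right_mc, on=['key1', 'key2'])
#   key1  key2  lval  rval
# 0    A     1    10   100
# 1    B     1    30   200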

Other useful merge* operations and functions

This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
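One member of that family is pd.merge_asof, which performs a nearest-key ("as-of") join; a minimal sketch with hypothetical trade/quote frames (both must be sorted on the join key):

import pandas as pd

trades = pd.DataFrame({'time': [1, 5, 10], 'qty': [100, 200, 300]})
quotes = pd.DataFrame({'time': [2, 3, 7], 'bid': [100.0, 101.0, 102.0]})

# For each trade, attach the most recent quote at or before its time
pd.merge_asof(trades, quotes, on='time')
#    time  qty    bid
# 0     1  100    NaN
# 1     5  200  101.0
# 2    10  300  102.0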



Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

*You are here.

孤独岁月 2025-01-28 08:19:15

A supplemental visual view of pd.concat([df0, df1], **kwargs). Note that the meaning of the axis=0 / axis=1 kwarg is not as intuitive here as it is for df.mean() or df.apply(func).

(figure: pd.concat([df0, df1]) along each axis)
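A minimal sketch (df0 and df1 here are hypothetical stand-ins) showing what each axis does:

import pandas as pd

df0 = pd.DataFrame({'a': [1, 2]})
df1 = pd.DataFrame({'a': [3, 4]})

# axis=0 stacks the frames vertically (more rows)
pd.concat([df0, df1], axis=0)   # shape (4, 1)

# axis=1 places them side by side, aligned on the index (more columns)
pd.concat([df0, df1], axis=1)   # shape (2, 2)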

爱的故事 2025-01-28 08:19:15

Joins 101

These animations might explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo

Inner Join

(animation)

Outer Join or Full Join

(animation)

Right Join

(animation)

Left Join

(animation)

颜漓半夏 2025-01-28 08:19:15

In this answer, I will consider practical examples of:

  1. pandas.concat

  2. pandas.DataFrame.merge to merge DataFrames using the index of one and a column of the other.

We will be using different dataframes for each of the cases.


1. pandas.concat

Consider the following DataFrames with the same column names:

  • Price2018 with size (8784, 5)

       Year  Month  Day  Hour  Price
    0  2018      1    1     1   6.74
    1  2018      1    1     2   4.74
    2  2018      1    1     3   3.66
    3  2018      1    1     4   2.30
    4  2018      1    1     5   2.30
    5  2018      1    1     6   2.06
    6  2018      1    1     7   2.06
    7  2018      1    1     8   2.06
    8  2018      1    1     9   2.30
    9  2018      1    1    10   2.30
    
  • Price2019 with size (8760, 5)

       Year  Month  Day  Hour  Price
    0  2019      1    1     1  66.88
    1  2019      1    1     2  66.88
    2  2019      1    1     3  66.00
    3  2019      1    1     4  63.64
    4  2019      1    1     5  58.85
    5  2019      1    1     6  55.47
    6  2019      1    1     7  56.00
    7  2019      1    1     8  61.09
    8  2019      1    1     9  61.01
    9  2019      1    1    10  61.00
    

One can combine them using pandas.concat, by simply

import pandas as pd

frames = [Price2018, Price2019]

df_merged = pd.concat(frames)

This results in a DataFrame of size (17544, 5).
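Note that pd.concat keeps each frame's original row labels by default, so the result carries a duplicated index (0-8783 followed by 0-8759). If that is not wanted, ignore_index=True renumbers the rows, as in this sketch:

# Renumber rows 0..17543 instead of keeping the two original index ranges
df_merged = pd.concat(frames, ignore_index=True)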

If one wants to have a clear picture of what happened, it works like this:

(figure: how concat stacks the frames; source link omitted)


2. pandas.DataFrame.merge

In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.

Let's say one has the DataFrame Geo with 54 columns, one of them being Date, which is of type datetime64[ns]:

                 Date         1         2  ...        51        52        53
0 2010-01-01 00:00:00  0.565919  0.892376  ...  0.593049  0.775082  0.680621
1 2010-01-01 01:00:00  0.358960  0.531418  ...  0.734619  0.480450  0.926735
2 2010-01-01 02:00:00  0.531870  0.221768  ...  0.902369  0.027840  0.398864
3 2010-01-01 03:00:00  0.475463  0.245810  ...  0.306405  0.645762  0.541882
4 2010-01-01 04:00:00  0.954546  0.867960  ...  0.912257  0.039772  0.627696

And the DataFrame Price, which has one price column named Price and whose index corresponds to the dates (Date):

                     Price
Date                      
2010-01-01 00:00:00  29.10
2010-01-01 01:00:00   9.57
2010-01-01 02:00:00   0.00
2010-01-01 03:00:00   0.00
2010-01-01 04:00:00   0.00

In order to merge them, one can use pandas.DataFrame.merge as follows

df_merged = pd.merge(Price, Geo, left_index=True, right_on='Date')

where Geo and Price are the previous dataframes.

This results in the following DataFrame:

   Price                Date         1  ...        51        52        53
0  29.10 2010-01-01 00:00:00  0.565919  ...  0.593049  0.775082  0.680621
1   9.57 2010-01-01 01:00:00  0.358960  ...  0.734619  0.480450  0.926735
2   0.00 2010-01-01 02:00:00  0.531870  ...  0.902369  0.027840  0.398864
3   0.00 2010-01-01 03:00:00  0.475463  ...  0.306405  0.645762  0.541882
4   0.00 2010-01-01 04:00:00  0.954546  ...  0.912257  0.039772  0.627696
纵情客 2025-01-28 08:19:15

This post will go through the following topics:

  • Merging with index under different conditions
    • options for index-based joins: merge, join, concat
    • merging on indexes
    • merging on index of one, column of other
  • effectively using named indexes to simplify merging syntax

BACK TO TOP



Index-based joins

TL;DR

There are a few options, some simpler than others depending on the use case.

  1. DataFrame.merge with left_index and right_index (or left_on and right_on using named indexes)
    • supports inner/left/right/full
    • can only join two at a time
    • supports column-column, index-column, index-index joins
  2. DataFrame.join (join on index)
    • supports inner/left (default)/right/full
    • can join multiple DataFrames at a time
    • supports index-index joins
  3. pd.concat (joins on index)
    • supports inner/full (default)
    • can join multiple DataFrames at a time
    • supports index-index joins

Index to index joins

Setup & Basics

import pandas as pd
import numpy as np

np.random.seed([3, 14])
left = pd.DataFrame(data={'value': np.random.randn(4)}, 
                    index=['A', 'B', 'C', 'D'])    
right = pd.DataFrame(data={'value': np.random.randn(4)},  
                     index=['B', 'D', 'E', 'F'])
left.index.name = right.index.name = 'idxkey'

left
           value
idxkey          
A      -0.602923
B      -0.402655
C       0.302329
D      -0.524349

right
 
           value
idxkey          
B       0.543843
D       0.013135
E      -0.326498
F       1.385076

Typically, an inner join on index would look like this:

left.merge(right, left_index=True, right_index=True)

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

Other joins follow similar syntax.

Notable Alternatives

  1. DataFrame.join defaults to joining on the index, and performs a LEFT OUTER JOIN by default, so how='inner' is necessary here.

     left.join(right, how='inner', lsuffix='_x', rsuffix='_y')
    
              value_x   value_y
     idxkey                    
     B      -0.402655  0.543843
     D      -0.524349  0.013135
    

    Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out:

     left.join(right)
     ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
    

    This is because the column names are the same. It would not be a problem if they were named differently:

     left.rename(columns={'value':'leftvalue'}).join(right, how='inner')
    
             leftvalue     value
     idxkey                     
     B       -0.402655  0.543843
     D       -0.524349  0.013135
    
  2. pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so how='inner' is required here.

     pd.concat([left, right], axis=1, sort=False, join='inner')
    
                value     value
     idxkey                    
     B      -0.402655  0.543843
     D      -0.524349  0.013135
    

    For more information on concat, see this post.


Index to Column joins

To perform an inner join using the index of left and a column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=...:

right2 = right.reset_index().rename({'idxkey' : 'colkey'}, axis=1)
right2
 
  colkey     value
0      B  0.543843
1      D  0.013135
2      E -0.326498
3      F  1.385076

left.merge(right2, left_index=True, right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135

Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.

join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
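For instance, a hedged sketch of that pre-step with join, reusing right2 from above (suffixes are needed because both frames have a value column):

# Align right2 on its key column first, then join index-to-index
left.join(right2.set_index('colkey'), how='inner', lsuffix='_x', rsuffix='_y')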


Effectively using Named Index [pandas >= 0.23]

If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to pass the index name to on (or to left_on and right_on as necessary).

left.merge(right, on='idxkey')

         value_x   value_y
idxkey                    
B      -0.402655  0.543843
D      -0.524349  0.013135

For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left:

left.merge(right2, left_on='idxkey', right_on='colkey')

    value_x colkey   value_y
0 -0.402655      B  0.543843
1 -0.524349      D  0.013135


Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

* you are here

尾戒 2025-01-28 08:19:15

This post will go through the following topics:

  • how to correctly generalize to multiple DataFrames (and why merge has shortcomings here)
  • merging on unique keys
  • merging on non-unique keys

BACK TO TOP



Generalizing to multiple DataFrames

Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls:

df1.merge(df2, ...).merge(df3, ...)

However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
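One hedged way to generalise the chain to a list of arbitrary length (assuming every frame shares a 'key' column, as with the list dfs in the setup below) is functools.reduce:

from functools import reduce

# Fold a list of DataFrames into one via successive inner merges on 'key'
merged = reduce(lambda l, r: l.merge(r, on='key'), dfs)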

Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys. First, the setup.

# Setup.
import numpy as np
import pandas as pd

np.random.seed(0)
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': np.random.randn(4)})    
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': np.random.randn(4)})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': np.ones(4)})
dfs = [A, B, C] 

# Note: the "key" column values are unique, so the index is unique.
A2 = A.set_index('key')
B2 = B.set_index('key')
C2 = C.set_index('key')

dfs2 = [A2, B2, C2]

Multiway merge on unique keys

If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.

# Merge on `key` column. You'll need to set the index before concatenating
pd.concat(
    [df.set_index('key') for df in dfs], axis=1, join='inner'
).reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Merge on `key` index.
pd.concat(dfs2, axis=1, sort=False, join='inner')

       valueA    valueB  valueC
key                            
D    2.240893 -0.977278     1.0

Omit join='inner' for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).


Multiway merge on keys with duplicates

concat is fast, but has its shortcomings. It cannot handle duplicates.

A3 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'D'], 'valueA': np.random.randn(5)})
pd.concat([df.set_index('key') for df in [A3, B, C]], axis=1, join='inner')
ValueError: Shape of passed values is (3, 4), indices imply (3, 2)

In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).

# Join on `key` column. Set as the index first.
# For inner join. For left join, omit the "how" argument.
A.set_index('key').join([B2, C2], how='inner').reset_index()

  key    valueA    valueB  valueC
0   D  2.240893 -0.977278     1.0

# Join on `key` index.
A3.set_index('key').join([B2, C2], how='inner')

       valueA    valueB  valueC
key                            
D    1.454274 -0.977278     1.0
D    0.761038 -0.977278     1.0


Continue Reading

Jump to other topics in Pandas Merging 101 to continue learning:

* you are here

孤者何惧 2025-01-28 08:19:15

Pandas at the moment does not support inequality joins within the merge syntax; one option is the conditional_join function from pyjanitor (I am a contributor to this library):

# pip install pyjanitor
import pandas as pd
import janitor 

left.conditional_join(right, ('value', 'value', '>'))

   left           right
    key     value   key     value
0     A  1.764052     D -0.977278
1     A  1.764052     F -0.151357
2     A  1.764052     E  0.950088
3     B  0.400157     D -0.977278
4     B  0.400157     F -0.151357
5     C  0.978738     D -0.977278
6     C  0.978738     F -0.151357
7     C  0.978738     E  0.950088
8     D  2.240893     D -0.977278
9     D  2.240893     F -0.151357
10    D  2.240893     E  0.950088
11    D  2.240893     B  1.867558

left.conditional_join(right, ('value', 'value', '<'))

  left           right
   key     value   key     value
0    A  1.764052     B  1.867558
1    B  0.400157     E  0.950088
2    B  0.400157     B  1.867558
3    C  0.978738     B  1.867558

The columns are passed as a variable argument of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). In the example above, a MultiIndex column is returned because of overlaps in the column names.

Performance-wise, this is better than a naive cross join:

np.random.seed(0)
dd = pd.DataFrame({'value':np.random.randint(100000, size=50_000)})
df = pd.DataFrame({'start':np.random.randint(100000, size=1_000), 
                   'end':np.random.randint(100000, size=1_000)})

dd.head()

   value
0  68268
1  43567
2  42613
3  45891
4  21243

df.head()

   start    end
0  71915  47005
1  64284  44913
2  13377  96626
3  75823  38673
4  29151    575


%%timeit
out = df.merge(dd, how='cross')
out.loc[(out.start < out.value) & (out.end > out.value)]
5.12 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
280 ms ± 5.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'), use_numba=True)
124 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

out = df.merge(dd, how='cross')
out = out.loc[(out.start < out.value) & (out.end > out.value)]
A = df.conditional_join(dd, ('start', 'value' ,'<'), ('end', 'value' ,'>'))
columns = A.columns.tolist()
A = A.sort_values(columns, ignore_index = True)
out = out.sort_values(columns, ignore_index = True)

A.equals(out)
True

Depending on the data size, you could get more performance when an equi join is present. In this case, the pandas merge function is used internally, but construction of the final DataFrame is delayed until the non-equi joins are computed. Let's look at the data from here:

import pandas as pd
import numpy as np
import random
import datetime

def random_dt_bw(start_date,end_date):
    days_between = (end_date - start_date).days
    random_num_days = random.randrange(days_between)
    random_dt = start_date + datetime.timedelta(days=random_num_days)
    return random_dt

def generate_data(n=1000):
    items = [f"i_{x}" for x in range(n)]
    start_dates = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(n)]
    end_dates = [x + datetime.timedelta(days=random.randint(1,10)) for x in start_dates]
    
    offerDf = pd.DataFrame({"Item":items,
                            "StartDt":start_dates,
                            "EndDt":end_dates})
    
    transaction_items = [f"i_{random.randint(0,n)}" for x in range(5*n)]
    transaction_dt = [random_dt_bw(datetime.date(2020,1,1),datetime.date(2020,9,1)) for x in range(5*n)]
    sales_amt = [random.randint(0,1000) for x in range(5*n)]
    
    transactionDf = pd.DataFrame({"Item":transaction_items,"TransactionDt":transaction_dt,"Sales":sales_amt})

    return offerDf,transactionDf

offerDf,transactionDf = generate_data(n=100000)


offerDf = (offerDf
           .assign(StartDt = offerDf.StartDt.astype(np.datetime64), 
                   EndDt = offerDf.EndDt.astype(np.datetime64)
                  )
           )

transactionDf = transactionDf.assign(TransactionDt = transactionDf.TransactionDt.astype(np.datetime64))

# you can get more performance when using ints/datetimes
# in the equi join, compared to strings

offerDf = offerDf.assign(Itemr = offerDf.Item.str[2:].astype(int))

transactionDf = transactionDf.assign(Itemr = transactionDf.Item.str[2:].astype(int))

transactionDf.head()
      Item TransactionDt  Sales  Itemr
0  i_43407    2020-05-29    692  43407
1  i_95044    2020-07-22    964  95044
2  i_94560    2020-01-09    462  94560
3  i_11246    2020-02-26    690  11246
4  i_55974    2020-03-07    219  55974

offerDf.head()
  Item    StartDt      EndDt  Itemr
0  i_0 2020-04-18 2020-04-19      0
1  i_1 2020-02-28 2020-03-07      1
2  i_2 2020-03-28 2020-03-30      2
3  i_3 2020-08-03 2020-08-13      3
4  i_4 2020-05-26 2020-06-04      4

# merge on integers ... usually faster
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]

# merge on strings
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]

# merge on integers
cond_join_int = (transactionDf
                 .conditional_join(
                     offerDf, 
                     ('Itemr', 'Itemr', '=='), 
                     ('TransactionDt', 'StartDt', '>='), 
                     ('TransactionDt', 'EndDt', '<=')
                  )
                 )

# merge on strings
cond_join_str = (transactionDf
                 .conditional_join(
                     offerDf, 
                     ('Item', 'Item', '=='), 
                     ('TransactionDt', 'StartDt', '>='), 
                     ('TransactionDt', 'EndDt', '<=')
                  )
                )

%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Item')
classic_str = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
292 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
merged_df = pd.merge(offerDf,transactionDf,on='Itemr')
classic_int = merged_df[(merged_df['TransactionDt']>=merged_df['StartDt']) &
                        (merged_df['TransactionDt']<=merged_df['EndDt'])]
253 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit 
(transactionDf
.conditional_join(
    offerDf, 
    ('Item', 'Item', '=='), 
    ('TransactionDt', 'StartDt', '>='), 
    ('TransactionDt', 'EndDt', '<=')
   )
)
256 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit 
(transactionDf
.conditional_join(
    offerDf, 
    ('Itemr', 'Itemr', '=='), 
    ('TransactionDt', 'StartDt', '>='), 
    ('TransactionDt', 'EndDt', '<=')
   )
)
71.8 ms ± 2.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# check that both dataframes are equal
cols = ['Item', 'TransactionDt', 'Sales', 'Itemr_y','StartDt', 'EndDt', 'Itemr_x']
cond_join_str = cond_join_str.drop(columns=('right', 'Item')).set_axis(cols, axis=1)

(cond_join_str
.sort_values(cond_join_str.columns.tolist())
.reset_index(drop=True)
.reindex(columns=classic_str.columns)
.equals(
    classic_str
    .sort_values(classic_str.columns.tolist())
    .reset_index(drop=True)
))

True
霓裳挽歌倾城醉 2025-01-28 08:19:15

I think you should include this in your explanation, as it is a relevant merge that I see fairly often; I believe it is termed a cross join. It occurs when two DataFrames share no columns, and it simply merges the two side by side:

The setup:

import pandas as pd

names1 = [{'A':'Jack', 'B':'Jill'}]
names2 = [{'C':'Tommy', 'D':'Tammy'}]

df1 = pd.DataFrame(names1)
df2 = pd.DataFrame(names2)
df_merged = pd.merge(df1.assign(X=1), df2.assign(X=1), on='X').drop('X', axis=1)

This creates a dummy X column, merges on X, and then drops it to produce

df_merged:

      A     B      C      D
0  Jack  Jill  Tommy  Tammy
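For what it's worth, pandas 1.2+ supports this directly with how='cross' (the same option used in the timings in an earlier answer), so the dummy column is no longer needed:

# Equivalent cross join without the dummy column (pandas >= 1.2)
df_merged = pd.merge(df1, df2, how='cross')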