- How can I perform a (INNER | (LEFT | RIGHT | FULL) OUTER) JOIN with pandas?
- How do I add NaNs for missing rows after a merge?
- How do I get rid of NaNs after merging?
- Can I merge on the index?
- How do I merge multiple DataFrames?
- Cross join with pandas
- merge? join? concat? update? Who? What? Why?!
... and more. I've seen these recurring questions asking about various facets of the pandas merge functionality. Most of the information regarding merge and its various use cases today is fragmented across dozens of badly worded, unsearchable posts. The aim here is to collate some of the more important points for posterity.
This Q&A is meant to be the next installment in a series of helpful user guides on common pandas idioms (see this post on pivoting, and this post on concatenation, which I will be touching on, later).
Please note that this post is not meant to be a replacement for the documentation, so please read that as well! Some of the examples are taken from there.
Table of Contents
For ease of access.
This post aims to give readers a primer on SQL-flavored merging with Pandas, how to use it, and when not to use it.
In particular, here's what this post will go through:
The basics - types of joins (LEFT, RIGHT, OUTER, INNER)
What this post (and other posts by me on this thread) will not go through:
Enough talk - just show me how to use merge!
Setup & Basics
For the sake of simplicity, the key column has the same name (for now).
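The original setup code did not survive on this page, so here is a minimal stand-in. It assumes two frames named `left` and `right` with a shared `key` column and one `value` column each; the random values and the seed are illustrative choices, picked only so that "B" and "D" are the keys common to both frames.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the illustrative values are reproducible

# Assumed example frames: only keys "B" and "D" appear in both.
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})

print(left)
print(right)
```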
To perform an INNER JOIN, call merge on the left DataFrame, specifying the right DataFrame and the join key (at the very least) as arguments. This returns only rows from left and right which share a common key (in this example, "B" and "D").
A LEFT OUTER JOIN, or LEFT JOIN, can be performed by specifying how='left'. Carefully note the placement of NaNs in the result. If you specify how='left', then only keys from left are used, and missing data from right is replaced by NaN.
Similarly, for a RIGHT OUTER JOIN, or RIGHT JOIN, specify how='right'. Here, keys from right are used, and missing data from left is replaced by NaN.
Finally, for the FULL OUTER JOIN, specify how='outer'. This uses the keys from both frames, and NaNs are inserted for missing rows in both.
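The four join types above can be sketched side by side. The frames and their values are assumptions made for illustration (fixed numbers rather than random ones, so the NaN placement is easy to follow).

```python
import pandas as pd

# Assumed example frames sharing the keys "B" and "D".
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

inner = left.merge(right, on='key')                    # how='inner' is the default
left_join = left.merge(right, on='key', how='left')    # NaN in value_y for A and C
right_join = left.merge(right, on='key', how='right')  # NaN in value_x for E and F
outer = left.merge(right, on='key', how='outer')       # NaN on both sides

for name, frame in [('inner', inner), ('left', left_join),
                    ('right', right_join), ('outer', outer)]:
    print(name)
    print(frame)
```

Because both frames have a column called `value`, merge appends its default suffixes and the result has `value_x` (from `left`) and `value_y` (from `right`).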
The documentation summarizes these various merges nicely:
Other JOINs - LEFT-Excluding, RIGHT-Excluding, and FULL-Excluding/ANTI JOINs
If you need LEFT-Excluding JOINs and RIGHT-Excluding JOINs, you can do them in two steps.
For a LEFT-Excluding JOIN, start by performing a LEFT OUTER JOIN and then filter to rows coming from left only (excluding everything from the right).
Similarly, for a RIGHT-Excluding JOIN, perform a RIGHT OUTER JOIN and keep only the rows coming from right.
Lastly, if you are required to do a merge that only retains keys from the left or right, but not both (in other words, performing an ANTI-JOIN), you can do this in similar fashion.
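One way to sketch the two-step pattern is with merge's `indicator=True` option, which adds a `_merge` column saying where each row came from (`left_only`, `right_only` or `both`). The frames are assumed examples, and boolean filtering is just one of several ways to do the second step.

```python
import pandas as pd

# Assumed example frames sharing the keys "B" and "D".
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# LEFT-Excluding JOIN: left join, then keep rows found only in left.
m = left.merge(right, on='key', how='left', indicator=True)
left_excl = m[m['_merge'] == 'left_only'].drop(columns='_merge')

# RIGHT-Excluding JOIN: right join, then keep rows found only in right.
m = left.merge(right, on='key', how='right', indicator=True)
right_excl = m[m['_merge'] == 'right_only'].drop(columns='_merge')

# ANTI-JOIN: outer join, then drop the rows present in both frames.
m = left.merge(right, on='key', how='outer', indicator=True)
anti = m[m['_merge'] != 'both'].drop(columns='_merge')

print(left_excl, right_excl, anti, sep='\n\n')
```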
Different names for key columns
If the key columns are named differently (for example, left has keyLeft, and right has keyRight instead of key), then you will have to specify left_on and right_on as arguments instead of on.
Avoiding duplicate key column in output
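To make the difference concrete, here is a sketch of both variants. The names `left2` and `right2` follow the text; their contents and the helper name `left3` are assumptions. The first merge keeps both key columns, while the second sets `keyLeft` as the index beforehand, so only `keyRight` survives.

```python
import pandas as pd

# Assumed contents; only the frame and column names come from the text.
left2 = pd.DataFrame({'keyLeft': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
right2 = pd.DataFrame({'keyRight': ['B', 'D', 'E', 'F'], 'value': [5.0, 6.0, 7.0, 8.0]})

# Differently named keys: both keyLeft and keyRight appear in the output.
both_keys = left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
print(both_keys.columns.tolist())

# Set keyLeft as the index first, then join that index against keyRight.
left3 = left2.set_index('keyLeft')
one_key = left3.merge(right2, left_index=True, right_on='keyRight')
print(one_key.columns.tolist())
```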
When merging on keyLeft from left and keyRight from right, if you only want either keyLeft or keyRight (but not both) in the output, you can start by setting the index as a preliminary step.
Contrast this with the output of the command just before (that is, the output of left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')), and you'll notice keyLeft is missing. You can figure out what column to keep based on which frame's index is set as the key. This may matter when, say, performing some OUTER JOIN operation.
Merging only a single column from one of the DataFrames
For example, consider a DataFrame that has an extra column, "newcol". If you are required to merge only "newcol" (without any of the other columns), you can usually just subset columns before merging.
If you're doing a LEFT OUTER JOIN, a more performant solution would involve map. This is similar to, but faster than, the equivalent LEFT OUTER JOIN with merge.
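A sketch of both approaches follows. The frame name `right3` and all values are assumptions; the speed claim comes from the text and is not measured here. Note that the map variant relies on the key being unique in the frame that supplies "newcol".

```python
import pandas as pd

left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1.0, 2.0, 3.0, 4.0]})
# Hypothetical frame carrying the extra column "newcol".
right3 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'],
                       'value': [5.0, 6.0, 7.0, 8.0],
                       'newcol': [0, 1, 2, 3]})

# Subset to the key plus "newcol" before merging (inner join by default).
subset_merge = left.merge(right3[['key', 'newcol']], on='key')

# LEFT OUTER JOIN via map: look each key up in a Series indexed by key.
mapped = left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))

# The merge-based equivalent of the map version.
via_merge = left.merge(right3[['key', 'newcol']], on='key', how='left')

print(subset_merge, mapped, via_merge, sep='\n\n')
```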
Merging on multiple columns
To join on more than one column, specify a list for on. Or, in the event the names are different, specify lists for left_on and right_on.
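As an illustration, the sketch below joins on two columns, first with matching names and then with different names. All frame and column names here (`key1`, `key2`, `k1`, `k2`, `lval`, `rval`) are hypothetical.

```python
import pandas as pd

# Hypothetical frames keyed by the pair (key1, key2).
left = pd.DataFrame({'key1': ['A', 'A', 'B'], 'key2': [1, 2, 1], 'lval': [10, 20, 30]})
right = pd.DataFrame({'key1': ['A', 'B', 'B'], 'key2': [1, 1, 2], 'rval': [100, 200, 300]})

# Same names on both sides: pass a list to `on`.
same_names = left.merge(right, on=['key1', 'key2'])

# Different names: pass lists of equal length to left_on and right_on.
right_renamed = right.rename(columns={'key1': 'k1', 'key2': 'k2'})
diff_names = left.merge(right_renamed, left_on=['key1', 'key2'], right_on=['k1', 'k2'])

print(same_names)
print(diff_names)
```

Only the pairs (A, 1) and (B, 1) occur in both frames, so each result has two rows.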
Other useful merge* operations and functions
- Merging a DataFrame with Series on index: See this answer.
- Besides merge, DataFrame.update and DataFrame.combine_first are also used in certain cases to update one DataFrame with another.
- pd.merge_ordered is a useful function for ordered JOINs.
- pd.merge_asof (read: merge_asOf) is useful for approximate joins.
This section only covers the very basics, and is designed to only whet your appetite. For more examples and cases, see the documentation on merge, join, and concat as well as the links to the function specifications.
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins *
Index-based joins
Generalizing to multiple DataFrames
Cross join
*You are here.
A supplemental visual view of pd.concat([df0, df1], kwargs). Notice that the meaning of the kwarg axis=0 or axis=1 is not as intuitive as it is for df.mean() or df.apply(func).
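Since the picture did not survive, a small sketch may help. With assumed two-row frames `df0` and `df1`, `axis=0` stacks the frames on top of each other (more rows), while `axis=1` places them side by side (more columns).

```python
import pandas as pd

# Assumed example frames with identical column names.
df0 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df1 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

stacked = pd.concat([df0, df1], axis=0)       # rows of df1 appended below df0
side_by_side = pd.concat([df0, df1], axis=1)  # columns of df1 placed next to df0

print(stacked.shape)       # 4 rows, 2 columns
print(side_by_side.shape)  # 2 rows, 4 columns
```

Note that the original row labels are kept, so `stacked` has the index 0, 1, 0, 1 unless `ignore_index=True` is passed.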
Joins 101
These animations might explain the joins better visually.
Credits: Garrick Aden-Buie tidyexplain repo
Inner Join
Outer Join or Full Join
Right Join
Left Join
In this answer, I will consider practical examples of:
- pandas.concat
- pandas.DataFrame.merge to merge dataframes from the index of one and the column of another one.
We will be using different dataframes for each of the cases.
1. pandas.concat
Consider the following DataFrames with the same column names:
- Price2018 with size (8784, 5)
- Price2019 with size (8760, 5)
One can combine them using pandas.concat, which results in a DataFrame with size (17544, 5).
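A minimal sketch of that concatenation, using placeholder data. Only the frame names and shapes come from the text; the column names `c1` to `c5`, the cell values and the names `frames` and `df_merged` are assumptions.

```python
import numpy as np
import pandas as pd

cols = ['c1', 'c2', 'c3', 'c4', 'c5']  # hypothetical column names

# Placeholder frames with the shapes quoted in the text.
Price2018 = pd.DataFrame(np.zeros((8784, 5)), columns=cols)
Price2019 = pd.DataFrame(np.ones((8760, 5)), columns=cols)

# Stack the two frames row-wise (axis=0 is the default).
frames = [Price2018, Price2019]
df_merged = pd.concat(frames)

print(df_merged.shape)  # (17544, 5)
```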
If one wants to have a clear picture of what happened, it works like this
(Source)
2. pandas.DataFrame.merge
In this section, we will consider a specific case: merging the index of one dataframe and the column of another dataframe.
Let's say one has the dataframe Geo with 54 columns, one of which is Date, of type datetime64[ns], and the dataframe Price, which has one column with the price, named Price, and whose index corresponds to the dates (Date).
In order to merge them, one can use pandas.DataFrame.merge with Geo and Price as the two dataframes, matching the Date column of Geo against the index of Price. The result is a single dataframe with the columns of both.
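The exact call did not survive on this page. One way to write such a merge is sketched below, with a much smaller stand-in for Geo (three columns instead of 54). The columns `lat` and `lon`, the dates and the prices are all made up for illustration.

```python
import pandas as pd

dates = pd.date_range('2019-01-01', periods=3, freq='D')

# Stand-in for Geo: a datetime64[ns] column named Date plus two made-up columns.
Geo = pd.DataFrame({'Date': dates, 'lat': [38.7, 38.8, 38.9], 'lon': [-9.1, -9.2, -9.3]})

# Stand-in for Price: a single Price column, with the dates as the index.
Price = pd.DataFrame({'Price': [10.0, 11.0, 12.0]}, index=dates)

# Match the Date column of Geo against the index of Price.
df = Geo.merge(Price, left_on='Date', right_index=True)

print(df)
```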
This post will go through the following topics:
- merge, join, concat
Index-based joins
TL;DR
Index to index joins
Setup & Basics
Typically, an inner join on index would look like this:
Other joins follow similar syntax.
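A minimal sketch of such an index-to-index inner join, assuming two frames whose index is named `idxkey` (a hypothetical name) and which share the labels "B" and "D". For comparison, it also computes the same join with DataFrame.join and pd.concat.

```python
import pandas as pd

# Assumed example frames, keyed by a named index instead of a column.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right = pd.DataFrame({'value': [5.0, 6.0, 7.0, 8.0]},
                     index=pd.Index(['B', 'D', 'E', 'F'], name='idxkey'))

# merge: use the index of both frames as the join key.
via_merge = left.merge(right, left_index=True, right_index=True)

# join: index-based by default, but a LEFT join unless how= is given;
# suffixes are needed because both frames have a column called "value".
via_join = left.join(right, how='inner', lsuffix='_x', rsuffix='_y')

# concat: aligns on the index along axis=1; join='inner' keeps shared labels only.
via_concat = pd.concat([left, right], axis=1, join='inner')

print(via_merge, via_join, via_concat, sep='\n\n')
```

For the other join types, change `how` in the merge and join calls. concat only offers `join='inner'` or `join='outer'`.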
Notable Alternatives
- DataFrame.join defaults to joins on the index. DataFrame.join does a LEFT OUTER JOIN by default, so how='inner' is necessary here. Note that I needed to specify the lsuffix and rsuffix arguments since join would otherwise error out, because the column names are the same. This would not be a problem if they were differently named.
- pd.concat joins on the index and can join two or more DataFrames at once. It does a full outer join by default, so join='inner' is required here. For more information on concat, see this post.
Index to Column joins
To perform an inner join using index of left, column of right, you will use DataFrame.merge with a combination of left_index=True and right_on=....
Other joins follow a similar structure. Note that only merge can perform index to column joins. You can join on multiple columns, provided the number of index levels on the left equals the number of columns on the right.
join and concat are not capable of mixed merges. You will need to set the index as a pre-step using DataFrame.set_index.
Effectively using Named Index [pandas >= 0.23]
If your index is named, then from pandas >= 0.23, DataFrame.merge allows you to specify the index name to on (or left_on and right_on as necessary).
For the previous example of merging with the index of left, column of right, you can use left_on with the index name of left.
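The index-to-column join and its named-index shorthand can be sketched together. The names `idxkey` (index name of `left`), `colkey` (key column of `right2`) and the values are assumptions made for this example.

```python
import pandas as pd

# left is keyed by a named index; right2 keeps its key in an ordinary column.
left = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.Index(['A', 'B', 'C', 'D'], name='idxkey'))
right2 = pd.DataFrame({'colkey': ['B', 'D', 'E', 'F'],
                       'value': [5.0, 6.0, 7.0, 8.0]})

# Index of left against a column of right.
by_flag = left.merge(right2, left_index=True, right_on='colkey')

# pandas >= 0.23: refer to the index of left by its name instead.
by_name = left.merge(right2, left_on='idxkey', right_on='colkey')

print(by_flag)
print(by_name)
```

Both calls match the same rows (keys "B" and "D"); the row labels of the two results may differ.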
This post will go through the following topics:
- how to generalize to multiple DataFrames (merge has shortcomings here)
Generalizing to multiple DataFrames
Oftentimes, the situation arises when multiple DataFrames are to be merged together. Naively, this can be done by chaining merge calls. However, this quickly gets out of hand for many DataFrames. Furthermore, it may be necessary to generalise for an unknown number of DataFrames.
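For three frames, the naive chaining looks like the sketch below. The frames `A`, `B`, `C` and their columns are hypothetical; only the key "D" occurs in all three.

```python
import pandas as pd

# Hypothetical frames that share a `key` column.
A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1, 2, 3, 4]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5, 6, 7, 8]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [9, 10, 11, 12]})

# One merge call per additional frame; each call is an inner join on `key`.
chained = A.merge(B, on='key').merge(C, on='key')

print(chained)
```

Every extra frame needs another `.merge(...)` in the chain, which is what becomes unwieldy as the number of frames grows.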
Here I introduce pd.concat for multi-way joins on unique keys, and DataFrame.join for multi-way joins on non-unique keys.
Multiway merge on unique keys
If your keys (here, the key could either be a column or an index) are unique, then you can use pd.concat. Note that pd.concat joins DataFrames on the index.
Pass join='inner' for an INNER JOIN; omit it for a FULL OUTER JOIN. Note that you cannot specify LEFT or RIGHT OUTER joins (if you need these, use join, described below).
Multiway merge on keys with duplicates
concat is fast, but has its shortcomings. It cannot handle duplicates.
In this situation, we can use join since it can handle non-unique keys (note that join joins DataFrames on their index; it calls merge under the hood and does a LEFT OUTER JOIN unless otherwise specified).
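Both multi-way approaches can be sketched as follows. The frames are hypothetical: `A`, `B` and `C` have unique keys, while `D` repeats the key "D", so it stands in for the duplicate-key case.

```python
import pandas as pd

A = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'valueA': [1, 2, 3, 4]})
B = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'valueB': [5, 6, 7, 8]})
C = pd.DataFrame({'key': ['D', 'E', 'J', 'C'], 'valueC': [9, 10, 11, 12]})
D = pd.DataFrame({'key': ['D', 'D', 'E'], 'valueD': [13, 14, 15]})  # duplicate key

# Unique keys: move the key into the index, then concat along axis=1.
frames = [df.set_index('key') for df in (A, B, C)]
inner = pd.concat(frames, axis=1, join='inner')  # keys present in every frame
outer = pd.concat(frames, axis=1)                # FULL OUTER JOIN (the default)

# Duplicate keys: join accepts a list of frames and joins on the index.
joined = A.set_index('key').join([B.set_index('key'), D.set_index('key')],
                                 how='inner')    # omit how= for a LEFT join

print(inner, outer, joined, sep='\n\n')
```

The list form of join also works for any number of frames, which covers the case where the number of DataFrames is not known in advance.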
Pandas at the moment does not support inequality joins within the merge syntax; one option is the conditional_join function from pyjanitor (I am a contributor to this library).
The columns are passed as a variable argument of tuples, each tuple comprising a column from the left dataframe, a column from the right dataframe, and the join operator, which can be any of (>, <, >=, <=, !=). When the column names overlap, a MultiIndex column is returned.
Performance-wise, this is better than a naive cross join:
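For reference, the naive baseline mentioned here can be written in plain pandas: build every pair of rows with a cross join, then filter on the inequality. This sketch is not the pyjanitor implementation. The frames and column names are hypothetical, and `how='cross'` requires pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical data: keep rows where value_1 lies within [value_2A, value_2B].
df1 = pd.DataFrame({'id': [1, 2, 3], 'value_1': [2, 5, 7]})
df2 = pd.DataFrame({'value_2A': [0, 3, 7], 'value_2B': [1, 5, 9]})

# Step 1: every row of df1 paired with every row of df2 (3 x 3 = 9 rows).
crossed = df1.merge(df2, how='cross')

# Step 2: keep only the pairs that satisfy both inequalities.
result = crossed[(crossed['value_1'] >= crossed['value_2A'])
                 & (crossed['value_1'] <= crossed['value_2B'])]

print(result)
```

The intermediate frame grows with the product of the two row counts, which is why this approach is expensive on larger data.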
Depending on the data size, you could get more performance when an equi join is present. In this case, the pandas merge function is used, but the final data frame is delayed until the non-equi joins are computed.
I think you should include this in your explanation, as it is a relevant merge that I see fairly often, which I believe is termed a cross-join. This is a merge that occurs when the dfs share no columns, and it simply merges 2 dfs side-by-side.
The approach creates a dummy X column in each df, merges on X, and then drops it to produce df_merged.
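A minimal sketch of the dummy-column approach, with made-up frames. Because every row gets the same X value, each row of `df1` is paired with each row of `df2`. With single-row frames that looks like placing them side by side; in general it yields all combinations.

```python
import pandas as pd

# Made-up frames with no columns in common.
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': ['x', 'y', 'z']})

# Add a constant dummy column X to both, merge on it, then drop it again.
df_merged = df1.assign(X=1).merge(df2.assign(X=1), on='X').drop(columns='X')

print(df_merged)
```

On pandas 1.2 or later, `df1.merge(df2, how='cross')` produces the same pairing without the dummy column.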