如何比较Python中的2个数据帧并突出显示差异？

发布于 2025-02-10 03:14:45 字数 1845 浏览 2 评论 0原文

我正在尝试比较两个文件，一个是XLS，另一个是CSV格式。

file1.xlsx（不是实际数据）

  Title Flag  Price  total    ...more columns
0     A    Y     12    300
1     B    N     15    700
2     C    N     18   1000
..
..
more rows

file2.csv（不是实际的数据），

  Title Flag  Price  total    ...more columns
0     E    Y      7    234
1     B    N     16    600
2     A    Y     12    300
3     C    N     17   1000
..
..
more rows

我使用了pandas并将这些文件移至数据框架。文件中没有唯一的列（制作ID），并且有700K记录可比较。我需要将文件1与文件2进行比较并显示差异。我尝试过几件事，但我没有像预期的那样获得异常值。

如果我将合并函数如下所示，我将获得仅用于文件1的值的输出。

diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')

输出我将获得

  Title Attention_Needed  Price  total
1     B                N     15    700
2     C                N     18   1000

此输出并未显示正确的diff作为标题“ e”的记录，

我也缺少我尝试使用熊猫合并

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

＆amp; 上述输出是

  Title Flag  Price  total       Exist
0     A    Y     12    300        both
1     B    N     15    700   left_only
2     C    N     18   1000   left_only
3     E    Y      7    234  right_only
4     B    N     16    600  right_only
5     C    N     17   1000  right_only

上述输出的问题是，它正在显示两个数据帧的记录，如果每个数据框中有1000个记录，将非常困难。

输出我正在寻找（差异），通过添加额外的列（“注释”），并提供消息作为匹配，确切的差异，新等，或在类似的行上，

  Title Flag  Price  total       Comments
0     A    Y     12    300        matching
1     B    N     15    700   Price, total different
2     C    N     18   1000   Price different
3     E    Y      7    234    New record

如果无法进行上述输出，然后，请建议是否有其他方法解决此问题。

PS：这是我在这里的第一个问题，所以请让我知道您是否需要更多详细信息。

原文

I am trying to compare 2 files one is in xls and other is in csv format.

File1.xlsx (not actual data)

  Title Flag  Price  total    ...more columns
0     A    Y     12    300
1     B    N     15    700
2     C    N     18   1000
..
..
more rows

File2.csv (not actual data)

  Title Flag  Price  total    ...more columns
0     E    Y      7    234
1     B    N     16    600
2     A    Y     12    300
3     C    N     17   1000
..
..
more rows

I used Pandas and moved those files to data frame. There is no unique columns(to make id) in the files and there are 700K records to compare. I need to compare File 1 with File 2 and show the differences. I have tried few things but I am not getting the outliers as expected.

If I use merge function as below, I am getting output with the values only for File 1.

diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')

output I am getting

  Title Attention_Needed  Price  total
1     B                N     15    700
2     C                N     18   1000

This output is not showing the correct diff as record with Title 'E' is missing

I also tried using panda merge

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

& output for above was

  Title Flag  Price  total       Exist
0     A    Y     12    300        both
1     B    N     15    700   left_only
2     C    N     18   1000   left_only
3     E    Y      7    234  right_only
4     B    N     16    600  right_only
5     C    N     17   1000  right_only

Problem with above output is it is showing records from both the data frames and it will be very difficult if there are 1000 of records in each data frame.

Output I am looking for (for differences) by adding extra column("Comments") and give message as matching, exact difference, new etc. or on the similar lines

  Title Flag  Price  total       Comments
0     A    Y     12    300        matching
1     B    N     15    700   Price, total different
2     C    N     18   1000   Price different
3     E    Y      7    234    New record

If above output can not be possible, then please suggest if there is any other way to solve this.

PS: This is my first question here, so please let me know if you need more details here.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

一杆小烟枪 2025-02-17 03:14:45

DF1中的行DF2中在DF2中不可用的行，

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

DF2中不可用，

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']

如果您通过行区分，则

pd.concat([df1,df2]).drop_duplicates(keep=False)

如果每个DF都有相同的列，则应单独比较每个列

for col in data.columns:
  set(df1.col).symmetric_difference(df2.col)
  # WARNING: this way of getting column diffs likely won't keep row order
  # new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]

Rows in DF1 Which Are Not Available in DF2

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='left_only']

Rows in DF2 Which Are Not Available in DF1

df = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x : x['_merge']=='right_only']

If you're differentiating by row not column

pd.concat([df1,df2]).drop_duplicates(keep=False)

If each df has the same columns and each column should be compared individually

for col in data.columns:
  set(df1.col).symmetric_difference(df2.col)
  # WARNING: this way of getting column diffs likely won't keep row order
  # new row order will be [unique_elements_from_df1_REVERSED] concat [unique_elements_from_df2_REVERSED]

回复收藏 0 原文

段念尘 2025-02-17 03:14:45

假设DF1（左）是我们被认为是原始记录的“真理之源”。

运行后，

diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')

将输出拆分为2 df。

df1 = diff_df[diff_df["Exist"] in ["both", "left_only"]]
df2 = diff_df[diff_df["Exist"] == "right_only"]

现在，如果您从DF1删除“存在”行，您将拥有评论“匹配”的记录。

假设您将“注释”列添加到DF1

您可以说DF2中的所有内容都是新记录，但这会忽略“价格/总体不同”。

如果您真的想要差异评论，那么现在是一个棘手的位置，“如何”真正取决于最重要的订单列（title＆gt; flag＆gt; ...），

在您拥有一个之后，它们有多重要（加权系统）确定的Wighting系统，您需要一种“评分”方法，该方法将比较两行，以查看它们基于确定的列排名的相似之处。

# distributes weight so first is heaviest, last is lightest, total weight = 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
    weights = [0 for col in l]
    total = 100
    while total > 0:
        for i, e in enumerate(l):
            for j in range(i+1):
                weights[j] += 1
                total -= 1
    return weights


def scoreRows(row1, row2):
    s = 0
    for i, colName in enumerate(colRank):
        if row1[colName] == row2[colName]:
            s += weights[i]


colRank = ['title', 'flag'] 
weights = getWeights(colRank)

假设只有这2个问题，其余的则被视为对原始行的“修改”，

即如果DF2中的一行对于DF1中的任何行没有匹配的标题或标志，则该行是新

记录使一排新记录完全取决于您。

考虑它的另一种思考方式是，您需要确定是什么使DF2中的某一行与DF1中的某些行不同，而DF1中的一排则不是其他行，

如果DF1中有2行
行1：[1，2，3，4]
row2：[1，6，3，7]

，您想将这行与该DF进行比较
[1、6、5、4]

这一行与两个元素具有相同的第一个元素，第二个元素与Row2相同，而Row1的第四个元素相同。

那么哪一行与它有所不同？

如果这是一个问题，您不确定如何回答，考虑减少损失，并将DF1作为“好”记录和DF2作为“新”记录，

如果您坚持'Distrafers'评论，我们的下一步是通过构建分数表来滤除具有微小差异的真正新记录，

# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = [] 
# list of lists
# each inner list index correlates to the index for df2
# inner lists are 

# made up of tuples
# each tuple first element is the actual row from df1 that is matched
# second element is the score for matching (out of 100)

for i, row1 in df2.itterrows(): 
    thisRowsScores = []
    #df2 first because they are what we are scoring
    for j, row2 in df1.iterrows():
        s = scoreRows(row1, row2)
        if s>0: # only save rows and scores that matter
            thisRowsScores.append((row2, s))
    # at this point, you can either leave the scoring as a table and have comments refer how different differences relate back to some row
    # or you can just keep the best score like i'll be doing

    #sort by score
    sortedRowScores = thisRowsScores.sort(key=lambda x: x[1], reverse=True)
    rowScores.append(sortedRowScores[0])
    # appends empty list if no good matches found in df1
    # alternatively, remove 'reversed' from above and index at -1

可以索引它以添加“不同”评论

我们保存行本身的原因是， DF1 现在已经添加了“匹配”评论，

因为DF2中的每一行都有一个分数并引用了DF1中最匹配的行，我们可以将注释编辑为DF1中的该行以列出具有不同值的列。

但是在这一点上，我觉得DF现在需要回到DF2的引用，以便记录和重视这些差异所指的实际上是可获取的。

lets assume df1 (left) is our "source of truth" for what's considered an original record.

after running

diff_df = df1.merge(df2, how = 'outer' ,indicator=True).query('_merge == "left_only"').drop(columns='_merge')

take the output and split it into 2 df's.

df1 = diff_df[diff_df["Exist"] in ["both", "left_only"]]
df2 = diff_df[diff_df["Exist"] == "right_only"]

Right now, if you drop the "exist" row from df1, you'll have records where the comment would be "matching".

Let's assume you add the 'comments' column to df1

you could say that everything in df2 is a new record, but that would disregard the "price/total different".

If you really want the difference comment, now is a tricky bit where the 'how' really depends on what order columns matter most (title > flag > ...) and how much they matter (weighting system)

After you have a wighting system determined, you need a 'scoring' method that will compare two rows in order to see how similar they are based on the column ranking you determine.

# distributes weight so first is heaviest, last is lightest, total weight = 100
# if i was good i'd do this with numpy not manually
def getWeights(l):
    weights = [0 for col in l]
    total = 100
    while total > 0:
        for i, e in enumerate(l):
            for j in range(i+1):
                weights[j] += 1
                total -= 1
    return weights


def scoreRows(row1, row2):
    s = 0
    for i, colName in enumerate(colRank):
        if row1[colName] == row2[colName]:
            s += weights[i]


colRank = ['title', 'flag'] 
weights = getWeights(colRank)

Let's say only these 2 matter and the rest are considered 'modifications' to an original row

That is to say, if a row in df2 doesn't have a matching title OR flag for ANY row in df1, that row is a new record

What makes a row a new record is completely up to you.

Another way of thinking about it is that you need to determine what makes some row in df2 'differ' from some row in df1 and not a different row in df1

if you have 2 rows in df1
row1: [1, 2, 3, 4]
row2: [1, 6, 3, 7]

and you want to compare this row against that df
[1, 6, 5, 4]

this row has the same first element as both, the same second element as row2, and the same 4th element of row1.

so which row does it differ from?

if this is a question you aren't sure how to answer, consider cutting losses and just keep df1 as "good" records and df2 as "new" records

if you're sticking with the 'differs' comment, our next step is to filter out truly new records from records that have slight differences by building a score table

# to recap
# df1 has "both" and "left_only" records ("matching" comment)
# df2 has "right_only" records (new records and differing records)
rowScores = [] 
# list of lists
# each inner list index correlates to the index for df2
# inner lists are 

# made up of tuples
# each tuple first element is the actual row from df1 that is matched
# second element is the score for matching (out of 100)

for i, row1 in df2.itterrows(): 
    thisRowsScores = []
    #df2 first because they are what we are scoring
    for j, row2 in df1.iterrows():
        s = scoreRows(row1, row2)
        if s>0: # only save rows and scores that matter
            thisRowsScores.append((row2, s))
    # at this point, you can either leave the scoring as a table and have comments refer how different differences relate back to some row
    # or you can just keep the best score like i'll be doing

    #sort by score
    sortedRowScores = thisRowsScores.sort(key=lambda x: x[1], reverse=True)
    rowScores.append(sortedRowScores[0])
    # appends empty list if no good matches found in df1
    # alternatively, remove 'reversed' from above and index at -1

The reason we save the row itself is so that it can be indexed by df1 in order to add a "differ" comments

At this point, lets just say that df1 already has the comments "matching" added to it

Now that each row in df2 has a score and reference to the row it matched best in df1, we can edit the comment to that row in df1 to list the columns with different values.

But at this point, I feel as though that df now needs a reference back to df2 so that the record and values those difference refer to are actually gettable.

回复收藏 0 原文

~没有更多了~