Pairwise identical values in different columns (drop connected components)

Published 2025-01-28 16:03:07 · 4943 characters · 3 views · 0 comments · Original


After applying the Levenshtein distance algorithm I get a dataframe like this:

Elemento_lista Item_ID Score idx ITEM_ID_Coincidencia
4 691776 100 5 691777
4 691776 100 6 691789
4 691776 100 7 691791
5 691777 100 4 691776
5 691777 100 6 691789
5 691777 100 7 691791
6 691789 100 4 691776
6 691789 100 5 691777
6 691789 100 7 691791
7 691791 100 4 691776
7 691791 100 5 691777
7 691791 100 6 691789
9 1407402 100 10 1407424
10 1407424 100 9 1407402

Elemento_lista column is the index of the element that is compared to others,
Item_ID is the id of the element,
Score is Score generated by the algorithm,
idx is the index of the element that was found as similar (same as Elemento_lista, but for elements that were found as similar),
ITEM_ID_Coincidencia is the id of the element found as similar

It's a small sample of the real DF (more than 300,000 rows). I'll need to drop lines that say the same thing. For example, if Elemento_lista 4 is equal to idx 5, 6 and 7, they are all the same, so I don't need the lines where 5 is equal to 4, 6 and 7, where 6 is equal to 4, 5 and 7, and where 7 is equal to 4, 5 and 6. The same for each Elemento_lista: value 9 is equal to idx 10, so I don't need the line where Elemento_lista 10 is equal to idx 9. How could I drop these lines in order to reduce the DF length?

Final DF should be:

Elemento_lista Item_ID Score idx ITEM_ID_Coincidencia
4 691776 100 5 691777
4 691776 100 6 691789
4 691776 100 7 691791
9 1407402 100 10 1407424

I don't know how to do this... is it possible?

Thanks in advance


Comments (2)

梦中楼上月下 2025-02-04 16:03:07


Prepare the data as in the example:

import pandas as pd

a = [
[4,691776,100,5,691777],
[4,691776,100,6,691789],
[4,691776,100,7,691791],
[5,691777,100,4,691776],
[5,691777,100,6,691789],
[5,691777,100,7,691791],
[6,691789,100,4,691776],
[6,691789,100,5,691777],
[6,691789,100,7,691791],
[7,691791,100,4,691776],
[7,691791,100,5,691777],
[7,691791,100,6,691789],
[9,1407402,100,10,1407424],
[10,1407424,100,9,1407402]
]
c = ['Elemento_lista', 'Item_ID', 'Score', 'idx', 'ITEM_ID_Coincidencia']
df = pd.DataFrame(data = a, columns = c)
df

Now you insert one column: it will contain a sorted pair of the two indexes.

tuples_of_indexes = [sorted([x[0], x[3]]) for x in df.values]
df.insert(5, 'tuple_of_indexes', (tuples_of_indexes))

Then the whole dataframe is sorted by the inserted column:

df = df.sort_values(by=['tuple_of_indexes'])

Then you eliminate the rows whose inserted column is a duplicate:

df = df[~df['tuple_of_indexes'].apply(tuple).duplicated()]

Finally, you eliminate the inserted column 'tuple_of_indexes' (assign the result back so the drop takes effect):

df = df.drop(['tuple_of_indexes'], axis=1)

The output is:

Elemento_lista  Item_ID Score   idx ITEM_ID_Coincidencia
0   4   691776  100 5   691777
1   4   691776  100 6   691789
2   4   691776  100 7   691791
4   5   691777  100 6   691789
5   5   691777  100 7   691791
8   6   691789  100 7   691791
12  9   1407402 100 10  1407424
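The same unordered-pair deduplication can also be done without a helper column, by row-wise sorting the two index columns with NumPy. This is an editor's sketch of an equivalent, more vectorized variant, not part of the original answer:

```python
import numpy as np
import pandas as pd

a = [
    [4, 691776, 100, 5, 691777],
    [4, 691776, 100, 6, 691789],
    [4, 691776, 100, 7, 691791],
    [5, 691777, 100, 4, 691776],
    [5, 691777, 100, 6, 691789],
    [5, 691777, 100, 7, 691791],
    [6, 691789, 100, 4, 691776],
    [6, 691789, 100, 5, 691777],
    [6, 691789, 100, 7, 691791],
    [7, 691791, 100, 4, 691776],
    [7, 691791, 100, 5, 691777],
    [7, 691791, 100, 6, 691789],
    [9, 1407402, 100, 10, 1407424],
    [10, 1407424, 100, 9, 1407402],
]
c = ['Elemento_lista', 'Item_ID', 'Score', 'idx', 'ITEM_ID_Coincidencia']
df = pd.DataFrame(a, columns=c)

# Sort the two index columns row-wise, so (5, 4) becomes (4, 5):
# each unordered pair now has a single canonical form.
pairs = np.sort(df[['Elemento_lista', 'idx']].to_numpy(), axis=1)

# Keep only the first occurrence of each canonical pair.
mask = ~pd.Series(list(map(tuple, pairs)), index=df.index).duplicated()
df2 = df[mask]
```

On the sample this keeps the same seven rows as the output above, and because it avoids a Python-level sort per row and an object-dtype sort of the whole frame, it should scale better to the 300,000-row DataFrame.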


仅冇旳回忆 2025-02-04 16:03:07


This can be approached using graph theory.

You have the following relationships between your IDs:

[graph of the relationships between the IDs]

So what you need to do is find the subgraphs.

For this we can use networkx's connected_components function:

# pip install networkx
import networkx as nx
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx')

# get "first" (arbitrary) node for each subgraph
# note that sets (unsorted) are used
# so there is no guarantee on any node being "first" item
nodes = [tuple(g)[0] for g in nx.connected_components(G) if g]
# [4, 9]

# filter DataFrame
df2 = df[df['Elemento_lista'].isin(nodes)]

output:

    Elemento_lista  Item_ID  Score  idx  ITEM_ID_Coincidencia
0                4   691776    100    5                691777
1                4   691776    100    6                691789
2                4   691776    100    7                691791
12               9  1407402    100   10               1407424
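If you'd rather avoid the networkx dependency, the same connected-components filter can be sketched with a small union-find. This is an editor's sketch, not part of the original answer; `find`, `union` and `reps` are names introduced here:

```python
import pandas as pd
from collections import defaultdict

a = [
    [4, 691776, 100, 5, 691777],
    [4, 691776, 100, 6, 691789],
    [4, 691776, 100, 7, 691791],
    [5, 691777, 100, 4, 691776],
    [5, 691777, 100, 6, 691789],
    [5, 691777, 100, 7, 691791],
    [6, 691789, 100, 4, 691776],
    [6, 691789, 100, 5, 691777],
    [6, 691789, 100, 7, 691791],
    [7, 691791, 100, 4, 691776],
    [7, 691791, 100, 5, 691777],
    [7, 691791, 100, 6, 691789],
    [9, 1407402, 100, 10, 1407424],
    [10, 1407424, 100, 9, 1407402],
]
c = ['Elemento_lista', 'Item_ID', 'Score', 'idx', 'ITEM_ID_Coincidencia']
df = pd.DataFrame(a, columns=c)

parent = {}

def find(x):
    # Path-halving find: walk to the root, shortening the chain as we go.
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

# Every (Elemento_lista, idx) row is an edge merging two nodes into one component.
for s, t in zip(df['Elemento_lista'], df['idx']):
    union(s, t)

# Pick one representative per component: its smallest node.
members = defaultdict(set)
for node in parent:
    members[find(node)].add(node)
reps = {min(m) for m in members.values()}

# Keep only the rows whose source node is a representative.
df2 = df[df['Elemento_lista'].isin(reps)]
```

Like the answer's arbitrary-first-node choice, this assumes each component's representative actually occurs in the Elemento_lista column; on the sample it keeps rows 0, 1, 2 and 12, matching the question's desired final DF.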

Update: real data

Your real data is hyperconnected, ultimately forming only 2 groups.

[graph of the real data]

You can change strategy here and use a directed graph and strongly_connected_components

import networkx as nx
#df = pd.read_csv('ADIDAS_CALZADO.csv', index_col=0)
G = nx.from_pandas_edgelist(df, source='Elemento_lista', target='idx', create_using=nx.DiGraph)

# len(list(nx.strongly_connected_components(G)))
# 150 subgraphs

nodes = [tuple(g)[0] for g in nx.strongly_connected_components(G) if g]

df2 = df[df['Elemento_lista'].isin(nodes)]

# len(df2)
# only 2,910 nodes left out of the 25,371 initial ones

new graph on the filtered df2:

