After applying the Levenshtein distance algorithm I get a dataframe like this:
Elemento_lista | Item_ID | Score | idx | ITEM_ID_Coincidencia
4 | 691776 | 100 | 5 | 691777
4 | 691776 | 100 | 6 | 691789
4 | 691776 | 100 | 7 | 691791
5 | 691777 | 100 | 4 | 691776
5 | 691777 | 100 | 6 | 691789
5 | 691777 | 100 | 7 | 691791
6 | 691789 | 100 | 4 | 691776
6 | 691789 | 100 | 5 | 691777
6 | 691789 | 100 | 7 | 691791
7 | 691791 | 100 | 4 | 691776
7 | 691791 | 100 | 5 | 691777
7 | 691791 | 100 | 6 | 691789
9 | 1407402 | 100 | 10 | 1407424
10 | 1407424 | 100 | 9 | 1407402
The Elemento_lista column is the index of the element that is compared to the others,
Item_ID is the id of the element,
Score is the score generated by the algorithm,
idx is the index of the element that was found to be similar (same as Elemento_lista, but for the elements found as similar),
ITEM_ID_Coincidencia is the id of the element found as similar.
This is a small sample of the real DF (more than 300,000 rows). I need to drop the lines that describe the same match. For example, if Elemento_lista 4 is equal to idx 5, 6 and 7, then they are all the same, so I don't need the lines where 5 is equal to 4, 6 and 7, where 6 is equal to 4, 5 and 7, or where 7 is equal to 4, 5 and 6. The same goes for every Elemento_lista: value 9 is equal to idx 10, so I don't need the line where Elemento_lista 10 is equal to idx 9. How could I drop these lines in order to reduce the DF length?
Final DF should be:
Elemento_lista | Item_ID | Score | idx | ITEM_ID_Coincidencia
4 | 691776 | 100 | 5 | 691777
4 | 691776 | 100 | 6 | 691789
4 | 691776 | 100 | 7 | 691791
9 | 1407402 | 100 | 10 | 1407424
I don't know how to do this... is it possible?
Thanks in advance
Comments (2)
Prepare data like the example.
Now insert one column: it will contain a tuple of the 2 sorted indexes.
Then sort the whole dataframe by the inserted column.
Then eliminate the rows that repeat the inserted column.
Finally, eliminate the inserted column 'tuple_of_indexes'.
The output is:
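The code blocks from this answer did not survive extraction. A minimal pandas sketch of the four steps described above, run on the sample data from the question, might look like this (column names taken from the post):

```python
import pandas as pd

# Sample data shaped like the question's dataframe
df = pd.DataFrame({
    'Elemento_lista':       [4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 9, 10],
    'Item_ID':              [691776, 691776, 691776, 691777, 691777, 691777,
                             691789, 691789, 691789, 691791, 691791, 691791,
                             1407402, 1407424],
    'Score':                [100] * 14,
    'idx':                  [5, 6, 7, 4, 6, 7, 4, 5, 7, 4, 5, 6, 10, 9],
    'ITEM_ID_Coincidencia': [691777, 691789, 691791, 691776, 691789, 691791,
                             691776, 691777, 691791, 691776, 691777, 691789,
                             1407424, 1407402],
})

# 1) Insert a column holding the pair of indexes in sorted order,
#    so that (5, 4) and (4, 5) become the same key.
df['tuple_of_indexes'] = df.apply(
    lambda r: tuple(sorted((r['Elemento_lista'], r['idx']))), axis=1)

# 2) Sort the dataframe by the inserted column (stable sort so that
#    "first occurrence" below is deterministic).
df = df.sort_values('tuple_of_indexes', kind='stable')

# 3) Eliminate rows whose pair key repeats, keeping the first one.
df = df.drop_duplicates(subset='tuple_of_indexes', keep='first')

# 4) Eliminate the inserted column.
df = df.drop(columns='tuple_of_indexes').reset_index(drop=True)

print(df)
```

Note that this keeps one row per unordered pair, so the sample collapses from 14 rows to 7 (one row for each of the pairs 4-5, 4-6, 4-7, 5-6, 5-7, 6-7 and 9-10).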
This can be approached using graph theory.
You have the following relationships between your IDs:
So what you need to do is find the subgraphs.
For this we can use networkx's connected_components function.
Update: real data
Your real data is highly connected, forming only 2 groups in the end.
You can change strategy here and use a directed graph and strongly_connected_components, building the new graph on the filtered df2:
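The construction of df2 is not shown in the post; as an illustration of the directed strategy only, the stand-in data below has one mutual link (4 ↔ 5) and one one-way link (9 → 10), so strongly_connected_components groups only the nodes that point at each other:

```python
import pandas as pd
import networkx as nx

# Hypothetical stand-in for the filtered df2 from the answer above;
# its real construction is not shown in the post.
df2 = pd.DataFrame({
    'Elemento_lista': [4, 5, 9],
    'idx':            [5, 4, 10],
})

# Build a *directed* graph this time: Elemento_lista -> idx
G = nx.from_pandas_edgelist(df2, source='Elemento_lista', target='idx',
                            create_using=nx.DiGraph)

# Strongly connected components only merge nodes reachable from each
# other in both directions; one-way links stay as separate nodes.
sccs = list(nx.strongly_connected_components(G))
print(sccs)
```

Here 4 and 5 end up in one component while 9 and 10 stay separate, which is the behaviour that distinguishes the directed strategy from plain connected_components.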