How do I convert a list of high-dimensional tuples into a DataFrame?

Posted on 2025-01-18 08:30:50

I have a table that holds my embedding data. Each embedding is stored in the Embedding column as a tuple of 3000 numbers. I now need to map these embeddings to my dataset by index, and I am using this code to do it:

p_x = [p_embedding[p_embedding['p_id'] == int(pid)]['Embedding'] for pid in p_mapping]

Edit: Adding sample data
Edit 2: The tuple size is 300.

p_embedding with tuple of size 300:

p_id   Embedding
100    (0.11757241, -0.23792185, 0.30370793...)
101    (-0.1045902, 0.27551234, -0.15883833...)
102    (-0.0038427562, 0.091357835, -0.029324641...)

p_mapping with index mappings:

{'100': 0,
 '101': 1,
 '102': 2}

This gives me the list I want, with the embeddings in the correct order, but each embedding is still a single tuple in one column. The first three entries look like this:

[Series([], Name: Embedding, dtype: object),
 2463    (-0.080065295, 0.085681394, 0.044956923, 0.078...
 Name: Embedding, dtype: object,
 2510    (0.19006088, 0.1552349, -0.028743511, -0.25197...
 Name: Embedding, dtype: object,

I want to split each tuple into separate columns of a DataFrame, but when I call pd.DataFrame I just get a DataFrame of 3000+ columns with all NaN values. Is there a reason for this? Do I have to change the index of the list?
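
A likely explanation, sketched on toy data: the boolean-mask lookup returns a Series for every id (an empty one when the id is missing), and pd.DataFrame treats each Series in a list as a row whose index labels become column names. With thousands of distinct row labels, that would give thousands of columns that are almost entirely NaN, and the tuples are never unpacked. The table contents, the row labels, and the "999" key below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for p_embedding; the row labels mimic those in the output above.
p_embedding = pd.DataFrame(
    {"p_id": [100, 101], "Embedding": [(0.1, 0.2), (0.3, 0.4)]},
    index=[2463, 2510],
)
p_mapping = {"999": 0, "100": 1, "101": 2}  # "999" is absent from the table

p_x = [p_embedding[p_embedding["p_id"] == int(pid)]["Embedding"] for pid in p_mapping]

# Every element is a Series (empty when the id is missing), not a tuple.
kinds = [type(s).__name__ for s in p_x]
lengths = [len(s) for s in p_x]

# Each Series becomes a row and its index labels become column names,
# so the tuples stay whole and every non-matching cell is NaN.
df = pd.DataFrame(p_x)
print(df)
```

Here df has one column per original row label (2463 and 2510) rather than one per tuple element.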

那一片橙海, 2025-01-25 08:30:51

Try this:

import pandas as pd

df = pd.DataFrame({'p_id': ['A', 'B', 'C'], 
                   'embedding': [[('123', '456', '111', '111','123', '456', '111', '111','123', '456', '111', '111'),
                            ('124', '456', '111', '111','123', '456', '111', '111','123', '456', '111', '111'),
                            ('125', '456', '111', '111','123', '456', '111', '111','123', '456', '111', '111')],
                           [],
                           [('123', '555', '333''456', '111', '111','123', '456', '111', '111','123', '456', '111', '111')]]
                   })
final_rows = []

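for index, row in df.iterrows():
    if not row.embedding:
        # No tuples for this p_id: append a one-element row so the id stays intact
        # (a bare string risks being read as a sequence of characters).
        final_rows.append([row.p_id])
    for tup in row.embedding:
        # One output row per tuple: p_id followed by the tuple's values.
        final_rows.append([row.p_id, *tup])

# Rows shorter than the longest one are padded with None.
df2 = pd.DataFrame(final_rows)

# add_prefix returns a new DataFrame, so assign the result.
df2 = df2.add_prefix('col_')
df2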

which gives:

col_0 col_1 col_2   col_3 col_4 col_5 col_6 col_7 col_8 col_9 col_10 col_11  \
0     A   123   456     111   111   123   456   111   111   123    456    111   
1     A   124   456     111   111   123   456   111   111   123    456    111   
2     A   125   456     111   111   123   456   111   111   123    456    111   
3     B  None  None    None  None  None  None  None  None  None   None   None   
4     C   123   555  333456   111   111   123   456   111   111    123    456   

  col_12 col_13  
0    111   None  
1    111   None  
2    111   None  
3   None   None  
4    111    111
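
The answer above assumes each cell holds a list of tuples. For the layout in the question, where each cell is a single tuple, a shorter route might be to build the frame from the tuples directly and then order the rows by p_mapping. This is only a sketch on made-up data: the values, the "999" key, and the names wide and ordered_ids are illustrative, and it assumes the p_id values are unique.

```python
import pandas as pd

# Toy stand-in for p_embedding (3-number tuples instead of 300).
p_embedding = pd.DataFrame({
    "p_id": [100, 101, 102],
    "Embedding": [(0.1, -0.2, 0.3), (-0.1, 0.2, -0.15), (0.0, 0.09, -0.02)],
})
p_mapping = {"101": 0, "100": 1, "102": 2, "999": 3}  # "999" has no embedding

# One row per p_id, one column per tuple element.
wide = pd.DataFrame(p_embedding["Embedding"].tolist(), index=p_embedding["p_id"])

# Order the rows by mapped position; ids missing from the table become NaN rows.
ordered_ids = [int(pid) for pid, _ in sorted(p_mapping.items(), key=lambda kv: kv[1])]
p_x = wide.reindex(ordered_ids)
print(p_x)
```

If rows for missing ids are not wanted, p_x.dropna(how="all") would remove them, at the cost of the positions no longer matching p_mapping.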
