How do I isolate the numbers from a column and create 3 columns?

Posted on 2025-01-20 14:20:31

I am trying to access a column, filter out its numbers, and then split them into 3 columns, but I have only been getting errors. This is what I am trying:

dsc = df["Descricao"].str.findall("\d+")
dsc

The Output:
0                   []
1       [475, 2000, 3]
2        [65, 2000, 2]
3        [51, 2000, 3]
4       [320, 2000, 3]
             ...      
2344               NaN
2345    [480, 2000, 1]
2346     [32, 2000, 6]
2347    [250, 2000, 1]
2348               NaN
Name: Descricao, Length: 2349, dtype: object

Then I try to split, and every time I get this kind of error:

df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
df.head(5)

The Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_15388/2481153233.py in <module>
----> 1 df[['Larg','comp', 'qtd']] = dsc.str.split(',',expand=True)
      2 df.head(5)

~\anaconda3\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3598             self._setitem_frame(key, value)
   3599         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3600             self._setitem_array(key, value)
   3601         elif isinstance(value, DataFrame):
   3602             self._set_item_frame_value(key, value)

~\anaconda3\lib\site-packages\pandas\core\frame.py in _setitem_array(self, key, value)
   3637         else:
   3638             if isinstance(value, DataFrame):
-> 3639                 check_key_length(self.columns, key, value)
   3640                 for k1, k2 in zip(key, value.columns):
   3641                     self[k1] = value[k2]

~\anaconda3\lib\site-packages\pandas\core\indexers.py in check_key_length(columns, key, value)
    426     if columns.is_unique:
    427         if len(value.columns) != len(key):
--> 428             raise ValueError("Columns must be same length as key")
    429     else:
    430         # Missing keys in columns are represented as -1

ValueError: Columns must be same length as key

I think it has something to do with str.findall generating lists.
Does anybody know how I can solve this?
For reference, all my columns are objects.
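
For context, here is a minimal sketch (using a hypothetical toy column, not the real data) of what is most likely happening: str.findall returns Python lists, so the follow-up .str.split has no strings to split, returns only NaN, and with expand=True that becomes fewer than three columns, which is why assigning to three names raises "Columns must be same length as key".

import pandas as pd

toy = pd.DataFrame({"Descricao": ["475x2000 3un", "sem numeros"]})   # hypothetical contents
dsc = toy["Descricao"].str.findall(r"\d+")   # each element is a Python list, e.g. ['475', '2000', '3']
print(dsc.str.split(",", expand=True))       # all NaN in a single column -> mismatch with the 3 keys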

Comments (3)

心安伴我暖 2025-01-27 14:20:32

You could try this:

dsc = pd.DataFrame(df["Descricao"].str.findall(r"\d+").tolist(), columns=['Larg','comp', 'qtd'])

df = pd.concat([df, dsc], axis=1)

Note that this may not work if any row yields more than three numbers (I assume this will not be the case, given your attempt).

This method came from here.
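
An alternative sketch, not from the answer above: Series.str.extract can pull the three numbers into named columns in one step and keeps df's original index, so no .tolist()/concat is needed. The regex below assumes the three numbers in "Descricao" are separated by non-digit characters; adjust it to the real format of the column.

import pandas as pd

df = pd.DataFrame({"Descricao": ["475x2000 3un", "65x2000 2un", "sem medidas"]})   # toy data
parts = df["Descricao"].str.extract(r"(?P<Larg>\d+)\D+(?P<comp>\d+)\D+(?P<qtd>\d+)")
df[["Larg", "comp", "qtd"]] = parts   # rows without three numbers get NaN in all three columns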

流星番茄 2025-01-27 14:20:32

In the general case, some of the inputs may not have strings that parse to 3 numerical values.

Here is a way to do what the question asks while filling the new columns for any unusual rows with NaNs. If the desired behavior for non-standard rows is different, the logic can be adjusted as needed.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Descricao' : ['', '475,2000,3', '65,2000,2', np.nan, 'abc,def,ghi', '1,2', '1']
})
print(f"\nInput dataframe:\n{df}")

df2 = df["Descricao"].str.findall("\d+").to_frame()
print(f"\nDataframe with lists of 3 where possible:\n{df2}")

df2["Descricao"] = df2.apply(lambda x: 
    x["Descricao"] 
        if (len(x["Descricao"]) if isinstance(x["Descricao"], list) else 0) == 3 else 
    [np.NaN]*3, 
    axis=1)
print(f"\nDataframe with lists include NaNs for incomplete data:\n{df2}")

df2[['Larg','comp', 'qtd']] = pd.DataFrame(df2["Descricao"].tolist(), columns=['Larg','comp', 'qtd'])
df2 = df2.drop(['Descricao'], axis=1)
print(f"\nResult dataframe with NaNs for incomplete inputs:\n{df2}")

Sample Output:


Input dataframe:
     Descricao
0
1   475,2000,3
2    65,2000,2
3          NaN
4  abc,def,ghi
5          1,2
6            1

Dataframe with lists of 3 where possible:
        Descricao
0              []
1  [475, 2000, 3]
2   [65, 2000, 2]
3             NaN
4              []
5          [1, 2]
6             [1]

Dataframe with lists include NaNs for incomplete data:
         Descricao
0  [nan, nan, nan]
1   [475, 2000, 3]
2    [65, 2000, 2]
3  [nan, nan, nan]
4  [nan, nan, nan]
5  [nan, nan, nan]
6  [nan, nan, nan]

Result dataframe with NaNs for incomplete inputs:
  Larg  comp  qtd
0  NaN   NaN  NaN
1  475  2000    3
2   65  2000    2
3  NaN   NaN  NaN
4  NaN   NaN  NaN
5  NaN   NaN  NaN
6  NaN   NaN  NaN
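
A small follow-up, not part of the answer above: the values kept in the three new columns are still strings, so if numeric columns are wanted, a sketch like this (assuming the df2 produced above) converts them, with errors="coerce" keeping the NaN placeholders as NaN instead of raising.

for col in ["Larg", "comp", "qtd"]:
    df2[col] = pd.to_numeric(df2[col], errors="coerce")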

半夏半凉 2025-01-27 14:20:32

Thank you all! Following @constantstranger's solution, I took parts of it and developed a new version; it gave me an easy start. In the end, my solution was:

dsc = ndf['descricao'].str.findall(r'\d+')   # Separated only the numerical elements
# Created 3 lists for the elements
larg = []
comp = []
qtd = []
for lines in dsc:
    # If the row did not yield exactly 3 numbers (e.g. empty or NaN), skip it.
    if not isinstance(lines, list) or len(lines) != 3:
        continue
    larg.append(lines[0])
    comp.append(lines[1])
    qtd.append(lines[2])
# Then I checked the length of all three lists
print(len(larg), len(comp), len(qtd))

lis = [larg, comp, qtd]
df1 = pd.DataFrame(lis).transpose()
df1.columns = ['larg', 'comp', 'qtd']
df1

The Output:

    larg    comp    qtd
0   32  2000    6
1   46  1000    1
2   320 100 20
3   220 100 50
4   220 50  30
... ... ... ...
1404    50  2000    1
1405    52  200 2
1406    48  2000    1
1407    325 3000    1
1408    33  2000    2
1409 rows × 3 columns

I guess it's not the ideal solution for big data, but it's working for now. I tried the .findall expression with to_frame(), but for some reason every length went to zero.
So now I'll be looking for a way to optimize.
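
A possible optimization along those lines, as a sketch (assuming the same ndf["descricao"] column and that the kept rows really contain only digit strings): build the frame directly from the rows that yield three numbers instead of filling three parallel lists.

nums = ndf["descricao"].str.findall(r"\d+")
rows = [v for v in nums if isinstance(v, list) and len(v) == 3]
df1 = pd.DataFrame(rows, columns=["larg", "comp", "qtd"]).astype(int)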
