使用正则表达式拆分 pandas 数据框中的列
我的 pandas 数据框中有一列,需要使用多个分隔符将其拆分为多个列,而不会丢失分隔符。 pandas 数据框如下:
Unnamed: 0 Unnamed: 0.1 Sr. No. Project Code ... Total Amount Sanctioned (In Lakhs) Supported By Current Status of Project Component Details
0 0 0 1 28AP2000000002 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE09350.00MT
1 1 1 2 28AP2000000001 ... 31.93 Through NABARD Completed COLD / CA STORAGECOLD STORAGE04800.00MT
2 2 2 3 28AP2000000004 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08480.00MT
3 3 3 4 28AP2000000003 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08650.00MT
4 4 4 5 28AP2000000005 ... 34.20 Through NABARD Completed COLD / CA STORAGECOLD STORAGE03420.00MT
... ... ... ... ... ... ... ... ... ...
2693 47 47 48 19AWB0004861 ... 7.29 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 48 48 49 19AWB0004860 ... 6.59 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 49 49 50 19AWB0004864 ... 12.86 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 50 50 51 19AWB0004865 ... 2.39 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 51 51 52 19AWB0004804 ... 4.34 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...
要拆分的列是最后一列“组件详细信息”。原版的如何拆分列的例子如下:
如您所见,我必须将最后一列分成 5 个不同的列。我的方法是使用正则表达式来分隔一列又一列。例如,可以使用 fin = re.split(r"(RE|GE)", str, maxsplit=1) 将第一列与其余列分开,但我无法理解如何将其应用于整个 pandas 列。有人可以帮忙吗?
当前表单中的列是:
0 COLD / CA STORAGECOLD STORAGE09350.00MT
1 COLD / CA STORAGECOLD STORAGE04800.00MT
2 COLD / CA STORAGECOLD STORAGE08480.00MT
3 COLD / CA STORAGECOLD STORAGE08650.00MT
4 COLD / CA STORAGECOLD STORAGE03420.00MT
...
2693 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..
我正在寻找的输出如下所示:
COLD / CA STORAGE COLD STORAGE 0 200 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17700 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17400 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 21000 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 10199 TONNES
I have a column in my pandas dataframe that needs to be split using multiple delimiters into multiple columns without the loss of delimiters. The pandas dataframe is as follows:
Unnamed: 0 Unnamed: 0.1 Sr. No. Project Code ... Total Amount Sanctioned (In Lakhs) Supported By Current Status of Project Component Details
0 0 0 1 28AP2000000002 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE09350.00MT
1 1 1 2 28AP2000000001 ... 31.93 Through NABARD Completed COLD / CA STORAGECOLD STORAGE04800.00MT
2 2 2 3 28AP2000000004 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08480.00MT
3 3 3 4 28AP2000000003 ... 50.00 Through NABARD Completed COLD / CA STORAGECOLD STORAGE08650.00MT
4 4 4 5 28AP2000000005 ... 34.20 Through NABARD Completed COLD / CA STORAGECOLD STORAGE03420.00MT
... ... ... ... ... ... ... ... ... ...
2693 47 47 48 19AWB0004861 ... 7.29 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 48 48 49 19AWB0004860 ... 6.59 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 49 49 50 19AWB0004864 ... 12.86 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 50 50 51 19AWB0004865 ... 2.39 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 51 51 52 19AWB0004804 ... 4.34 NHB Completed POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...
The column to be split is the last column 'Component Details'. The original example of how to split the column is as follows:
As you can see, I have to split the last column into 5 different columns. My approach was to use regular expression to separate one column after another. Like for example, the first column could be separated with the rest by using fin = re.split(r"(RE|GE)", str, maxsplit=1)
but I am not able to understand how to apply this for the entire pandas column. Can anybody help?
The column in the current form is:
0 COLD / CA STORAGECOLD STORAGE09350.00MT
1 COLD / CA STORAGECOLD STORAGE04800.00MT
2 COLD / CA STORAGECOLD STORAGE08480.00MT
3 COLD / CA STORAGECOLD STORAGE08650.00MT
4 COLD / CA STORAGECOLD STORAGE03420.00MT
...
2693 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697 POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..
The output that I'm looking for looks like:
COLD / CA STORAGE COLD STORAGE 0 200 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 75 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
COLD / CA STORAGE COLD STORAGE 1 5000 MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17700 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 17400 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 21000 TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 3 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 5 NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT NA 10199 TONNES
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
为了模仿读取您的数据文件,我确实
查看了结构,我的猜测是类别部分是由固定宽度列确定的,因此为了读取数据,您需要使用固定字段读取器:
这会产生一个我已经获得了前 31 个字符指示的组件。数据框现在看起来像:
为了根据正则表达式进行分割,您需要有 pandas 版本 1.4 及更高版本。此外,在您的情况下,您还想保留匹配的模式。没有直接的方法可以得到这个,因此我编写了一个小函数来检索匹配模式的列表:
该函数返回数据系列中正则表达式的匹配列表。
首先根据 Na 的 2 位数字检索组件数量:
现在您可以拆分组件列(这会丢弃匹配的模式):
再次使用该函数获取单位并将其从重量列中删除:
最后,将所有内容放回一起:
结果如下所示:
In order to mimic reading your data file I do
Looking at the structure, my guess is that the part with category is determined by a fixed width column, therefore in order to read the data you need to use a fixed field reader:
This yields a dataframe where I have already obtained the component indicated by the first 31 characters. The data frame now looks like:
In order to split on a regular expression you need to have pandas version 1.4 and higher. Moreover, in your case you also want to keep the matched pattern. There is not a direct way to get this, therefore I wrote a small function to retrieve a list of matched patterns:
This function returns a list of matches of regex in a data series.
Start with retrieving the number of components based on a 2 digit number of Na:
Now you can split your component column (which throws away the matched pattenr):
Use the function again to get the units and remove them from the weight column:
Finally, put everything back together:
The result looks like: