使用正则表达式拆分 pandas 数据框中的列

发布于 2025-01-10 21:48:25 字数 4723 浏览 0 评论 0原文

我的 pandas 数据框中有一列,需要使用多个分隔符将其拆分为多个列,而不会丢失分隔符。 pandas 数据框如下:

      Unnamed: 0  Unnamed: 0.1  Sr. No.    Project Code  ... Total Amount Sanctioned (In Lakhs)    Supported By Current Status of Project                                  Component Details
0              0             0        1  28AP2000000002  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE09350.00MT
1              1             1        2  28AP2000000001  ...                              31.93  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE04800.00MT
2              2             2        3  28AP2000000004  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08480.00MT
3              3             3        4  28AP2000000003  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08650.00MT
4              4             4        5  28AP2000000005  ...                              34.20  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE03420.00MT
...          ...           ...      ...             ...  ...                                ...             ...                       ...                                                ...
2693          47            47       48    19AWB0004861  ...                               7.29             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694          48            48       49    19AWB0004860  ...                               6.59             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695          49            49       50    19AWB0004864  ...                              12.86             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696          50            50       51    19AWB0004865  ...                               2.39             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697          51            51       52    19AWB0004804  ...                               4.34             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...

要拆分的列是最后一列“组件详细信息”。原版的如何拆分列的例子如下: 输入图片此处描述

在此处输入图像描述

如您所见,我必须将最后一列分成 5 个不同的列。我的方法是使用正则表达式来分隔一列又一列。例如,可以使用 fin = re.split(r"(RE|GE)", str, maxsplit=1) 将第一列与其余列分开,但我无法理解如何将其应用于整个 pandas 列。有人可以帮忙吗?

当前表单中的列是:

0                 COLD / CA STORAGECOLD STORAGE09350.00MT
1                 COLD / CA STORAGECOLD STORAGE04800.00MT
2                 COLD / CA STORAGECOLD STORAGE08480.00MT
3                 COLD / CA STORAGECOLD STORAGE08650.00MT
4                 COLD / CA STORAGECOLD STORAGE03420.00MT
                              ...
2693    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..

我正在寻找的输出如下所示:

COLD / CA STORAGE           COLD STORAGE    0   200     MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17700   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17400   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  21000   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  10199   TONNES

I have a column in my pandas dataframe that needs to be split using multiple delimiters into multiple columns without the loss of delimiters. The pandas dataframe is as follows:

      Unnamed: 0  Unnamed: 0.1  Sr. No.    Project Code  ... Total Amount Sanctioned (In Lakhs)    Supported By Current Status of Project                                  Component Details
0              0             0        1  28AP2000000002  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE09350.00MT
1              1             1        2  28AP2000000001  ...                              31.93  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE04800.00MT
2              2             2        3  28AP2000000004  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08480.00MT
3              3             3        4  28AP2000000003  ...                              50.00  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE08650.00MT
4              4             4        5  28AP2000000005  ...                              34.20  Through NABARD                 Completed            COLD / CA STORAGECOLD STORAGE03420.00MT
...          ...           ...      ...             ...  ...                                ...             ...                       ...                                                ...
2693          47            47       48    19AWB0004861  ...                               7.29             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694          48            48       49    19AWB0004860  ...                               6.59             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695          49            49       50    19AWB0004864  ...                              12.86             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696          50            50       51    19AWB0004865  ...                               2.39             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697          51            51       52    19AWB0004804  ...                               4.34             NHB                 Completed  POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019...

The column to be split is the last column 'Component Details'. The original example of how to split the column is as follows:
enter image description here

enter image description here

As you can see, I have to split the last column into 5 different columns. My approach was to use regular expression to separate one column after another. Like for example, the first column could be separated with the rest by using fin = re.split(r"(RE|GE)", str, maxsplit=1) but I am not able to understand how to apply this for the entire pandas column. Can anybody help?

The column in the current form is:

0                 COLD / CA STORAGECOLD STORAGE09350.00MT
1                 COLD / CA STORAGECOLD STORAGE04800.00MT
2                 COLD / CA STORAGECOLD STORAGE08480.00MT
3                 COLD / CA STORAGECOLD STORAGE08650.00MT
4                 COLD / CA STORAGECOLD STORAGE03420.00MT
                              ...
2693    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2694    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
2695    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2696    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
2697    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019..

The output that I'm looking for looks like:

COLD / CA STORAGE           COLD STORAGE    0   200     MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   75      MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
COLD / CA STORAGE           COLD STORAGE    1   5000    MT
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17700   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  17400   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  21000   TONNES
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  3       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  5       NOS
POST HARVEST INFRASTRUCTURE CONVEYOR BELT   NA  10199   TONNES

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

青巷忧颜 2025-01-17 21:48:25

为了模仿读取您的数据文件,我确实

import re
import pandas as pd
import io

text = """Component Details
              COLD / CA STORAGECOLD STORAGE09350.00MT
              COLD / CA STORAGECOLD STORAGE04800.00MT
              COLD / CA STORAGECOLD STORAGE08480.00MT
              COLD / CA STORAGECOLD STORAGE08650.00MT
              COLD / CA STORAGECOLD STORAGE03420.00MT
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019TONES
"""

查看了结构,我的猜测是类别部分是由固定宽度列确定的,因此为了读取数据,您需要使用固定字段读取器:

df = pd.read_fwf(io.StringIO(text), widths=[31, 100], skiprows=1, header=None)
df.columns = ["Category", "Component"]

这会产生一个我已经获得了前 31 个字符指示的组件。数据框现在看起来像:

                      Category                 Component
0            COLD / CA STORAGE    COLD STORAGE09350.00MT
1            COLD / CA STORAGE    COLD STORAGE04800.00MT
2            COLD / CA STORAGE    COLD STORAGE08480.00MT
3            COLD / CA STORAGE    COLD STORAGE08650.00MT
4            COLD / CA STORAGE    COLD STORAGE03420.00MT
5  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
6  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
7  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
8  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELTNA1019TONES

为了根据正则表达式进行分割,您需要有 pandas 版本 1.4 及更高版本。此外,在您的情况下,您还想保留匹配的模式。没有直接的方法可以得到这个,因此我编写了一个小函数来检索匹配模式的列表:

def get_match_list(regex, dataserie):
    matches = list()
    for line in dataserie:
        if match := re.search(regex, line):
            matches.append(match.group(0))
        else:
            matches.append(None)
    return matches

该函数返回数据系列中正则表达式的匹配列表。

首先根据 Na 的 2 位数字检索组件数量:

df["number"] = get_match_list("NA|\d{2}", df["Component"])

现在您可以拆分组件列(这会丢弃匹配的模式):

details = df["Component"].str.split("NA|\d{2}", expand=True, regex=True, n=1)
details.columns = ["Component", "Weight"]

再次使用该函数获取单位并将其从重量列中删除:

details["unit"] = get_match_list("[A-Z]+", details["Weight"])
details["Weight"] = details["Weight"].str.replace("[A-Z]", "", regex=True)

最后,将所有内容放回一起:

total = pd.concat([df[["Category", "number"]], details], axis=1)
total = total[["Category", "Component", "number", "Weight", "unit"]]

结果如下所示:

                     Category      Component number  Weight   unit
0            COLD / CA STORAGE   COLD STORAGE     09  350.00     MT
1            COLD / CA STORAGE   COLD STORAGE     04  800.00     MT
2            COLD / CA STORAGE   COLD STORAGE     08  480.00     MT
3            COLD / CA STORAGE   COLD STORAGE     08  650.00     MT
4            COLD / CA STORAGE   COLD STORAGE     03  420.00     MT
5  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
6  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
7  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
8  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    1019  TONES

In order to mimic reading your data file I do

import re
import pandas as pd
import io

text = """Component Details
              COLD / CA STORAGECOLD STORAGE09350.00MT
              COLD / CA STORAGECOLD STORAGE04800.00MT
              COLD / CA STORAGECOLD STORAGE08480.00MT
              COLD / CA STORAGECOLD STORAGE08650.00MT
              COLD / CA STORAGECOLD STORAGE03420.00MT
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA3.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA5.00NOS
    POST HARVEST INFRASTRUCTURECONVEYOR BELTNA1019TONES
"""

Looking at the structure, my guess is that the part with category is determined by a fixed width column, therefore in order to read the data you need to use a fixed field reader:

df = pd.read_fwf(io.StringIO(text), widths=[31, 100], skiprows=1, header=None)
df.columns = ["Category", "Component"]

This yields a dataframe where I have already obtained the component indicated by the first 31 characters. The data frame now looks like:

                      Category                 Component
0            COLD / CA STORAGE    COLD STORAGE09350.00MT
1            COLD / CA STORAGE    COLD STORAGE04800.00MT
2            COLD / CA STORAGE    COLD STORAGE08480.00MT
3            COLD / CA STORAGE    COLD STORAGE08650.00MT
4            COLD / CA STORAGE    COLD STORAGE03420.00MT
5  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
6  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA3.00NOS
7  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
8  POST HARVEST INFRASTRUCTURE    CONVEYOR BELTNA5.00NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELTNA1019TONES

In order to split on a regular expression you need to have pandas version 1.4 and higher. Moreover, in your case you also want to keep the matched pattern. There is not a direct way to get this, therefore I wrote a small function to retrieve a list of matched patterns:

def get_match_list(regex, dataserie):
    matches = list()
    for line in dataserie:
        if match := re.search(regex, line):
            matches.append(match.group(0))
        else:
            matches.append(None)
    return matches

This function returns a list of matches of regex in a data series.

Start with retrieving the number of components based on a 2 digit number of Na:

df["number"] = get_match_list("NA|\d{2}", df["Component"])

Now you can split your component column (which throws away the matched pattenr):

details = df["Component"].str.split("NA|\d{2}", expand=True, regex=True, n=1)
details.columns = ["Component", "Weight"]

Use the function again to get the units and remove them from the weight column:

details["unit"] = get_match_list("[A-Z]+", details["Weight"])
details["Weight"] = details["Weight"].str.replace("[A-Z]", "", regex=True)

Finally, put everything back together:

total = pd.concat([df[["Category", "number"]], details], axis=1)
total = total[["Category", "Component", "number", "Weight", "unit"]]

The result looks like:

                     Category      Component number  Weight   unit
0            COLD / CA STORAGE   COLD STORAGE     09  350.00     MT
1            COLD / CA STORAGE   COLD STORAGE     04  800.00     MT
2            COLD / CA STORAGE   COLD STORAGE     08  480.00     MT
3            COLD / CA STORAGE   COLD STORAGE     08  650.00     MT
4            COLD / CA STORAGE   COLD STORAGE     03  420.00     MT
5  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
6  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    3.00    NOS
7  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
8  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    5.00    NOS
9  POST HARVEST INFRASTRUCTURE  CONVEYOR BELT     NA    1019  TONES
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文