How to separate specific numbers from text data in Python

Posted on 2025-01-12 16:12:19

I have a pandas DataFrame:

id     adress

0     Jame Homie Street. N:60 5555242424 La
1     London. 2322325234243 Stw St. N 8 St.bridge
2     32424244234 ddd st. ss Sk. N 63 Manchester
3     Mou st 147 Rochester Liv 33424245223

I want to separate out the long numbers (such as 5555242424, 2322325234243, 32424244234, 33424245223) and put them in a new feature (column).

Sample output:

id     adress                                           number

0     Jame Homie Street. N:60 La                      5555242424 
1     London. Stw St. N 8 St.bridge                   2322325234243 
2     ddd st. ss Sk. N 63 Manchester                  32424244234 
3     Mou st 147 Rochester Liv                        33424245223

Comments (3)

弄潮 2025-01-19 16:12:20

Assuming you want to extract the first number that has at least 4 digits (so it ignores 60, 8, 63, 147 in your example), you can use:

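# Raw strings keep "\d" from being treated as a string escape.
# expand=False returns a Series rather than a one-column DataFrame.
df_payers["number"] = df_payers["adress"].str.extract(r"(\d{4,})", expand=False)
df_payers["adress"] = df_payers["adress"].str.replace(r"(\d{4,})", "", regex=True)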

>>> df_payers
   id                           adress         number
0   0      Jame Homie Street. N:60  La     5555242424
1   1   London.  Stw St. N 8 St.bridge  2322325234243
2   2   ddd st. ss Sk. N 63 Manchester    32424244234
3   3        Mou st 147 Rochester Liv     33424245223
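
The replacement leaves a double space where the number used to be (visible in rows 0 and 1 of the output). Here is a minimal sketch of one way to tidy that up, assuming the same column names. The sample frame and the `pattern` variable are introduced here for illustration only, and the `\b` word boundaries are an extra assumption that the number stands alone as a token:

```python
import pandas as pd

df_payers = pd.DataFrame({
    "adress": [
        "Jame Homie Street. N:60 5555242424 La",
        "London. 2322325234243 Stw St. N 8 St.bridge",
        "32424244234 ddd st. ss Sk. N 63 Manchester",
        "Mou st 147 Rochester Liv 33424245223",
    ]
})

# Assumption: the number is a standalone run of at least 4 digits.
pattern = r"\b(\d{4,})\b"

# expand=False returns a Series, so it can be assigned to a single column.
df_payers["number"] = df_payers["adress"].str.extract(pattern, expand=False)

# Remove the number, then collapse the leftover whitespace and trim the ends.
df_payers["adress"] = (
    df_payers["adress"]
    .str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)
```

Note that `str.extract` keeps only the first match per row, while `str.replace` removes every match, so a row with two long numbers would lose the second one from both columns.
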
如此安好 2025-01-19 16:12:20

Split each address on whitespace and use a list comprehension to pick out the all-digit tokens longer than 3 characters. Change the length threshold there if you want a higher one.

import pandas as pd

df = pd.DataFrame({
    "adress":["Jame Homie Street. N:60 5555242424 La","London. 2322325234243 Stw St. N 8 St.bridge",
    "32424244234 ddd st. ss Sk. N 63 Manchester","Mou st 147 Rochester Liv 33424245223"],
})

cleanedAdress = []
numbers = []
for adress in df["adress"]:
    tempSplit = adress.split()
    # all-digit tokens longer than 3 characters are treated as the number
    numericEx = [s for s in tempSplit if s.isdigit() and len(s) > 3]
    numbers.append(' '.join(numericEx))

    # keep every other token in order; this also works for rows with no number
    cleanedAdress.append(' '.join(s for s in tempSplit if s not in numericEx))

dfCleaned = pd.DataFrame({"adress":cleanedAdress,"numbers":numbers})

dfCleaned

                           adress        numbers
0      Jame Homie Street. N:60 La     5555242424
1   London. Stw St. N 8 St.bridge  2322325234243
2  ddd st. ss Sk. N 63 Manchester    32424244234
3        Mou st 147 Rochester Liv    33424245223
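
This sketch wraps the same token-based idea in a function, so that rows with no long number, or with several, are handled explicitly. The name `split_adress` and the `min_len` parameter are hypothetical, introduced here only for illustration:

```python
def split_adress(adress, min_len=4):
    """Return (cleaned_adress, numbers) for a single address string."""

    def is_number(token):
        # A token counts as "the number" if it is all digits and long enough.
        return token.isdigit() and len(token) >= min_len

    tokens = adress.split()
    numbers = [t for t in tokens if is_number(t)]
    rest = [t for t in tokens if not is_number(t)]
    return " ".join(rest), numbers


cleaned, found = split_adress("Jame Homie Street. N:60 5555242424 La")
print(cleaned)  # Jame Homie Street. N:60 La
print(found)    # ['5555242424']
```

Returning the numbers as a list leaves it to the caller to decide what an empty or multi-element result should mean.
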
蹲墙角沉默 2025-01-19 16:12:20

If you know all the address patterns, you can use regular expressions to extract the values.

Since each line in the example you provided is formatted differently from the others, one thing you can do is rely on the length of the number: build a single regex that matches it, then strip that match from the rest of the address.

import re

raw_addrs = """0     Jame Homie Street. N:60 5555242424 La
1     London. 2322325234243 Stw St. N 8 St.bridge
2     32424244234 ddd st. ss Sk. N 63 Manchester
3     Mou st 147 Rochester Liv 33424245223""".split('\n')

# The end of this pattern is truncated in the source; the closing "$'" is assumed.
id_addrs_regex = r'^(?P<id>\d+)\s+(?P<addr>.*)$'
# Reconstructed step (missing from the source): match every raw line against the regex.
data = [re.match(id_addrs_regex, line) for line in raw_addrs]

id_addrs = [(match.group('id'), match.group('addr')) for match in data]
# the long number is the first run of 6 or more digits
number_re = r'\d{6,}'
numbers = [re.search(number_re, addr).group() for _, addr in id_addrs]
# drop the number from the address and normalise the remaining whitespace
output = [(id_addr[0], ' '.join(id_addr[1].replace(number, "").split()), number)
          for id_addr, number in zip(id_addrs, numbers)]

The output is:

[('0', 'Jame Homie Street. N:60 La', '5555242424'),
 ('1', 'London. Stw St. N 8 St.bridge', '2322325234243'),
 ('2', 'ddd st. ss Sk. N 63 Manchester', '32424244234'),
 ('3', 'Mou st 147 Rochester Liv', '33424245223')]
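
On the choice of minimum length: the threshold matters once an address contains a shorter number of its own. The sketch below compares a 4-digit and a 6-digit threshold on a made-up address (the string is hypothetical, not taken from the question's data):

```python
import re

# Hypothetical address with a 4-digit house number in front of the long number.
adress = "Main St. N 1234 5555242424 Leeds"

# re.search returns the first match, so a low threshold picks up the house number.
first_4plus = re.search(r"\d{4,}", adress).group()
first_6plus = re.search(r"\d{6,}", adress).group()

print(first_4plus)  # 1234
print(first_6plus)  # 5555242424
```

A higher threshold is safer against house numbers, but it will miss any target number shorter than the threshold.
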
