Python regex to parse a string and return a tuple

Posted on 2024-11-14 05:11:16 | 933 words | 2 views | 0 comments

I've been given some strings to work with. Each one represents a data set and consists of the data set's name and the associated statistics. They all have the following form:

s = "| 'TOMATOES_PICKED'       |   914 |   1397 |"

I'm trying to implement a function that will parse the string and return the name of the data set, the first number, and the second number. There are lots of these strings and each one has a different name and associated stats so I've figured the best way to do this is with regular expressions. Here's what I have so far:

def extract_data2(s):
    import re
    name = re.search("'(.*?)'", s).group(1)
    n1 = re.search('\|(.*)\|', s)
    return name, n1

So I've done a bit of reading on regular expressions and figured out how to return the name. For each of the strings that I'm working with, the name of the data set is bounded by ' ' so that's how I found the name. That part works fine. My problem is with getting the numbers.

What I'm thinking right now is to try to match a pattern that is preceded by a vertical bar (|), then anything (which is why I used .*), and followed by another vertical bar to try to get the first number. Does anyone know how I can do this in Python?

What I tried in the above code for the first number returns basically the whole string as my output, whereas I want to get just the number.

The idea is that it will be able to:

return name, n1, n2

so that when the user inputs a string, it can parse the string and return the important information. I've noticed in my attempts so far that the numbers come back as strings. Is there any way to return n1 or n2 as actual numbers? Note that for some of the strings, n1 and n2 could be either integers or decimals.

I am very new to programming so I apologize if this question seems rudimentary, but I have been reading and searching quite diligently for answers that are close to my case with no luck.

Comments (6)

打小就很酷 2024-11-21 05:11:16

I would use a single regular expression to match the entire line, with the parts I want in named groups ((?P<name>exampl*e)).

import re
def extract_data2(s):
    pattern = re.compile(r"""\|\s*                 # opening bar and whitespace
                             '(?P<name>.*?)'       # quoted name
                             \s*\|\s*(?P<n1>.*?)   # whitespace, next bar, n1
                             \s*\|\s*(?P<n2>.*?)   # whitespace, next bar, n2
                             \s*\|""", re.VERBOSE)
    match = pattern.match(s)
    
    name = match.group("name")
    n1 = float(match.group("n1"))
    n2 = float(match.group("n2"))
    
    return (name, n1, n2)

To convert n1 and n2 from strings to numbers, I use the float function. (If they were only integers, I would use the int function.)

I used the re.VERBOSE flag and raw multiline strings (r"""...""") to make the regex easier to read.
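
The question mentions that the numbers may be integers or decimals. Here is a minimal sketch of one way to keep integers as int and fall back to float only when needed, assuming the same line layout as above. The names LINE_RE, to_number and extract_data are made up for this illustration, and raising ValueError on a non-matching line is my own choice, not something the answer above does.

```python
import re

# Same idea as the pattern above: one regex for the whole line, named groups.
LINE_RE = re.compile(r"""\|\s*'(?P<name>[^']*)'   # opening bar, quoted name
                         \s*\|\s*(?P<n1>[^|\s]+)  # next bar, first number
                         \s*\|\s*(?P<n2>[^|\s]+)  # next bar, second number
                         \s*\|""", re.VERBOSE)


def to_number(text):
    """Return an int when the text is a plain integer, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def extract_data(s):
    match = LINE_RE.match(s)
    if match is None:
        # Assumption: a malformed line should be reported, not silently skipped.
        raise ValueError("line does not match the expected layout: %r" % s)
    return (match.group("name"),
            to_number(match.group("n1")),
            to_number(match.group("n2")))


print(extract_data("| 'TOMATOES_PICKED'       |   914 |   1397 |"))
```

Checking for a failed match first avoids the AttributeError that calling .group() on None would otherwise raise.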

凌乱心跳 2024-11-21 05:11:16

Using regex:

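#!/usr/bin/env python3

import re

tests = [
"| 'TOMATOES_PICKED'                                  |       914 |       1397 |",
"| 'TOMATOES_FLICKED'                                 |     32914 |       1123 |",
"| 'TOMATOES_RIGGED'                                  |        14 |       1343 |",
"| 'TOMATOES_PICKELED'                                |         4 |         23 |"]

def parse(s):
    # Raw string, so the backslashes reach the regex engine unchanged.
    # \d+(?:\.\d+)? also accepts decimals, which the question says may occur.
    mo = re.match(r"\|\s*'([^']*)'\s*\|\s*(\d+(?:\.\d+)?)\s*\|\s*(\d+(?:\.\d+)?)\s*\|", s)
    if mo:
        return mo.groups()  # all three values are returned as strings

for test in tests:
    print(parse(test))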

A君 2024-11-21 05:11:16

Try using split.

s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
print(list(map(lambda x: x.strip("' "), s.split('|')))[1:-1])
  • split: turns the string into a list of strings
  • lambda function: strips spaces and ' from each piece
  • slice [1:-1]: keeps only the expected parts (drops the empty strings before the first bar and after the last one)
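
To get the tuple the question asks for, the split idea can be wrapped in a small function. This is only a sketch under the assumption that every line has exactly the name column followed by two numeric columns. The function name parse_line is made up here, and float is used for both values because the question says decimals may occur.

```python
def parse_line(s):
    # Splitting on the bars leaves an empty piece before the first bar and
    # after the last one, so [1:-1] keeps just the three real columns.
    fields = [part.strip() for part in s.split("|")[1:-1]]
    name = fields[0].strip("'")                  # drop the surrounding quotes
    n1, n2 = (float(x) for x in fields[1:3])     # convert both numbers
    return name, n1, n2


print(parse_line("| 'TOMATOES_PICKED'       |   914 |   1397 |"))
```

Because the split is on the bar rather than on whitespace, a name containing spaces would still end up in a single field.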

远山浅 2024-11-21 05:11:16

I'm not sure I've understood you correctly, but try this:

import re

print(re.findall(r'\b\w+\b', yourtext))
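
One caveat worth illustrating: \w does not match a decimal point, so a value such as 3.5 is split into two tokens. The sketch below compares the pattern above with a variant that also allows dots. The variant pattern and the function name find_tokens are my own additions, and the sample line is invented for the demonstration.

```python
import re


def find_tokens(s):
    # Hypothetical variant: allow dots inside a token so decimals stay whole.
    return re.findall(r"[\w.]+", s)


line = "| 'A_B' | 3.5 | 2 |"
print(re.findall(r"\b\w+\b", line))   # the decimal is split at the dot
print(find_tokens(line))              # the decimal is kept as one token
```

Both versions return strings, so the numbers still need int() or float() afterwards.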

快乐很简单 2024-11-21 05:11:16

I agree with the other posters who said to use the split() method on your strings. If your given string is

>>> s = "| 'TOMATOES_PICKED'                          |       914 |       1397 |"

you just split the string on whitespace and, voilà, you have a list with the name in the second position and the two values at positions four and six, i.e.

>>> s_new = s.split()
>>> s_new
['|', "'TOMATOES_PICKED'", '|', '914', '|', '1397', '|']

Of course, the list also contains the "|" characters, but they appear consistently in your data set, so they aren't a big problem to deal with. Just ignore them.
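
A minimal sketch of "just ignore them", assuming the name never contains whitespace (a whitespace split would break such a name into several tokens). The function name parse_tokens and the ValueError check are assumptions added for illustration.

```python
def parse_tokens(s):
    # Split on whitespace and throw away the bar tokens.
    tokens = [t for t in s.split() if t != "|"]
    if len(tokens) != 3:
        # Assumption: anything other than name + two numbers is an error.
        raise ValueError("expected a name and two numbers, got %r" % tokens)
    name, n1, n2 = tokens
    return name.strip("'"), float(n1), float(n2)


print(parse_tokens("| 'TOMATOES_PICKED'       |   914 |   1397 |"))
```

The length check is what catches a name with a space in it, since that line would produce four tokens instead of three.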

耳根太软 2024-11-21 05:11:16

With pyparsing, you can have the parser create a dict-like structure for you, using the first column values as the keys, and the subsequent values as an array of values for that key:

>>> from pyparsing import *
>>> s = "| 'TOMATOES_PICKED'                                  |       914 |       1397 |"
>>> VERT = Suppress('|')
>>> title = quotedString.setParseAction(removeQuotes)
>>> integer = Word(nums).setParseAction(lambda tokens:int(tokens[0]))
>>> entry = Group(VERT + title + VERT + integer + VERT + integer + VERT)
>>> entries = Dict(OneOrMore(entry))
>>> data = entries.parseString(s)
>>> data.keys()
['TOMATOES_PICKED']
>>> data['TOMATOES_PICKED']
([914, 1397], {})
>>> data['TOMATOES_PICKED'].asList()
[914, 1397]
>>> data['TOMATOES_PICKED'][0]
914
>>> data['TOMATOES_PICKED'][1]
1397

This already handles multiple entries, so you can just pass it a single multiline string containing all of your data rows, and a single keyed data structure will be built for you.
(Processing this kind of pipe-delimited tabular data was one of the earliest applications I had for pyparsing.)
