sscanf in Python

Posted on 2024-08-20 00:11:11

I'm looking for an equivalent of sscanf() in Python. I want to parse /proc/net/* files; in C I could do something like this:

int matches = sscanf(
        buffer,
        "%*d: %64[0-9A-Fa-f]:%X %64[0-9A-Fa-f]:%X %*X %*X:%*X %*X:%*X %*X %*d %*d %ld %*512s\n",
        local_addr, &local_port, rem_addr, &rem_port, &inode);

At first I thought of using str.split, but it doesn't split on each of the given characters; it treats the sep string as a whole:

>>> import string
>>> lines = open("/proc/net/dev").readlines()
>>> for l in lines[2:]:
...     cols = l.split(string.whitespace + ":")
...     print(len(cols))
...
1

This should be returning 17 for each line.

Is there a Python equivalent to sscanf (not RE), or a string splitting function in the standard library that splits on any of a range of characters that I'm not aware of?

Comments (9)

把人绕傻吧 2024-08-27 00:11:11

There is also the parse module (a third-party package, not part of the standard library).

parse() is designed to be the opposite of format() (the newer string formatting function in Python 2.6 and higher).

>>> from parse import parse
>>> parse('{} fish', '1')
>>> parse('{} fish', '1 fish')
<Result ('1',) {}>
>>> parse('{} fish', '2 fish')
<Result ('2',) {}>
>>> parse('{} fish', 'red fish')
<Result ('red',) {}>
>>> parse('{} fish', 'blue fish')
<Result ('blue',) {}>
心舞飞扬 2024-08-27 00:11:11

When I'm in a C mood, I usually use zip and list comprehensions for scanf-like behavior. Like this:

text = '1 3.0 false hello'
(a, b, c, d) = [t(s) for t, s in zip((int, float, strtobool, str), text.split())]
print(a, b, c, d)

Note that for more complex format strings, you do need to use regular expressions:

import re
text = '1:3.0 false,hello'
(a, b, c, d) = [t(s) for t, s in zip((int, float, strtobool, str),
                                     re.search(r'^(\d+):([\d.]+) (\w+),(\w+)$', text).groups())]
print(a, b, c, d)

Note also that you need conversion functions for all types you want to convert. For example, above I used something like:

# must be defined before the snippets above are run
strtobool = lambda s: {'true': True, 'false': False}[s]
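The zip-and-convert idea in this answer can be wrapped into a small reusable helper. The following is only a minimal sketch, not code from the answer: the names scan and strtobool are hypothetical, and it assumes the regular expression has exactly one capture group per converter.

```python
import re


def strtobool(s):
    """Convert the literal strings 'true'/'false' to bool (illustrative helper)."""
    return {'true': True, 'false': False}[s]


def scan(pattern, text, converters):
    """Match pattern against text and convert each captured group.

    Returns a tuple of converted values, or None if text does not match.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    if len(groups) != len(converters):
        raise ValueError("pattern must have one capture group per converter")
    return tuple(conv(group) for conv, group in zip(converters, groups))


result = scan(r'^(\d+):([\d.]+) (\w+),(\w+)$',
              '1:3.0 false,hello',
              (int, float, strtobool, str))
print(result)  # (1, 3.0, False, 'hello')
```

Returning None on a failed match, rather than raising, loosely mirrors checking the return value of sscanf in C; that choice is a design assumption of this sketch.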

这样的小城市 2024-08-27 00:11:11

Python doesn't have an sscanf equivalent built-in, and most of the time it actually makes a whole lot more sense to parse the input by working with the string directly, using regexps, or using a parsing tool.

Probably mostly useful for translating C, people have implemented sscanf, such as in this module: http://hkn.eecs.berkeley.edu/~dyoo/python/scanf/

In this particular case if you just want to split the data based on multiple split characters, re.split is really the right tool.
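To illustrate how an sscanf-like function can be layered on top of re, here is a rough sketch that translates a tiny subset of format conversions (%d, %x, %X, %f, %s and the %* suppression flag) into a regular expression. It is not the API of the module linked above; the names sscanf, _CONVERSIONS and _TOKEN are made up for this example. Unlike C, it returns None instead of a partial match count when the input does not fit the format.

```python
import re

# Hypothetical mapping from a few scanf conversions to (regex, converter).
_CONVERSIONS = {
    'd': (r'[-+]?\d+', int),
    'x': (r'[0-9A-Fa-f]+', lambda s: int(s, 16)),
    'X': (r'[0-9A-Fa-f]+', lambda s: int(s, 16)),
    'f': (r'[-+]?(?:\d+\.?\d*|\.\d+)', float),
    's': (r'\S+', str),
}

# A format is a sequence of conversions, whitespace runs, and literal text.
_TOKEN = re.compile(r'%(\*?)([dxXfs])|(\s+)|([^%\s]+)')


def sscanf(text, fmt):
    """Return a tuple of converted values, or None if text does not match fmt."""
    regex_parts = []
    converters = []
    pos = 0
    while pos < len(fmt):
        m = _TOKEN.match(fmt, pos)
        if m is None:
            raise ValueError("unsupported format at position %d" % pos)
        suppress, conv, space, literal = m.groups()
        if conv:
            pattern, func = _CONVERSIONS[conv]
            # Like scanf, skip leading whitespace before a conversion.
            if suppress:
                regex_parts.append(r'\s*(?:%s)' % pattern)
            else:
                regex_parts.append(r'\s*(%s)' % pattern)
                converters.append(func)
        elif space:
            # Whitespace in the format matches any amount of whitespace.
            regex_parts.append(r'\s*')
        else:
            regex_parts.append(re.escape(literal))
        pos = m.end()
    match = re.match(''.join(regex_parts), text)
    if match is None:
        return None
    return tuple(func(group) for func, group in zip(converters, match.groups()))


print(sscanf("   0: 00000000:0203 00000000:0000 0A", "%*d: %x:%x %x:%x %*x"))
# (0, 515, 0, 0)
```

Field widths, character classes such as %64[0-9A-Fa-f], and length modifiers such as %ld are deliberately left out; a format that uses them raises ValueError in this sketch.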

咆哮 2024-08-27 00:11:11

You can split on a range of characters using the re module.

>>> import re
>>> r = re.compile('[ \t\n\r:]+')
>>> r.split("abc:def  ghi")
['abc', 'def', 'ghi']
呢古 2024-08-27 00:11:11

You can parse with the re module using named groups. It won't convert the substrings to their actual data types (e.g. int), but it's very convenient when parsing strings.

Given this sample line from /proc/net/tcp:

line="   0: 00000000:0203 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 335 1 c1674320 300 0 0 0"

An example mimicking your sscanf example, with the same variable names, could be:

import re
hex_digit_pattern = r"[\dA-Fa-f]"
pat = r"\d+: " + \
      r"(?P<local_addr>HEX+):(?P<local_port>HEX+) " + \
      r"(?P<rem_addr>HEX+):(?P<rem_port>HEX+) " + \
      r"HEX+ HEX+:HEX+ HEX+:HEX+ HEX+ +\d+ +\d+ " + \
      r"(?P<inode>\d+)"
pat = pat.replace("HEX", hex_digit_pattern)

values = re.search(pat, line).groupdict()

from pprint import pprint; pprint(values)
# prints:
# {'inode': '335',
#  'local_addr': '00000000',
#  'local_port': '0203',
#  'rem_addr': '00000000',
#  'rem_port': '0000'}
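Since the named groups come back as strings, a short post-processing step can convert them. The sketch below reuses the pattern from this answer and converts the hexadecimal ports and the decimal inode to int; the function name parse_tcp_line is hypothetical, and the addresses are left as hex strings because decoding them depends on the host byte order.

```python
import re

HEX = r"[0-9A-Fa-f]+"
TEMPLATE = (
    r"\d+: "
    r"(?P<local_addr>HEX):(?P<local_port>HEX) "
    r"(?P<rem_addr>HEX):(?P<rem_port>HEX) "
    r"HEX HEX:HEX HEX:HEX HEX +\d+ +\d+ "
    r"(?P<inode>\d+)"
)
PATTERN = re.compile(TEMPLATE.replace("HEX", HEX))


def parse_tcp_line(line):
    """Return a dict with ports and inode as ints, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["local_port"] = int(fields["local_port"], 16)  # ports are hexadecimal
    fields["rem_port"] = int(fields["rem_port"], 16)
    fields["inode"] = int(fields["inode"])                # inode is decimal
    return fields


line = ("   0: 00000000:0203 00000000:0000 0A 00000000:00000000 "
        "00:00000000 00000000     0        0 335 1 c1674320 300 0 0 0")
fields = parse_tcp_line(line)
print(fields["local_port"], fields["rem_port"], fields["inode"])  # 515 0 335
```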
迷途知返 2024-08-27 00:11:11

There is an example in the official Python docs about how to use sscanf from libc:

# import libc
import os
from ctypes import CDLL, byref, c_float, c_int, cdll, create_string_buffer

if os.name == "nt":
    libc = cdll.msvcrt
else:
    # assuming a Linux-like environment where the C library is libc.so.6
    libc = CDLL("libc.so.6")  # equivalent to cdll.LoadLibrary("libc.so.6")

# allocate vars
i = c_int()
f = c_float()
s = create_string_buffer(b'\000' * 32)

# parse with sscanf (both the input and the format must be bytes in Python 3)
libc.sscanf(b"1 3.14 Hello", b"%d %f %s", byref(i), byref(f), s)

# read the parsed values
i.value  # 1
f.value  # approximately 3.14 (c_float is single precision)
s.value  # b'Hello'
想你的星星会说话 2024-08-27 00:11:11

You can turn the ":" into a space and then split, e.g.:

>>> f = open("/proc/net/dev")
>>> for line in f:
...     line = line.replace(":", " ").split()
...     print(len(line))

No regex is needed (for this case).
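The same replace-then-split approach can be tried on a sample string without reading /proc. This is a small sketch; the sample line and its counter values are made up for illustration, laid out like a data row of /proc/net/dev (interface name, a colon, then 16 counters).

```python
# Made-up sample row: interface name, ':' and 16 counters.
sample = "  eth0:1500 10 0 0 0 0 0 0 2500 12 0 0 0 0 0 0"

# Turn the colon into whitespace, then split on runs of whitespace.
cols = sample.replace(":", " ").split()

iface = cols[0]
counters = [int(c) for c in cols[1:]]

print(len(cols))            # 17
print(iface, counters[:2])  # eth0 [1500, 10]
```

Replacing the colon first means the split also works when the interface name and the first counter are not separated by a space, as in the sample above.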

时光瘦了 2024-08-27 00:11:11

You could install pandas and use pandas.read_fwf for fixed width format files. Example using /proc/net/arp:

In [230]: df = pandas.read_fwf("/proc/net/arp")

In [231]: print(df)
       IP address HW type Flags         HW address Mask Device
0   141.38.28.115     0x1   0x2  84:2b:2b:ad:e1:f4    *   eth0
1   141.38.28.203     0x1   0x2  c4:34:6b:5b:e4:7d    *   eth0
2   141.38.28.140     0x1   0x2  00:19:99:ce:00:19    *   eth0
3   141.38.28.202     0x1   0x2  90:1b:0e:14:a1:e3    *   eth0
4    141.38.28.17     0x1   0x2  90:1b:0e:1a:4b:41    *   eth0
5    141.38.28.60     0x1   0x2  00:19:99:cc:aa:58    *   eth0
6   141.38.28.233     0x1   0x2  90:1b:0e:8d:7a:c9    *   eth0
7    141.38.28.55     0x1   0x2  00:19:99:cc:ab:00    *   eth0
8   141.38.28.224     0x1   0x2  90:1b:0e:8d:7a:e2    *   eth0
9   141.38.28.148     0x1   0x0  4c:52:62:a8:08:2c    *   eth0
10  141.38.28.179     0x1   0x2  90:1b:0e:1a:4b:50    *   eth0

In [232]: df["HW address"]
Out[232]:
0     84:2b:2b:ad:e1:f4
1     c4:34:6b:5b:e4:7d
2     00:19:99:ce:00:19
3     90:1b:0e:14:a1:e3
4     90:1b:0e:1a:4b:41
5     00:19:99:cc:aa:58
6     90:1b:0e:8d:7a:c9
7     00:19:99:cc:ab:00
8     90:1b:0e:8d:7a:e2
9     4c:52:62:a8:08:2c
10    90:1b:0e:1a:4b:50

In [233]: df["HW address"][5]
Out[233]: '00:19:99:cc:aa:58'

By default it tries to figure out the format automagically, but there are options you can give for more explicit instructions (see documentation). There are also other IO routines in pandas that are powerful for other file formats.

最冷一天 2024-08-27 00:11:11

If the separators are ':', you can split on ':', and then use x.strip() on the strings to get rid of any leading or trailing whitespace. int() will ignore the spaces.
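A quick sketch of that suggestion, using made-up input strings: int() accepts surrounding whitespace, while text fields need an explicit strip().

```python
# Numeric fields: int() tolerates leading and trailing whitespace.
line = "  12 :  34:56  "
numbers = [int(part) for part in line.split(":")]
print(numbers)  # [12, 34, 56]

# Text fields: strip the whitespace explicitly.
labels = [part.strip() for part in " eth0 : up ".split(":")]
print(labels)  # ['eth0', 'up']
```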
