Python regular expression problem

Posted on 2024-11-06 19:46:21

I'm trying to extract ALL phone screen resolutions from the WURFL XML file with the below Python script. The problem is that I only get the first match, though. Why? How could I get all matches?

The WURFL XML file can be found at http://sourceforge.net/projects/wurfl/files/WURFL/latest/wurfl-latest.zip/download?use_mirror=freefr

def read_file(file_name):
    f = open(file_name, 'rb')
    data = f.read()
    f.close()
    return data

text = read_file('wurfl.xml')

import re
pattern = '<device id="(.*?)".*actual_device_root="true">.*<capability name="resolution_width" value="(\d+)"/>.*<capability name="resolution_height" value="(\d+)"/>.*</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)

Comments (4)

揽月 2024-11-13 19:46:21

First, use an XML parser instead of regular expressions. You'll be happier in the long run.

Second, if you insist on using regexes, use finditer() instead of findall().

Third, your regex matches from the first entry to the last one (the .* is greedy, and you have set DOTALL mode), so either see the first paragraph or at least change your regex to

pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'

Also, always use raw strings with regexes. \d happens to work, \b will behave unexpectedly in a "normal" string, though.
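
To make the greedy versus non-greedy difference concrete, here is a minimal sketch run against a made-up two-device sample rather than the real WURFL file. The `SAMPLE` text and the `resolutions` helper are assumptions introduced for illustration; the two patterns are the ones from the question and from this answer.

```python
import re

# Hypothetical two-device sample, trimmed to what the pattern needs.
SAMPLE = """
<device id="phone_a" user_agent="A" actual_device_root="true">
  <capability name="resolution_width" value="240"/>
  <capability name="resolution_height" value="320"/>
</device>
<device id="phone_b" user_agent="B" actual_device_root="true">
  <capability name="resolution_width" value="480"/>
  <capability name="resolution_height" value="800"/>
</device>
"""

# Pattern from the question: every .* is greedy.
GREEDY = (r'<device id="(.*?)".*actual_device_root="true">'
          r'.*<capability name="resolution_width" value="(\d+)"/>'
          r'.*<capability name="resolution_height" value="(\d+)"/>'
          r'.*</device>')

# Pattern from this answer: every .* replaced by the non-greedy .*?
LAZY = (r'<device id="(.*?)".*?actual_device_root="true">'
        r'.*?<capability name="resolution_width" value="(\d+)"/>'
        r'.*?<capability name="resolution_height" value="(\d+)"/>'
        r'.*?</device>')


def resolutions(pattern, text):
    """Return one (id, width, height) tuple per match of pattern."""
    return [m.groups() for m in re.finditer(pattern, text, re.DOTALL)]


greedy_hits = resolutions(GREEDY, SAMPLE)
lazy_hits = resolutions(LAZY, SAMPLE)

print(greedy_hits)  # a single match spanning both devices
print(lazy_hits)    # one match per device

# In a normal string '\b' is a single backspace character;
# in a raw string it stays as the two-character regex word boundary.
print(len('\b'), len(r'\b'))
```

With the greedy pattern the single match pairs the first device id with the last device's resolution, which is why only one (wrong) result appears. Note that the non-greedy pattern can still run past a device that lacks one of the capabilities and pick up the next device's values, which is one more reason to prefer an XML parser.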

橙幽之幻 2024-11-13 19:46:21

This is an oddness in the behaviour of findall, specifically findall only returns the first matching group from each pattern match. See this question.
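
As a quick check on how `findall` treats groups: the `re` documentation says it returns whole matches when the pattern has no group, the group's text when there is exactly one group, and a tuple of all groups when there are several. The snippet below is a small sketch on a made-up string (`text` and the variable names are illustrative only), so group handling on its own should not drop matches.

```python
import re

# Made-up input with two width/height pairs.
text = 'w=240 h=320; w=480 h=800'

no_group = re.findall(r'w=\d+', text)              # whole matches
one_group = re.findall(r'w=(\d+)', text)           # contents of the single group
two_groups = re.findall(r'w=(\d+) h=(\d+)', text)  # tuples of all groups

print(no_group)
print(one_group)
print(two_groups)
```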

墨小墨 2024-11-13 19:46:21

You are using "greedy" matches: .* matches as much text as it can grab, which means the .* before the first <capability ...> tag swallows most of the file. Use the non-greedy .*? instead:

import re

with open('wurfl.xml') as f:
    text = f.read()
pattern = r'<device id="(.*?)".*?actual_device_root="true">.*?<capability name="resolution_width" value="(\d+)"/>.*?<capability name="resolution_height" value="(\d+)"/>.*?</device>'
for m in re.findall(pattern, text, re.DOTALL):
    print(m)

初见你 2024-11-13 19:46:21

I'm certainly not averse to handling xml with a regexp if the requirements are simple, but perhaps in this case using a real xml parser would be better. Using the stdlib etree module and a sprinkling of (imho) hideous xpaths:

import xml.etree.ElementTree as ET

def capability_value(cap_elem):
    # find() returns None when a device does not define the capability.
    if cap_elem is None:
        return None
    return int(cap_elem.attrib.get('value'))

def devices(wurfl_doc):
    # Use a path relative to the root element; a leading "/" triggers
    # a FutureWarning in ElementTree.findall().
    for el in wurfl_doc.findall("./devices/device[@actual_device_root='true']"):
        width = el.find("./group[@id='display']/capability[@name='resolution_width']")
        width = capability_value(width)
        height = el.find("./group[@id='display']/capability[@name='resolution_height']")
        height = capability_value(height)
        device = {
            'id': el.attrib.get('id'),
            'resolution': {'width': width, 'height': height}
        }
        yield device

doc = ET.ElementTree(file='wurfl.xml')
for device in devices(doc):
    print(device)
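
To try the same XPath queries without downloading the WURFL file, here is a small self-contained sketch. The inline `SAMPLE` document is an assumption about the file's layout (a `wurfl` root containing `devices`, `device`, `group` and `capability` elements), inferred from the paths above, and the `resolution` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real file is far larger.
SAMPLE = """<wurfl>
  <devices>
    <device id="generic" user_agent="" fall_back="root">
      <group id="display">
        <capability name="resolution_width" value="90"/>
        <capability name="resolution_height" value="40"/>
      </group>
    </device>
    <device id="phone_a" user_agent="A" fall_back="generic" actual_device_root="true">
      <group id="display">
        <capability name="resolution_width" value="240"/>
        <capability name="resolution_height" value="320"/>
      </group>
    </device>
    <device id="phone_b" user_agent="B" fall_back="generic" actual_device_root="true">
      <group id="display">
        <capability name="resolution_width" value="480"/>
      </group>
    </device>
  </devices>
</wurfl>"""


def resolution(device, name):
    """Return the named display capability as an int, or None if absent."""
    cap = device.find("./group[@id='display']/capability[@name='%s']" % name)
    return None if cap is None else int(cap.get('value'))


root = ET.fromstring(SAMPLE)
found = {
    d.get('id'): (resolution(d, 'resolution_width'),
                  resolution(d, 'resolution_height'))
    for d in root.findall("./devices/device[@actual_device_root='true']")
}
print(found)
```

The device without `actual_device_root` is skipped, and the device with no height reports `None` instead of borrowing a value from a neighbouring device, which is the failure mode a regex over the raw text is prone to.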