Retrieving a string from HTML in a non-unique table

Posted on 2024-11-08 03:11:01 · 416 words · 5 views · 0 comments

Here is the html I am trying to parse.

<TD>Serial Number</TD><TD>AB12345678</TD>

I am attempting to use a regex to parse the data. I have heard about BeautifulSoup, but there are around 50 items like this on the page, all using the same table parameters, and none of them have ID numbers. The closest thing they have to a unique identifier is the data in the cell before the data I need.

serialNumber = re.search("Serial Number</td><td>\n(.*?)</td>", source)

Source is simply the source code of the page, grabbed using urllib. There is a newline in the HTML between the second <TD> and the serial number, but I am unsure whether that matters.

找个人就嫁了吧 2024-11-15 03:11:01

Pyparsing can give you a little more robust extractor for your data:

from pyparsing import makeHTMLTags, Word, alphanums

htmlfrag = """<blah></blah><TD>Serial Number</TD><TD>
            AB12345678
            </TD><stuff></stuff>"""

# makeHTMLTags returns matchers for the opening and the closing tag
td, tdEnd = makeHTMLTags("td")

# a label cell followed by a value cell; the value is stored under the name 'serialNumber'
sernoFormat = (td + "Serial Number" + tdEnd +
               td + Word(alphanums)('serialNumber') + tdEnd)

for sernoData in sernoFormat.searchString(htmlfrag):
    print(sernoData.serialNumber)

Prints:

AB12345678

Note that pyparsing doesn't care where the extra whitespace falls, and it also handles unexpected attributes that might crop up in the defined tags, whitespace inside tags, tags in upper/lower case, etc.
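If installing a third-party package is not an option, a similar tolerance for tag case, attributes, and whitespace can be approximated with the standard library's html.parser. This is only a sketch of a swapped-in technique, not the pyparsing approach above: the names CellCollector and value_after are hypothetical helpers, the width attribute was added to the sample purely for illustration, and the code assumes the value always sits in the cell immediately after the label cell.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TD> and <td> both arrive as "td".
        if tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Accumulate text only while inside a cell.
        if self._in_cell:
            self.cells[-1] += data


def value_after(label, html):
    """Return the text of the cell that follows the cell containing `label`."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    cells = [c.strip() for c in parser.cells]
    for current, following in zip(cells, cells[1:]):
        if current == label:
            return following
    return None  # label not found


htmlfrag = """<blah></blah><TD>Serial Number</TD><TD width="80">
            AB12345678
            </TD><stuff></stuff>"""

print(value_after("Serial Number", htmlfrag))  # prints AB12345678
```

Because the lookup is keyed on the label text, the same helper could be called with any of the other labels on the page.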

往日情怀 2024-11-15 03:11:01

In most cases it is better to work on HTML with an appropriate parser, but in some cases it is perfectly OK to use regular expressions for the job. I do not know enough about your task to judge whether a regex is a good solution or whether it is better to go with @Paul's solution, but here is my attempt to fix your regex:

serialNumber = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I)

I removed the \n because line endings vary (\n, \r, \r\n, ...), which makes matching them explicitly fragile. Instead I used the option re.S (DOTALL), so that . also matches newlines.

Be aware, though, that any newline is now part of your capturing group, so you should strip whitespace from the result afterwards.

Another problem with your regex is that the string contains <TD> but you search for <td>. That is what the option re.I (IGNORECASE) is for.
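Putting these points together, a minimal sketch might look like the following. The sample string is made up to mirror the question (upper-case tags and a line break before the value), and the variable name serial_number is my own. Note that re.search returns None when nothing matches, so the result is checked before the group is read.

```python
import re

# Sample markup modelled on the question: upper-case tags and a
# Windows-style line break before and after the value.
source = "<TR><TD>Serial Number</TD><TD>\r\n    AB12345678\r\n</TD></TR>"

# re.S lets "." match newlines; re.I makes <td> match <TD> as well.
match = re.search(r"Serial Number</td><td>(.*?)</td>", source, re.S | re.I)

# The captured group still contains the surrounding whitespace,
# so strip it; guard against the no-match case.
serial_number = match.group(1).strip() if match else None

print(serial_number)  # prints AB12345678
```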

You can find more explanations about regular expressions on docs.python.org.
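Since the question mentions around 50 rows where the label cell precedes the value cell, the same flags could be used to collect every pair at once. This is a sketch under assumptions, not a general HTML parser: it assumes each row holds exactly one label cell followed by one value cell, the function name extract_pairs is hypothetical, and the "Model" row is invented sample data.

```python
import re


def extract_pairs(source):
    """Return a dict mapping each label cell to the value cell that follows it."""
    # Group 1 is the label, group 2 the value.
    # [^>]* tolerates attributes in the opening tag; \s* tolerates
    # whitespace between the two cells.
    pattern = re.compile(
        r"<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>",
        re.S | re.I,
    )
    return {label.strip(): value.strip()
            for label, value in pattern.findall(source)}


sample = """
<TR><TD>Serial Number</TD><TD>
AB12345678
</TD></TR>
<TR><TD>Model</TD><TD class="v">X-100</TD></TR>
"""

fields = extract_pairs(sample)
print(fields["Serial Number"])  # prints AB12345678
```

If rows can contain more than two cells, the pairs may misalign, and a real parser is the safer choice.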

About the author

装纯掩盖桑
