Regular expression doesn't match
I've got a string with very unclean HTML. Before I parse it, I want to convert this:
<TABLE><TR><TD width="33%" nowrap=1><font size="1" face="Arial">
NE
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
DEK
</font> </TD>
<TD width="33%" nowrap=1><font size="1" face="Arial">
143
</font> </TD>
</TR></TABLE>
into NE DEK 143
so it is a bit easier to parse. I've got this regular expression (RegexKitLite):
NSString *str = [dataString stringByReplacingOccurrencesOfRegex:@"<TABLE><TR><TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<TD width=\"33%\" nowrap=1><font size=\"1\" face=\"Arial\">(.+?)<\\/font> <\\/TD>(.+?)<\\/TR><\\/TABLE>"
withString:@"$1 $3 $5"];
I'm not an expert in regex. Can someone help me out here?
Regards, dodo
Comments (3)
Amarghosh and bobince (the winning answerer of the linked question) are generally right about this. However, since you are just sanitising, regexps are actually fine.
First, strip the tags.
Then collapse all extra whitespace into a single space.
Then remove leading and trailing whitespace.
Then get the values.
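The four steps above can be sketched as follows. This is only an illustration, not the code from the original answer: it assumes Foundation's built-in regex support (NSRegularExpressionSearch, available from iOS 3.2 / iOS 4) rather than RegexKitLite, and the helper name ExtractCellValues is hypothetical. It also assumes that no value contains a space of its own, since the last step splits on spaces.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (name is illustrative): applies the four cleanup steps
// with Foundation's regex search option instead of RegexKitLite.
static NSArray *ExtractCellValues(NSString *html) {
    // 1. Strip the tags, leaving a space where each tag was.
    NSString *text = [html stringByReplacingOccurrencesOfString:@"<[^>]*>"
                                                     withString:@" "
                                                        options:NSRegularExpressionSearch
                                                          range:NSMakeRange(0, html.length)];

    // 2. Collapse every run of whitespace (newlines included) into one space.
    text = [text stringByReplacingOccurrencesOfString:@"\\s+"
                                           withString:@" "
                                              options:NSRegularExpressionSearch
                                                range:NSMakeRange(0, text.length)];

    // 3. Remove leading and trailing whitespace.
    text = [text stringByTrimmingCharactersInSet:
                     [NSCharacterSet whitespaceAndNewlineCharacterSet]];

    // 4. Split on the remaining single spaces to get the values.
    if (text.length == 0) {
        return [NSArray array];
    }
    return [text componentsSeparatedByString:@" "];
}
```

Calling ExtractCellValues on the table from the question and joining the result with [values componentsJoinedByString:@" "] should give the single string the question asks for. If a cell could contain spaces, stop after step 3 and split differently.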

I have a few suspicions about why your regex might fail (without knowing the rules for string escaping in the iPhone SDK): the dot (.) is used in places where it would have to match newlines, the slash looks like it is escaped unnecessarily, and so on. But in your example, the text you're trying to extract is characterized by not being surrounded by tags.
So a search for all occurrences of
(?m)^[^<>\r\n]+$
should find all matches.
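A minimal sketch of that search, assuming Foundation's NSRegularExpression (iOS 4 and later) instead of RegexKitLite. The helper name LinesWithoutTags is hypothetical, and the pattern is written with a + quantifier so that it matches whole lines rather than single characters.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every non-empty line that contains
// neither '<' nor '>', i.e. the lines that hold only text.
static NSArray *LinesWithoutTags(NSString *html) {
    NSError *error = nil;
    // (?m) makes ^ and $ match at the start and end of each line.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(?m)^[^<>\\r\\n]+$"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return [NSArray array];  // the pattern failed to compile
    }

    NSMutableArray *lines = [NSMutableArray array];
    NSArray *matches = [regex matchesInString:html
                                      options:0
                                        range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        [lines addObject:[html substringWithRange:match.range]];
    }
    return lines;
}
```

This only works while each value sits on a line of its own, as in the sample. A line made up solely of spaces would also match, so trimming or filtering the results may still be needed.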

If you are sure of your HTML hierarchy, you can just extract the text enclosed by the font tags.
The result will be the text between the font tags, without whitespace at its edges.
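The code that belonged to this answer did not survive, so the following is only a sketch of the idea under stated assumptions: it uses Foundation's NSRegularExpression (iOS 4 and later) rather than RegexKitLite, the helper name FontTagContents is hypothetical, and font tags are assumed not to be nested.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text inside each <font>...</font> pair,
// with surrounding whitespace excluded by the pattern itself.
static NSArray *FontTagContents(NSString *html) {
    NSError *error = nil;
    // Group 1 is the content; the \s* on both sides swallows the
    // whitespace at the edges so it is not part of the capture.
    NSRegularExpression *regex = [NSRegularExpression
        regularExpressionWithPattern:@"<font[^>]*>\\s*(.*?)\\s*</font>"
                             options:(NSRegularExpressionCaseInsensitive |
                                      NSRegularExpressionDotMatchesLineSeparators)
                               error:&error];
    if (regex == nil) {
        return [NSArray array];  // the pattern failed to compile
    }

    NSMutableArray *values = [NSMutableArray array];
    NSArray *matches = [regex matchesInString:html
                                      options:0
                                        range:NSMakeRange(0, html.length)];
    for (NSTextCheckingResult *match in matches) {
        [values addObject:[html substringWithRange:[match rangeAtIndex:1]]];
    }
    return values;
}
```

Unlike splitting on spaces, this keeps a value such as "NEW YORK" in one piece, because each font element yields exactly one entry.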