Regular expression: matching all alt attributes in an HTML file?
I've been looking through the existing questions and have a better idea of my problem, but I still haven't found an answer.
I have a problem with regular expressions in PHP. I'm trying to get all the text in the "alt" attributes of an HTML file. I'm taking into account all the possible tag names (img, input and area) and all kinds of eventualities, such as spaces and line breaks between the characters (like <img alt = "Hello">). The pattern must also handle a value enclosed in either single or double quotes that contains the other kind of quote mark inside, for example <img alt="Alan's picture"> or <img alt='Example for the word "hello" in the text'>.
This is becoming difficult for me (I'm a beginner with regular expressions), so I'll just show you what I have. Note that I'm trying to use a backreference inside a character class, which I found to be bad practice (or so I think).
'/<\s*(?:img|input|area)\s[^>]*alt\s*=\s*("|\')([^\1>]*)\1[^>]*>/siU'
I've also seen some people on StackOverflow recommending HTML parsers for this kind of thing, but I'm worried about how many resources that approach may consume. Do you think it is a better idea? Thank you!
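On the backreference point: inside a character class, \1 is not treated as a backreference, so [^\1>] does not mean "anything except the opening quote". A common workaround is to capture the quote, match lazily, and close with \1 outside any character class. The following is only a minimal sketch under that assumption; the sample markup is taken from the examples above, and the pattern remains fragile (for instance, it would also match an attribute such as data-alt, and it ignores unquoted values).

```php
<?php
// Sketch only: capture the opening quote, match lazily, then require the
// same quote again via \1 (used outside a character class).
$pattern = '/<\s*(?:img|input|area)\s[^>]*?\balt\s*=\s*(["\'])(.*?)\1/si';

// Sample markup copied from the examples in the question.
$html = <<<'HTML'
<img alt = "Hello">
<img alt="Alan's picture">
<img alt='Example for the word "hello" in the text'>
HTML;

preg_match_all($pattern, $html, $matches);

// Group 1 holds the quote character, group 2 the attribute value.
$alts = $matches[2];
print_r($alts);
```

The U (ungreedy) modifier from the original pattern is dropped here because it would invert the meaning of the explicit lazy quantifiers.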
Using a parser is definitely the way to go.
Regular expressions are highly inappropriate for this type of task, and even Jon Skeet cannot parse HTML using regular expressions.
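As an illustration of the parser route, here is a minimal sketch using PHP's bundled DOM extension (DOMDocument and DOMXPath). The function name extractAltTexts and the sample markup are hypothetical, chosen only for this example; the answer itself does not name a specific parser.

```php
<?php
// Hypothetical helper; DOMDocument and DOMXPath come from PHP's DOM extension.
function extractAltTexts(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them,
    // since real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only the three elements the question cares about.
    $xpath = new DOMXPath($doc);
    $alts = [];
    foreach ($xpath->query('//img[@alt] | //input[@alt] | //area[@alt]') as $node) {
        $alts[] = $node->getAttribute('alt');
    }
    return $alts;
}

// Assumed sample input covering spacing and mixed quotes.
$sample = '<p><img alt = "Hello"><img alt="Alan\'s picture">'
        . '<input type="image" alt=\'Say "hello"\'></p>';
print_r(extractAltTexts($sample));
```

The parser deals with quoting, whitespace around the equals sign, and attribute order, so none of those cases need to be encoded in a pattern. Whether the resource cost is acceptable depends on document size and volume, so it is worth measuring on representative input.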
You should absolutely use a parser. There are several reasons for this:
alt='why can't I do this'
alt="why the long space"
You can perhaps check out the StackOverflow question "Robust, Mature HTML Parser for PHP" for some suggestions about which parsers are worth using.