Regular expression for extracting HTML image attributes
I need a RegEx pattern for extracting all the properties of an image tag.
As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.
I was looking at this solution https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php but it didn't quite get it all.
I came up with something like:
(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']
Are there any cases I'm missing, or is there a simpler, more efficient pattern?
EDIT:
Sorry, I should be more specific: I'm doing this in .NET, so it's on the server side.
I already have a list of img tags; now I just need to parse the attributes.
Comments (6)
Your best bet is to use something like HTML Agility Pack instead of a regex. It's designed to handle a lot of cases and can save you more than a few headaches from hammering out edge cases.
If you want all attribute values, might I suggest using the DOM? Something like
element.attributes
will work well. If you insist on a regex,
//\b\w+="[^"]+"//
should get everything.
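To see what that pattern does and does not pick up, here is a minimal Python sketch (the asker is on .NET, but the pattern itself has no engine-specific syntax). The sample tag is made up for illustration, and the surrounding slashes are treated as delimiters and dropped.

```python
import re

# The pattern from the answer above, without the surrounding delimiters.
DOUBLE_QUOTED_ATTR = re.compile(r'\b\w+="[^"]+"')

# Hypothetical sample tag: two double-quoted attributes, one single-quoted
# attribute, and one unquoted attribute.
tag = '''<img src="a.png" alt="A cat" title='Tom' width=100>'''

matches = DOUBLE_QUOTED_ATTR.findall(tag)
print(matches)  # ['src="a.png"', 'alt="A cat"']
```

On this input only the double-quoted attributes are returned; the single-quoted title and the unquoted width are skipped. Because of the `+`, an empty value such as alt="" would be skipped as well.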
Before committing yourself to regex, see what it can do: RegEx match open tags except XHTML self-contained tags
A match_all on this will return (the format depends on your library, but the key indexes are):
It won't. Use an HTML parser if you have to parse "evil" HTML (HTML from an unknown source).
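As a rough illustration of why, the pattern from the question can be run against a tag that is perfectly legal but slightly awkward. This is a sketch in Python rather than .NET, and the sample tag is invented for the purpose.

```python
import re

# The pattern from the question, unchanged.
ASKER_PATTERN = re.compile(
    r'''(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']'''
)

# Hypothetical sample: an apostrophe inside a double-quoted value and an
# unquoted width. Both are common in real-world markup.
tag = '''<img src="a.png" alt="Tom's cat" width=100>'''

found = [m.group(0) for m in ASKER_PATTERN.finditer(tag)]
print(found)  # ['src="a.png"', 'alt="Tom\'']
```

The lazy `[\W\w]+?` stops at the first quote of either kind, so the alt value is cut off at the apostrophe, and width=100 is missed entirely because the pattern requires a quote right after the equals sign.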
If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker, less likely to have bugs (since the authors have already thought of the corner cases), and it will handle potentially malformed markup.
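A minimal sketch of the parser route, assuming Python on the server. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; `ImgAttributeCollector` and `extract_img_attributes` are names made up for this example, and the sample tag is hypothetical.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs. Names arrive lowercased,
        # and value is None for attributes written without "=".
        if tag == "img":
            self.images.append(dict(attrs))

    # Self-closing tags (<img ... />) go through handle_startendtag, which
    # by default calls handle_starttag, so they are collected as well.


def extract_img_attributes(html):
    """Return one attribute dict per <img> tag found in the given HTML."""
    parser = ImgAttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Mixed quoting styles, an apostrophe inside a value, an unquoted value,
# and an upper-case attribute name.
sample = '''<img SRC="a.png" alt="Tom's cat" width=100 title='Tom'>'''
print(extract_img_attributes(sample))
```

The parser does the quote handling and name normalisation, so the calling code only ever sees name/value pairs. In .NET, the HTML Agility Pack suggested in another answer plays the same role.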