Regular expression for extracting HTML image attributes

Posted on 2024-07-10 17:18:01

I need a RegEx pattern for extracting all the properties of an image tag.

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those cases.

I looked at this solution, https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php, but it doesn't quite cover everything.

I came up with something like:

(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']

Are there any cases I'm missing, or is there a simpler, more efficient pattern?

EDIT:

Sorry, I should be more specific: I'm doing this in .NET, so it runs on the server side.

I already have a list of img tags; now I just need to parse their attributes.
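
To see where the pattern from the question falls short, here is a minimal Python sketch. Python's `re` module stands in for the .NET engine, and the sample tag, the name `TIGHTER`, and the helper `parse_attributes` are made up for illustration. It runs the question's pattern unchanged, then a tightened variant that gives each quote style its own branch and also accepts empty and unquoted values.

```python
import re

# The pattern from the question, unchanged.
QUESTION = re.compile(r"""(alt|title|src|height|width)\s*=\s*["'][\W\w]+?["']""")

# Hypothetical tightened variant: each quote style gets its own branch, so a
# value can contain the other kind of quote, be empty, or be unquoted.
TIGHTER = re.compile(
    r"""\b(alt|title|src|height|width)\s*=\s*
        (?: "([^"]*)"        # double-quoted value (may be empty)
          | '([^']*)'        # single-quoted value (may be empty)
          | ([^\s"'>]+)      # unquoted value
        )""",
    re.IGNORECASE | re.VERBOSE,
)

def parse_attributes(tag):
    """Return {name: value} for the five attributes of interest in one tag."""
    result = {}
    for m in TIGHTER.finditer(tag):
        # Exactly one of groups 2-4 took part in the match.
        value = next(g for g in m.groups()[1:] if g is not None)
        result[m.group(1).lower()] = value
    return result

tag = '''<img src="a.png" alt='it"s' width=100 title="">'''
question_matches = [m.group(0) for m in QUESTION.finditer(tag)]
print(question_matches)       # the question's pattern on the sample tag
print(parse_attributes(tag))  # the tightened variant on the same tag
```

On this sample the question's pattern stops the `alt` value at the first quote of either kind and skips `width=100` and `title=""` entirely. The tightened variant is still only a sketch: it would, for example, also match the `alt` part of a name such as `data-alt`.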

南风起 2024-07-17 17:18:02

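Your best bet is to use something like the HTML Agility Pack rather than a regular expression. It is designed to handle a lot of cases and will save you a good deal of the headache of nailing down edge cases.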

青衫负雪 2024-07-17 17:18:02

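If you want all the attribute values, may I suggest using the DOM? Something like element.attributes would work well.

If you insist on using a regular expression, /\b\w+="[^"]+"/ should get everything.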
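
A quick way to check what that pattern actually picks up is to run it on a deliberately awkward tag. The sketch below uses Python's `re` as a stand-in engine, and the sample tag is invented for the purpose.

```python
import re

# The pattern suggested above, without the surrounding slashes.
ANSWER = re.compile(r'\b\w+="[^"]+"')

# Made-up tag mixing double-quoted, single-quoted, unquoted and empty values.
tag = '''<img src="a.png" data-id="7" alt='logo' width=100 title="">'''

matches = ANSWER.findall(tag)
print(matches)
```

On this input only the non-empty, double-quoted values are found, and `data-id` is reported as `id` because `\w` does not match a hyphen.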

夏至、离别 2024-07-17 17:18:02

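Before you try to use a regular expression for this, first understand what it can and cannot do; see the question "RegEx match open tags except XHTML self-contained tags".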

能否归途做我良人 2024-07-17 17:18:02

/<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s"'>]+))+\s*\/?>/i

A match_all on this will return the following (the format depends on your library, but the key indexes are):

0 -> image tag
1 -> attribute
2 -> attribute name
3 -> attribute value (with enclosing quotes, if any)
4 -> attribute value (without enclosing quotes if it has them; otherwise empty, so use 3)
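
One thing to watch with a repeated capture group: in many engines the numbered groups hold only the last repetition, so a single tag match does not list every attribute (.NET is an exception, since it keeps each repetition in `Group.Captures`). The Python sketch below illustrates this. It writes the unquoted branch as `[^\s"'>]+` so that multi-character unquoted values match, and the names `TAG` and `ATTR` and the sample HTML are invented for the example.

```python
import re

# Tag-level pattern in the spirit of the answer (no slash delimiters in Python).
TAG = re.compile(
    r"""<img(\s+([a-z]{3,})=(["']([^"']*)["']|[^\s"'>]+))+\s*/?>""", re.I
)
# The attribute sub-pattern on its own, used to walk one matched tag.
ATTR = re.compile(r"""\s+([a-z]{3,})=(["']([^"']*)["']|[^\s"'>]+)""", re.I)

html = '<p><img src="a.png" alt=\'logo\' width=100 /></p>'

m = TAG.search(html)
# Groups 2 and 3 reflect only the last attribute that was repeated over.
last_name, last_value = m.group(2), m.group(3)
print(last_name, last_value)

# Iterating with ATTR yields one fresh match per attribute. Group 3 is the
# unquoted text of a quoted value; for an unquoted value fall back to group 2.
attrs = [
    (a.group(1), a.group(3) if a.group(3) is not None else a.group(2))
    for a in ATTR.finditer(m.group(0))
]
print(attrs)
```

Because `[a-z]{3,}` requires at least three letters, a short attribute name such as `id` would make the tag-level match fail, so treat this as a sketch for the attributes shown rather than a general solution.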

软的没边 2024-07-17 17:18:01

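"As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities."

It won't. If you have to parse "evil" HTML (from an unknown source), use an HTML parser.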

梦幻之岛 2024-07-17 17:18:01

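If performance isn't a big concern, I would use an HTML parser (such as BeautifulSoup in Python) if you are doing this on the server side, or jQuery or plain JavaScript if you are doing it on the client side. Granted, it's a bit of overkill, but it is quicker to implement, less error-prone (because the edge cases have already been considered), and it will handle potentially malformed markup.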
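
As a rough illustration of the parser route, here is a sketch that uses Python's standard-library `html.parser` in place of BeautifulSoup or the HTML Agility Pack (a substitution made only to keep the example dependency-free). The class name `ImgAttributeCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class ImgAttributeCollector(HTMLParser):
    """Collects the attributes of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.images = []  # one dict of attributes per <img> tag

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs, with value None when the attribute has no value.
        if tag == "img":
            self.images.append(dict(attrs))

collector = ImgAttributeCollector()
# Sloppy markup on purpose: upper-case names, an unquoted value, a quote
# inside a value, an empty value and a valueless attribute.
collector.feed('''<IMG SRC=a.png alt='it"s' width=100 title="" ismap>''')
print(collector.images)
```

The parser takes care of quoting, case and empty values, so no attribute-level pattern is needed. Self-closing tags such as `<img ... />` also end up in `handle_starttag`, because the default `handle_startendtag` calls it.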
