Regex for scraping image tags
I'm really REALLY bad at regular expressions. It just hasn't clicked yet. I'm trying to make a small application that extracts the src, width, and height attributes of all image tags on a page. This is what I have so far:
<?php
function print_links ($url)
{
    $fp = fopen($url, "r") or die("Could not contact $url");
    $page_contents = "";
    while ($new_text = fread($fp, 100)) {
        $page_contents .= $new_text;
    }
    $match_result =
        preg_match_all( '/<img.*src=[\"\'](.*)[\"\'].*width=(\d+).*height=(\d+).*/>/i',
            $page_contents,
            $match_array,
            PREG_SET_ORDER);
    echo "number matched is: $match_result<br><br> ";
    print_r($match_array);
    foreach ($match_array as $entry) {
        $tag = $entry[0];
        $src = $entry[1];
        $width = $entry[2];
        $height = $entry[3];
        print (" <b>src</b>: $src;
            <b>width</b>: $width<br />
            <b>height</b>: $height<br />
            <b>tag</b>: $tag<br />"
        );
    }
}
print_links ("http://www.drudgereport.com/");
?>
but I get this little error:
Warning: preg_match_all(): Unknown modifier '>' in C:\Apache2.2\htdocs\it302\regex\regex.php on line 17 number matched is:
I'm not sure where I went wrong in my regexp. I've tried multiple things but have ended up just as confused.
Any suggestions?
Comments (2)
In your regex, the final
.*/>
is wrong. The unescaped / is the pattern delimiter, so it ends the regex there and PHP treats the > that follows as a modifier. Either drop the / or escape it and make it optional:
\/?
Even then, this regex only works if src, width, and height appear in exactly that order within the img tag. Width and height can also be quoted and carry units, e.g. width="0.9em" is valid html...
These are all reasons why you should not use regex to parse html (and there are many more...)
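To make the delimiter point concrete, here is a minimal sketch of one way the pattern could be repaired. It is not the asker's final code: the function name match_img_tags and the sample markup are made up for illustration, and using ~ as the delimiter (so that / needs no escaping) is just one option alongside the \/? fix mentioned above.

```php
<?php
// Hypothetical helper: returns one entry per <img> tag the pattern matches.
function match_img_tags(string $html): array
{
    // "~" is the delimiter here, so "/" is an ordinary character.
    // [^>]* and [^"']* keep every match inside a single tag,
    // unlike the greedy .* in the original pattern.
    $pattern = '~<img[^>]*src=["\']([^"\']*)["\'][^>]*width=(\d+)[^>]*height=(\d+)[^>]*/?>~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    return $matches;
}

// Made-up sample markup, only to exercise the pattern.
$sample = '<img src="a.png" width=10 height=20 />'
        . '<img src="b.png" width=30 height=40>'
        . '<img height="40" width="0.9em" src="c.png">';

foreach (match_img_tags($sample) as $entry) {
    echo "src: {$entry[1]}, width: {$entry[2]}, height: {$entry[3]}\n";
}
```

The third tag in the sample is skipped because its attributes are quoted and in a different order, which is exactly the limitation described above.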
Do not use regex for this. Especially if you are REALLY bad :)
http://simplehtmldom.sourceforge.net/
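For comparison, a parser-based version might look roughly like the sketch below. Note that it uses PHP's bundled DOMDocument class rather than the Simple HTML DOM library linked above, so that it runs without extra downloads. The function name list_images and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collects src, width and height of every <img> element.
function list_images(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string when the attribute is absent.
        $images[] = [
            'src'    => $img->getAttribute('src'),
            'width'  => $img->getAttribute('width'),
            'height' => $img->getAttribute('height'),
        ];
    }
    return $images;
}

// Made-up sample markup with attributes in different orders and quoting styles.
$sample = '<html><body><p><img src="a.png" width=10 height=20 />'
        . '<img height="40" width="0.9em" src="c.png"></p></body></html>';
print_r(list_images($sample));
```

Attribute order, quoting, and units no longer matter here, because the parser hands back each attribute value as a string. To run it against a live page, the markup could be fetched first, for example with file_get_contents($url), assuming URL wrappers are enabled on the server.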