Regex to replace image links with their respective image tags

I'm looking for a PHP preg_replace() solution to find links to images and replace them with their respective image tags.

Find:

<a href="http://www.domain.tld/any/valid/path/to/imagefile.ext">This will be ignored.</a>

Replace with:

<img src="http://www.domain.tld/any/valid/path/to/imagefile.ext" alt="imagefile" />

Where the protocol MUST be http://, the .ext MUST be a valid image format (.jpg, .jpeg, .gif, .png, .tif), and the base file name becomes the alt="" value.

I know preg_replace() is the right function for the job, but I suck with regex, so any help is greatly appreciated! THANKS!

Comments (3)

执笏见 2024-08-11 19:13:08

Congratulations, you are the one millionth customer to ask Stack Overflow how to parse HTML with regex!

[X][HT]ML is not a regular language and cannot reliably be parsed with regex. Use an HTML parser. PHP itself gives you DOMDocument, or you may prefer simplehtmldom.

Incidentally, you cannot tell what type a file is by looking at its URL. There is no reason a JPEG has to have '.jpeg' as its extension, and indeed no guarantee that a file with a '.jpeg' extension will actually be a JPEG. The only way to be certain is to fetch the resource (e.g. using a HEAD request) and look at the Content-Type header.
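
A minimal sketch of that HEAD check, assuming allow_url_fopen is enabled so that get_headers() can reach the URL. The function names isImageContentType() and urlServesImage() and the 5-second timeout are made up for this illustration; they are not part of the answer above.

```php
<?php
// Hypothetical helper: true when a Content-Type value names an image type.
function isImageContentType(string $contentType): bool
{
    // Drop parameters such as "; charset=binary" and compare the media type only.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return strncmp($mediaType, 'image/', 6) === 0;
}

// Hypothetical helper: sends a HEAD request and inspects the Content-Type header.
function urlServesImage(string $url): bool
{
    // Assumption: a 5 second timeout is acceptable for this check.
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 5]]);
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return false; // request failed, so nothing can be confirmed
    }

    // Header names are case-insensitive, so normalise the keys first.
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-type'])) {
        return false;
    }

    // After redirects the value can be an array; the last entry is the final response.
    $type = $headers['content-type'];
    if (is_array($type)) {
        $type = end($type);
    }
    return isImageContentType((string) $type);
}
```

This costs one request per link, so it is probably worth running only on links that already passed a cheaper check.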

治碍 2024-08-11 19:13:08

Ahh, my daily DOM practice. You should use DOM to parse HTML and regex to parse strings such as HTML attributes.

Note: I have some basic regexes that could surely be improved upon by some wizards :)

Note #2: Though it adds overhead, you could use something like curl to check thoroughly whether the href is an actual image by sending a HEAD request and looking at the Content-Type, but the extension check below would work in 80-90% of cases.

<?php

$content = '
<a href="http://www.domain.tld/any/valid/path/to/imagefile.ext">This will be ignored.</a>
<br>
<a href="http://col.stb.s-msn.com/i/43/A4711309495C88F8CD154C99FCE.jpg">this will not be ignored</a>
<br>
<a href="http://col.stb.s-msn.com/i/A0/8E9A454F701E4F5F89E58E14B532C.jpg">bah</a>
';

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

// The href must start with http:// ...
$protocol = '/^http:\/\//';
// ... and end in a known image extension; group 1 is the base file name.
$ext = '/([\w+]+)\.(?:gif|jpg|jpeg|png)$/';

// Walk the list backwards: it is live, so replacing nodes while counting
// upwards would skip elements.
for ($i = $anchors->length - 1; $i >= 0; $i--) {
    $anchor = $anchors->item($i);
    if (!$anchor->hasAttribute('href')) {
        continue;
    }

    $link = $anchor->getAttribute('href');
    if (preg_match($protocol, $link) && preg_match($ext, $link, $matches)) {
        // Build the <img> and swap it in for the <a>.
        $image = $dom->createElement('img');
        $image->setAttribute('alt', $matches[1]);
        $image->setAttribute('src', $link);
        $anchor->parentNode->replaceChild($image, $anchor);
    }
}

echo $dom->saveHTML();
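
As a possible alternative to the two regexes, the href can also be checked with PHP's own parse_url() and pathinfo(). This is only a sketch: the function name imageAltFromUrl() is invented here, and the extension list is taken from the question (so it includes .tif, which the code above does not).

```php
<?php
// Hypothetical helper: returns the base file name to use as alt text,
// or null when the URL is not an http:// link to an allowed image extension.
function imageAltFromUrl(string $url): ?string
{
    // Extension list as stated in the question.
    $allowed = ['jpg', 'jpeg', 'gif', 'png', 'tif'];

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['path'])) {
        return null;
    }
    // The question requires the http:// protocol specifically.
    if (strtolower($parts['scheme'] ?? '') !== 'http') {
        return null;
    }

    // pathinfo() works on the path only, so a query string cannot confuse it.
    $info = pathinfo($parts['path']);
    $extension = strtolower($info['extension'] ?? '');
    if (!in_array($extension, $allowed, true) || $info['filename'] === '') {
        return null;
    }
    return $info['filename'];
}
```

Inside the loop, a non-null return value could stand in for the two preg_match() calls and supply the alt attribute directly.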
踏月而来 2024-08-11 19:13:08

I would suggest using this more flexible non-greedy regex:

<a[^>]+?href=\"(http:\/\/[^\"]+?\/([^\"]*?)\.(jpg|jpeg|png|gif))[^>]*?>[^<]*?<\/a>
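
A sketch of how a pattern of this shape might be plugged into preg_replace(). It is a variation, not the exact regex above: the base-name group excludes "/" so that only the file name (not the whole path) ends up in alt, the closing quote of href is matched explicitly, and .tif is added because the question lists it. The function name linksToImages() is made up for the example.

```php
<?php
// Hypothetical helper: replaces <a href="http://...image.ext">text</a>
// with <img src="..." alt="basename" />. Only double-quoted hrefs are handled.
function linksToImages(string $html): string
{
    // Group 1: the full URL. Group 2: the base file name without extension.
    $pattern = '~<a\b[^>]*?\bhref="(http://[^"]+/([^"/]+)\.(?:jpe?g|png|gif|tif))"[^>]*>[^<]*</a>~i';
    return preg_replace($pattern, '<img src="$1" alt="$2" />', $html);
}

$html = '<a href="http://www.domain.tld/any/valid/path/to/imagefile.jpg">This will be ignored.</a>';
echo linksToImages($html), "\n";
```

Links with another protocol or extension are left untouched, as is any anchor whose content contains nested tags, since [^<]* stops at the first "<".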

And a more complex regex (including PHP test code) to hopefully please Gumbo :)

<?php
$test_data = <<<END
<a blabla="asldlsaj" alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
Lorem ipsum..
<a    blabla=asldlsaj alksjada="aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a lkjafs='asdsa> ' blabla="asldlksjada=>"aslkdj" href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a    blabla="ajada="aslk href="http://www.domain.tld/any/valid/path>/to/imagefile.jpg" lkjasd>asdlaskjd>This will be ignored.</a>
<a    blabla="asldlsaj>" aslkdj href="http://www.domain.tld/any/valid/path/ to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
Something:
<a    blabla='asldls<ajslkdj' href="http://www.domain.tld/any/valid'/path/to/imagefile.jpg" lkjasd=""asdlaskjd>This will be ignored.</a>
<a    blabla=  asldlsadj href="http://www.domain.tld/any/valid/path/to/imagefile.jpg" lkjasd>This will be ignored.</a>
<a blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
Something else...
<a    blabla="asldlsaj" alksjslkdj" href='http://www.domain.tld/any/valid/path/to/imagefile.jpg' lkjasdskjd>This will be ignored.</a>
<a    blabla="asldlsaj" alksjada="aslkdj" href=http://www.domain.tld/any/valid/path/to/imagefile.jpg lkjdlaskjdll> be ignored.</a>
END;
$regex = "/<a\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+?\s+href\s*=\s*(\"(http:\/\/[^\"]+\/(.*?)\.(jpg|jpeg|png|gif))\"|'(http:\/\/[^']+\/(.*?)\.(jpg|jpeg|png|gif))'|(http:\/\/[^'\">\s]+\/([^'\">\s]+)\.(jpg|jpeg|png|gif)))\s(\s*\w+(\s*=\s*(\".*?\"|'.*?'|[^'\">\s]+))?)+>[^<]*?<\/a>/i";
$replaced = preg_replace($regex, '<img src="$5$8$11" alt="$6$9$12" />', $test_data);

echo '<pre>'.htmlentities($replaced);
?>