Using preg_match to find and validate the types of links embedded in HTML

Posted on 2024-12-23 15:27:24

I have implemented a function to validate .edu domains. This is how I am doing it:

if( preg_match('/edu/', $matches[0])==FALSE )
    return FALSE;
return TRUE;

Now I also want to skip URLs that point to documents such as .pdf and .doc.

For this, the following code should have worked, but it does not:

if( preg_match('/edu/', $matches[0])==FALSE || preg_match('/pdf/i', $matches[0])!=FALSE || preg_match('/doc/i', $matches[0]!=FALSE))
        return FALSE;
return TRUE;

Where am I going wrong?

Moreover, how can I use preg_match with a list of document types to check for in a URL string? If a document of one of those types is found, the function should return false. In other words, I want to provide a list (an array, maybe) of document types as the $pattern to look for in a URL.

Note: $matches[0] contains the whole URL string, e.g. http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf

The code for the function:

public function validateEduDomain($url) {
    // get host name from URL
    preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
    $host = $matches[1];

    // get last two segments of host name
    preg_match('/[^.]+\.[^.]+$/', $host, $matches);

    if( preg_match('/edu/', $matches[0])!=FALSE && (preg_match('/pdf/i', $matches[0])==FALSE || preg_match('/doc/i', $matches[0]==FALSE)))      
        return TRUE;
    return FALSE;
}

养猫人 2024-12-30 15:27:24

I wonder why you are making everything so complicated. I also noticed that you have $$matches[0] instead of $matches[0]. The regexes you want are:

if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $matches[0]) && !preg_match('/\.(pdf|doc)$/i', $matches[0]) ) {
    // the URL is on an .edu domain and does not end in .pdf or .doc
    // do something here...
}
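
As for the condition in the question itself, one likely reason the .doc check never fires is the position of the closing parenthesis: in `preg_match('/doc/i', $matches[0]!=FALSE)` the comparison is evaluated first and its boolean result is passed as the subject. A second likely issue, assuming the function shown in the question is the code actually being run, is that the second preg_match call overwrites $matches, so $matches[0] holds only the last two host labels and not the whole URL. A minimal sketch of both effects follows; the .doc sample URL is made up for illustration, and non-strict typing is assumed:

```php
<?php
// Illustrative URL (an assumption, not taken from the question).
$url = 'http://www.nust.edu.pk/Documents/pdf/NNBS_Form.doc';

// As written in the question: the comparison sits inside the call,
// so the subject is the boolean true, which PHP converts to "1".
$misplaced = preg_match('/doc/i', $url != FALSE);

// Parenthesis closed before the comparison: the subject is the URL itself.
$intended = preg_match('/doc/i', $url) != FALSE;

var_dump($misplaced); // int(0)
var_dump($intended);  // bool(true)

// The two calls from the question's function, in the same order.
preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
$host = $matches[1];
preg_match('/[^.]+\.[^.]+$/', $host, $matches);

// $matches[0] no longer contains the path, so a pdf/doc test on it cannot match.
var_dump($matches[0]); // string(6) "edu.pk"
```

Running the extension checks against $url (or its path) instead of $matches[0] would avoid the second problem.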

山田美奈子 2024-12-30 15:27:24

You can see if a file extension matches with something like:

 preg_match('/\.php$/i', $string);

Also, why are you using the double dollar sign for the 2nd and 3rd usages of $matches[0]?
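
The same end-anchored idea can be extended to a list of extensions, which is what the question asks for. The sketch below builds one pattern from an array; the helper name `has_listed_extension` and the sample URLs other than the one from the question are assumptions made for illustration. It tests only the path component, so a directory named `pdf` or a query string does not affect the result.

```php
<?php
// Hypothetical helper: true when the URL path ends in one of the listed extensions.
function has_listed_extension($url, array $extensions)
{
    // Escape each extension so it is matched literally inside the pattern.
    $quoted = array_map(function ($ext) {
        return preg_quote($ext, '/');
    }, $extensions);

    // e.g. /\.(?:pdf|doc|docx)$/i
    $pattern = '/\.(?:' . implode('|', $quoted) . ')$/i';

    // parse_url() returns null when there is no path and false on a malformed URL.
    $path = parse_url($url, PHP_URL_PATH);

    return is_string($path) && preg_match($pattern, $path) === 1;
}

$types = array('pdf', 'doc', 'docx');

var_dump(has_listed_extension('http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf', $types)); // bool(true)
var_dump(has_listed_extension('http://www.nust.edu.pk/Documents/pdf/index.html', $types));    // bool(false)
var_dump(has_listed_extension('http://example.edu/a.DOCX?x=1', $types));                      // bool(true)
```

A validation function would then return false whenever this helper returns true.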

极致的悲 2024-12-30 15:27:24

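If I understood correctly, something like this can help: http://ideone.com/XOEiU

function validate_path($url) {
    $url_parts = parse_url($url);
    // A URL such as http://www.nust.edu.pk has no path component
    $path_info = pathinfo(isset($url_parts['path']) ? $url_parts['path'] : '');
    $extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';

    // Accept only .edu hosts whose path does not point to an unwanted document type
    return isset($url_parts['host'])
        && preg_match('/\\.edu(?:\\.|$)/', $url_parts['host'])
        && !in_array($extension, array('pdf', 'doc', 'docx'));
}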

终陌 2024-12-30 15:27:24

I wouldn't use a regular expression for this:

function is_edu_domain($url)
{
    $parsed = parse_url($url);
    $parts = explode('.', $parsed['host']);
    return in_array('edu', $parts, TRUE);
}

This matches the domains you specified in your comments.
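
To make the behaviour of a label-based check concrete, here is a small sketch. The helper name `host_has_edu_label` and every sample URL except the one from the question are assumptions for illustration. It accepts `edu` as any dot-separated label of the host, and it returns false when no host can be parsed, for example when the scheme is missing.

```php
<?php
// Hypothetical helper: true when any dot-separated host label is exactly "edu".
function host_has_edu_label($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host could be parsed, e.g. the scheme is missing
    }
    return in_array('edu', explode('.', strtolower($host)), true);
}

var_dump(host_has_edu_label('http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf')); // bool(true)
var_dump(host_has_edu_label('http://www.mit.edu/'));                                // bool(true)
var_dump(host_has_edu_label('http://education.example.com/'));                      // bool(false)
var_dump(host_has_edu_label('http://edu.example.com/'));                            // bool(true)
var_dump(host_has_edu_label('www.nust.edu.pk/index.html'));                         // bool(false)
```

Note that a host such as edu.example.com also passes, which may or may not be acceptable depending on how strict the validation needs to be.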

For the file extensions I would have a separate function that is easier to maintain:

function is_unwanted_file_extension($url)
{
    $path = pathinfo($url);
    // pathinfo() omits the 'extension' key when the URL has no extension
    $extension = isset($path['extension']) ? strtolower($path['extension']) : '';
    $unwanted_extensions = explode(',', 'pdf,doc');
    return in_array($extension, $unwanted_extensions, TRUE);
}

You can combine the two:

function is_url_from_edu_and_wanted($url)
{
    return is_edu_domain($url) and !is_unwanted_file_extension($url);
}

This is much more readable and maintainable than regular expressions, but note that I have optimised for those qualities and not for speed.

About the author: 谁的新欢旧爱
