Using preg_match to find and validate the type of links embedded in HTML

Posted on 2024-12-23 15:27:24

I have implemented a function to validate .edu domains. This is how I am doing it:

if( preg_match('/edu/', $matches[0])==FALSE )
    return FALSE;
return TRUE;

Now I also want to skip URLs that point to documents such as .pdf and .doc files.

For this, I expected the following code to work, but it does not:

if( preg_match('/edu/', $matches[0])==FALSE || preg_match('/pdf/i', $matches[0])!=FALSE || preg_match('/doc/i', $matches[0]!=FALSE))
        return FALSE;
return TRUE;

Where am I going wrong?
Moreover, how can I use preg_match with a list of document types to check for in a URL string? If a document of one of those types is found, the function should return false. In other words, I want to provide a list (an array, maybe) of document types as the $pattern to look for in a URL.

Note:
$matches[0] contains the whole URL string.
e.g. http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf

The code for the function:

public function validateEduDomain($url) {
    // get host name from URL
    preg_match('@^(?:http://)?([^/]+)@i', $url, $matches);
    $host = $matches[1];

    // get last two segments of host name
    preg_match('/[^.]+\.[^.]+$/', $host, $matches);

    if( preg_match('/edu/', $matches[0])!=FALSE && (preg_match('/pdf/i', $matches[0])==FALSE || preg_match('/doc/i', $matches[0]==FALSE)))      
        return TRUE;
    return FALSE;
}

Comments (4)

养猫人 2024-12-30 15:27:24

I wonder why you are making everything so complicated. I also noticed you have $$matches[0] instead of $matches[0]. The regexes you want are:

if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $matches[0]) && !preg_match('/\.(pdf|doc)$/i', $matches[0]) ) {
    // do something here...
}
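
To check a whole list of document types instead of hard-coding each one, the extension pattern can be built from an array. The following is only a minimal sketch, not the answerer's code: the function name `has_unwanted_extension`, its default extension list, and the second sample URL are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns true when the URL path ends in one of the
// given file extensions. The name and the default list are illustrative.
function has_unwanted_extension($url, array $extensions = array('pdf', 'doc', 'docx'))
{
    // Escape each extension so it is matched literally inside the regex.
    $quoted = array();
    foreach ($extensions as $extension) {
        $quoted[] = preg_quote($extension, '/');
    }

    // Produces e.g. /\.(?:pdf|doc|docx)$/i ; the group keeps "\." and "$"
    // applied to every alternative.
    $pattern = '/\.(?:' . implode('|', $quoted) . ')$/i';

    // Match against the path only, so a query string cannot hide the extension.
    $path = (string) parse_url($url, PHP_URL_PATH);

    return preg_match($pattern, $path) === 1;
}

// URL from the question: the path ends in ".pdf".
var_dump(has_unwanted_extension('http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf')); // bool(true)
// Made-up URL: "pdf" is only a directory name here.
var_dump(has_unwanted_extension('http://www.nust.edu.pk/Documents/pdf/'));              // bool(false)
```

Grouping the alternatives inside `(?:...)` is what keeps the leading dot and the end-of-string anchor attached to every extension in the list.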
山田美奈子 2024-12-30 15:27:24

You can see if a file extension matches with something like:

 preg_match('/\.php$/i', $string);  

Also, why are you using the double dollar sign for the 2nd and 3rd usages of $matches[0]?
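
The `\.` and `$` in that pattern matter: an unanchored pattern such as `/pdf/i` matches "pdf" anywhere in the URL, including a directory name. A small sketch of the difference, using a made-up URL (the `index.html` path is an assumption and does not come from the question):

```php
<?php
// Hypothetical URL: the directory is named "pdf" but the file is an HTML page.
$url = 'http://www.nust.edu.pk/Documents/pdf/index.html';

// Unanchored: matches the "pdf" directory name, so the page looks like a PDF.
$unanchored = preg_match('/pdf/i', $url);

// Anchored: requires a literal dot plus "pdf" at the very end of the string.
$anchored = preg_match('/\.pdf$/i', $url);

var_dump($unanchored); // int(1)
var_dump($anchored);   // int(0)
```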

极致的悲 2024-12-30 15:27:24

If I understood correctly, something like this can help: http://ideone.com/XOEiU

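function validate_path($url) {
    $url_parts = parse_url($url);
    // Not every URL has a path or a file extension, so default both to ''.
    $path = isset($url_parts['path']) ? $url_parts['path'] : '';
    $path_info = pathinfo($path);
    $extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';

    // Valid only if the host has an "edu" label and the path is not an unwanted document.
    return preg_match('/\\.edu(?:\\.|$)/', $url_parts['host']) && !in_array($extension, array('pdf', 'doc', 'docx'));
}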
终陌 2024-12-30 15:27:24

I wouldn't use a regular expression for this:

function is_edu_domain($url)
{
    $parsed = parse_url($url);
    $parts = explode('.', $parsed['host']);
    return in_array('edu', $parts, TRUE);
}

This matches the domains you specified in your comments.

For the file extensions I would have a separate function that is easier to maintain:

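function is_unwanted_file_extension($url)
{
    // Inspect only the path, so the host or a query string is never mistaken for an extension.
    $path = pathinfo((string) parse_url($url, PHP_URL_PATH));
    // URLs without a file extension have no 'extension' key.
    $extension = isset($path['extension']) ? strtolower($path['extension']) : '';
    $unwanted_extensions = explode(',', 'pdf,doc');
    return in_array($extension, $unwanted_extensions, TRUE);
}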

You can combine the two:

function is_url_from_edu_and_wanted($url)
{
    return is_edu_domain($url) and !is_unwanted_file_extension($url);
}
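
To see how the two checks behave together, here is a self-contained sketch of the same idea with the extension list passed in as an array. The function name `is_wanted_edu_url` and every sample URL except the one from the question are assumptions made for illustration.

```php
<?php
// Hypothetical combined check: true when the host contains an "edu" label
// and the path does not end in one of the unwanted extensions.
function is_wanted_edu_url($url, array $unwanted_extensions = array('pdf', 'doc'))
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host, e.g. a relative or malformed URL
    }
    $is_edu = in_array('edu', explode('.', strtolower($host)), true);

    // Look at the path only; URLs without an extension yield ''.
    $path_info = pathinfo((string) parse_url($url, PHP_URL_PATH));
    $extension = isset($path_info['extension']) ? strtolower($path_info['extension']) : '';

    return $is_edu && !in_array($extension, $unwanted_extensions, true);
}

$urls = array(
    'http://www.nust.edu.pk/Documents/pdf/NNBS_Form.pdf', // from the question
    'http://www.nust.edu.pk/Documents/',                  // made-up example
    'http://www.example.com/index.html',                  // made-up example
);
foreach ($urls as $url) {
    echo $url, ' => ', var_export(is_wanted_edu_url($url), true), "\n";
}
```

With these inputs only the second URL is accepted: the first is an .edu host but points at a PDF, and the third is not an .edu host.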

This is much more readable and maintainable than regular expressions, but note that I have optimised for readability and maintainability, not for speed.
