How do I separate possible URIs from other content in PHP?

Posted on 2024-10-06 20:29:25

What is the simplest and fastest way to check whether a string is a single URL or text (which might contain URLs)?

Possible scenarios:

// successful scenario
$example[] = 'http://sub-domain.my-domain.com/folder/file.php?some=param';
// successful scenario
$example[] = '/assets/scripts/jquery.min.js?v=1.4';
// successful scenario
$example[] = 'jquery.min.js';
// this scenario should fail validation
$example[] = "http://www.domain.com welcome text\n and some other http://www.domain.com";
// this scenario should fail validation
$example[] = "scriptVar=50;";

I have tried native PHP functions such as parse_url and filter_var, but none of them work as expected.

UPDATE 1

To make it clearer: I'm trying to separate possible URIs from script content that will be inserted as DOM elements. All URLs would go into the SRC attribute and everything else would be inserted as content, for example:

<script type="text/javascript" src="{$string}"></script>
<script type="text/javascript">{$string}</script>

UPDATE 2
By analysing the possible content, I have come to the conclusion that a string containing a whitespace character or a semicolon cannot be a URI. I presume this pattern could solve my problem:

preg_match('/[\s]|[;]/', $string);

Would it cover all possible JavaScript/CSS code?
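
The heuristic from UPDATE 2 can be sketched as a pair of small helpers. This is a minimal sketch rather than a complete validator: the names looks_like_uri and render_script_tag are hypothetical, treating an empty string as content is an assumption, and the pattern is the one above merged into a single character class.

```php
<?php
// Hypothetical helper: applies the UPDATE 2 heuristic (no whitespace, no semicolon).
function looks_like_uri(string $string): bool
{
    // Assumption: an empty string is treated as content, not as a URI.
    if ($string === '') {
        return false;
    }
    // preg_match() returns 0 when the string contains neither whitespace nor ';'.
    return preg_match('/[\s;]/', $string) === 0;
}

// Hypothetical helper: builds one of the two tags from UPDATE 1.
function render_script_tag(string $string): string
{
    if (looks_like_uri($string)) {
        // Escape the value because it ends up inside an HTML attribute.
        $src = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
        return '<script type="text/javascript" src="' . $src . '"></script>';
    }
    return '<script type="text/javascript">' . $string . '</script>';
}

$examples = [
    'http://sub-domain.my-domain.com/folder/file.php?some=param',
    '/assets/scripts/jquery.min.js?v=1.4',
    'jquery.min.js',
    "http://www.domain.com welcome text\n and some other http://www.domain.com",
    'scriptVar=50;',
];

foreach ($examples as $example) {
    echo looks_like_uri($example) ? 'URI:     ' : 'content: ', $example, PHP_EOL;
}
```

The heuristic is not exhaustive: a statement that contains neither whitespace nor a semicolon, such as alert(1), would still be classified as a URI.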

Comments (3)

素年丶 2024-10-13 20:29:25
$exampleData = Array(
    'http://sub-domain.my-domain.com/folder/file.php?some=param',
    '/assets/scripts/jquery.min.js?v=1.4',
    '<a href="/assets/scripts/jquery.min.js?v=1.4">',
    '<a href="assets/scripts/jquery.min.js?v=1.4">',
    'http://www.domain.com welcome text\n and some other http://www.domain.com',
);

foreach($exampleData as $example)
{
    echo "Trying \"" . $example . "\" -> ";

    echo (preg_match('%((http(s)?://|www\.)[^ \r\n]+|<a.+?href=(\'|")(http(s)?://|www\.|[^#])[^\4\r\n]*?\4.*?>)%i', $example)) ?
     "Match" : "No match";

    echo "\r\n";
}

This would produce:

Trying "http://sub-domain.my-domain.com/folder/file.php?some=param" -> Match
Trying "/assets/scripts/jquery.min.js?v=1.4" -> No match
Trying "<a href="/assets/scripts/jquery.min.js?v=1.4">" -> Match
Trying "<a href="assets/scripts/jquery.min.js?v=1.4">" -> Match
Trying "http://www.domain.com welcome text\n and some other http://www.domain.com" -> Match

Update:

After reading your last update: if you want to parse HTML, use a DOM parser such as:

http://simplehtmldom.sourceforge.net/

Example:

include_once('simple_html_dom.php');

$dom = file_get_html('http://www.stackoverflow.com/');

foreach($dom->find('script') as $scriptElement)
{
    if(strlen(trim($scriptElement->src)) > 0)
    {
        // Script with URI set
        echo "<strong>Found script with URI</strong>";
        echo "<p>" . $scriptElement->src . "</p>";
    }
    else
    {
        // Script with content
        echo "<strong>Found script with content</strong>";
        echo("<p>" . nl2br(htmlspecialchars($scriptElement->innertext)) . "</p>");
    }
}

This would output something like the following (HTML stripped):

Found script with URI
http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js

Found script with URI
http://sstatic.net/js/master.min.js?v=afc76d4deac3

Found script with content    
var imagePath='http://sstatic.net/stackoverflow/img/';
var inboxUnviewedCount = -1;

...etc
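
The same split between scripts with a src and scripts with inline content can also be sketched with PHP's bundled DOM extension. Note that this swaps in DOMDocument for simple_html_dom, so it is a different parser from the one used above; the sample markup and the $found array are assumptions made for illustration.

```php
<?php
// Assumed sample markup: one external script and one inline script.
$html = '<html><head>'
      . '<script type="text/javascript" src="/assets/scripts/jquery.min.js?v=1.4"></script>'
      . '<script type="text/javascript">var scriptVar=50;</script>'
      . '</head><body></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$found = [];
foreach ($dom->getElementsByTagName('script') as $script) {
    // getAttribute() returns an empty string when the attribute is missing.
    $src = trim($script->getAttribute('src'));
    if ($src !== '') {
        $found[] = ['type' => 'uri', 'value' => $src];
    } else {
        $found[] = ['type' => 'content', 'value' => $script->textContent];
    }
}

foreach ($found as $entry) {
    echo $entry['type'], ': ', $entry['value'], PHP_EOL;
}
```
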
御弟哥哥 2024-10-13 20:29:25

This function returns true if the passed text is a URL. It is based on a regex seen here on SO.

function validate_url ($url)
{
  $regex = '/^(https?|ftp):\/\/'; //protocol
  $regex .= '(([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+'; //username
  $regex .= '(:([a-z0-9$_\.\+!\*\'\(\),;\?&=-]|%[0-9a-f]{2})+)?'; //password
  $regex .= '@)?'; //auth requires @
  $regex .= '((([a-z0-9][a-z0-9-]*[a-z0-9]\.)*'; //domain segments AND
  $regex .= '[a-z][a-z0-9-]*[a-z0-9]'; //top level domain  OR
  $regex .= '|((\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])\.){3}';
  $regex .= '(\d|[1-9]\d|1\d{2}|2[0-4][0-9]|25[0-5])'; //IP address
  $regex .= ')(:\d+)?'; //port
  $regex .= ')(((\/+([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'; //path
  $regex .= '(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'; //query string
  $regex .= '?)?)?'; //path and query string optional
  $regex .= '(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'; //fragment
  $regex .= '$/i';

  return (preg_match($regex, $url) ? true : false);
}

You can try it here: http://www.exorithm.com/algorithm/view/validate_url

EDIT: In response to a comment, this function will validate partial (relative) URLs such as /index.php or index.php:

function validate_url_fragment ($url)
{
  $regex = '/^(((\/?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'; //path
  $regex .= '(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'; //query string
  $regex .= '?)?)?'; //path and query string optional
  $regex .= '(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'; //fragment
  $regex .= '$/i';

  return (preg_match($regex, $url) ? true : false);
}

if (validate_url_fragment($url) || validate_url($url)) {
  //is url
} else {
  //not url
}

(Note that the empty string is considered valid, so you may want to handle it as a special case.)
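
Before relying on the fragment validator for the question's use case, it may be worth running it against the question's sample inputs. The sketch below copies validate_url_fragment verbatim from above; the $samples list is an assumption. Because ';' and '=' are both in the allowed character class, the script snippet from the question also passes.

```php
<?php
// Copied verbatim from the answer above.
function validate_url_fragment ($url)
{
  $regex = '/^(((\/?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)*'; //path
  $regex .= '(\?([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)'; //query string
  $regex .= '?)?)?'; //path and query string optional
  $regex .= '(#([a-z0-9$_\.\+!\*\'\(\),;:@&=-]|%[0-9a-f]{2})*)?'; //fragment
  $regex .= '$/i';

  return (preg_match($regex, $url) ? true : false);
}

// Assumed sample inputs, taken from the question plus the empty string.
$samples = [
    '/assets/scripts/jquery.min.js?v=1.4',
    'jquery.min.js',
    'scriptVar=50;',
    '',
];

foreach ($samples as $sample) {
    echo var_export($sample, true), ' -> ';
    echo validate_url_fragment($sample) ? 'accepted' : 'rejected', PHP_EOL;
}
```

All four samples are accepted, so an additional check (for example, rejecting strings that contain a semicolon) would be needed to tell script content apart from a relative URL.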

许你一世情深 2024-10-13 20:29:25

filter_var should do what you want for a single URL:

<?php
$safe_url = filter_var( $unsafe_url, FILTER_SANITIZE_URL );
?>
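
Note that FILTER_SANITIZE_URL only strips characters that are not allowed in a URL and does not report whether the input was one. FILTER_VALIDATE_URL does validate, but it only accepts absolute URLs that include a scheme. A minimal sketch of the difference, where is_absolute_url is a hypothetical helper name and the inputs are the examples from the question:

```php
<?php
// Hypothetical helper: true only for absolute URLs such as http://host/path.
function is_absolute_url(string $string): bool
{
    // filter_var() returns the filtered string on success and false on failure.
    return filter_var($string, FILTER_VALIDATE_URL) !== false;
}

$examples = [
    'http://sub-domain.my-domain.com/folder/file.php?some=param',
    '/assets/scripts/jquery.min.js?v=1.4',
    'jquery.min.js',
    "http://www.domain.com welcome text\n and some other http://www.domain.com",
    'scriptVar=50;',
];

foreach ($examples as $example) {
    // Sanitizing removes disallowed characters (such as whitespace) but never fails.
    $sanitized = filter_var($example, FILTER_SANITIZE_URL);

    echo var_export($example, true), PHP_EOL;
    echo '  sanitized:          ', $sanitized, PHP_EOL;
    echo '  valid absolute URL: ', is_absolute_url($example) ? 'yes' : 'no', PHP_EOL;
}
```

Only the first example validates, which matches the question's remark that filter_var alone does not cover relative paths such as jquery.min.js.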