清理字符串以使其 URL 和文件名安全?

发布于 2024-08-29 13:56:08 字数 906 浏览 8 评论 0 原文

我正在尝试提出一个函数,可以很好地清理某些字符串,以便它们可以安全地在 URL 中使用(如 post slug),并且也可以安全地用作文件名。例如,当有人上传文件时,我想确保删除名称中的所有危险字符。

到目前为止,我已经提出了以下函数,我希望它能解决这个问题并允许外部 UTF-8 数据。

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

是否有人有任何棘手的示例数据,我可以针对此运行 - 或者知道更好的方法来保护我们的应用程序免受不良名称的影响?

$is-filename 允许一些附加字符,例如临时 vim 文件< /em>

更新:删除了星号,因为我想不出有效的用途

I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.

So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.

/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
 // Replace all weird characters with dashes
 $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);

 // Only allow one dash separator at a time (and make string lowercase)
 return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}

Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?

$is-filename allows some additional characters like temp vim files

update: removed the star character since I could not think of a valid use

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(23

鸩远一方 2024-09-05 13:56:08

我在 Chyrp 代码中找到了这个更大的函数:

/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "—", "–", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

和此代码位于

/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}

2012 年 9 月更新的

wordpress 代码Alix Axel 在这个领域做了一些令人难以置信的工作。他的语音输入框架包括几个很棒的文本过滤器和转换。

I found this larger function in the Chyrp code:

/**
 * Function: sanitize
 * Returns a sanitized string, typically for URLs.
 *
 * Parameters:
 *     $string - The string to sanitize.
 *     $force_lowercase - Force the string to lowercase?
 *     $anal - If set to *true*, will remove all non-alphanumeric characters.
 */
function sanitize($string, $force_lowercase = true, $anal = false) {
    $strip = array("~", "`", "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "_", "=", "+", "[", "{", "]",
                   "}", "\\", "|", ";", ":", "\"", "'", "‘", "’", "“", "”", "–", "—",
                   "—", "–", ",", "<", ".", ">", "/", "?");
    $clean = trim(str_replace($strip, "", strip_tags($string)));
    $clean = preg_replace('/\s+/', "-", $clean);
    $clean = ($anal) ? preg_replace("/[^a-zA-Z0-9]/", "", $clean) : $clean ;
    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

and this one in the wordpress code

/**
 * Sanitizes a filename replacing whitespace with dashes
 *
 * Removes special characters that are illegal in filenames on certain
 * operating systems and special characters requiring special escaping
 * to manipulate at the command line. Replaces spaces and consecutive
 * dashes with a single dash. Trim period, dash and underscore from beginning
 * and end of filename.
 *
 * @since 2.1.0
 *
 * @param string $filename The filename to be sanitized
 * @return string The sanitized filename
 */
function sanitize_file_name( $filename ) {
    $filename_raw = $filename;
    $special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}");
    $special_chars = apply_filters('sanitize_file_name_chars', $special_chars, $filename_raw);
    $filename = str_replace($special_chars, '', $filename);
    $filename = preg_replace('/[\s-]+/', '-', $filename);
    $filename = trim($filename, '.-_');
    return apply_filters('sanitize_file_name', $filename, $filename_raw);
}

Update Sept 2012

Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.

2024-09-05 13:56:08

对您的解决方案的一些观察:

  1. 模式末尾的“u”意味着模式,而不是它匹配的文本将被解释为UTF-8(我想您假设是后者?)。
  2. \w 匹配下划线字符。您专门将其包含在文件中,这会导致假设您不希望它们出现在 URL 中,但在代码中,您的 URL 将被允许包含下划线。
  3. 包含“foreign UTF-8”似乎与区域设置相关。尚不清楚这是服务器还是客户端的区域设置。来自 PHP 文档:

<块引用>

“单词”字符是任何字母、数字或下划线字符,即可以是 Perl“单词”一部分的任何字符。字母和数字的定义由 PCRE 的字符表控制,并且如果发生特定于区域设置的匹配,则可能会有所不同。例如,在“fr”(法语)语言环境中,一些大于 128 的字符代码用于重音字母,并且这些字符代码与 \w 匹配。

创建 slug

您可能不应该在帖子 slug 中包含重音等字符,因为从技术上讲,它们应该进行百分比编码(根据 URL 编码规则),因此您将获得难看的 URL。

所以,如果我是你,在小写后,我会将任何“特殊”字符转换为它们的等效字符(例如 é -> e),并将非 [az] 字符替换为“-”,限制为单个“-”的运行'就像你所做的那样。这里有一个转换特殊字符的实现: https://web. archive.org/web/20130208144021/http://neo22s.com/slug

总体清理

OWASP 有其企业安全 API 的 PHP 实现,其中包括用于安全编码和解码输入和输出的方法应用。

编码器接口提供:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API

Some observations on your solution:

  1. 'u' at the end of your pattern means that the pattern, and not the text it's matching will be interpreted as UTF-8 (I presume you assumed the latter?).
  2. \w matches the underscore character. You specifically include it for files which leads to the assumption that you don't want them in URLs, but in the code you have URLs will be permitted to include an underscore.
  3. The inclusion of "foreign UTF-8" seems to be locale-dependent. It's not clear whether this is the locale of the server or client. From the PHP docs:

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

Creating the slug

You probably shouldn't include accented etc. characters in your post slug since, technically, they should be percent encoded (per URL encoding rules) so you'll have ugly looking URLs.

So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug

Sanitization in general

OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.

The Encoder interface provides:

canonicalize (string $input, [bool $strict = true])
decodeFromBase64 (string $input)
decodeFromURL (string $input)
encodeForBase64 (string $input, [bool $wrap = false])
encodeForCSS (string $input)
encodeForHTML (string $input)
encodeForHTMLAttribute (string $input)
encodeForJavaScript (string $input)
encodeForOS (Codec $codec, string $input)
encodeForSQL (Codec $codec, string $input)
encodeForURL (string $input)
encodeForVBScript (string $input)
encodeForXML (string $input)
encodeForXMLAttribute (string $input)
encodeForXPath (string $input)

https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API

枕梦 2024-09-05 13:56:08

这应该使您的文件名安全......

$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);

而更深入的解决方案是:

// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);

这假设您希望文件名中包含一个点。
如果您希望将其转换为小写,只需

$clean_name = strtolower($clean_name);

在最后一行使用即可。

This should make your filenames safe...

$string = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $string);

and a deeper solution to this is:

// Remove special accented characters - ie. sí.
$clean_name = strtr($string, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
$clean_name = strtr($clean_name, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

$clean_name = preg_replace(array('/\s/', '/\.[\.]+/', '/[^\w_\.\-]/'), array('_', '.', ''), $clean_name);

This assumes that you want a dot in the filename.
if you want it transferred to lowercase, just use

$clean_name = strtolower($clean_name);

for the last line.

殊姿 2024-09-05 13:56:08

试试这个:

function normal_chars($string)
{
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

    return trim($string, ' -');
}

Examples:

echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

根据此线程中选定的答案:PHP 中的 URL 友好用户名?

Try this:

function normal_chars($string)
{
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace(array('~[^0-9a-z]~i', '~[ -]+~'), ' ', $string);

    return trim($string, ' -');
}

Examples:

echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA

Based on the selected answer in this thread: URL Friendly Username in PHP?

早茶月光 2024-09-05 13:56:08

这不完全是一个答案,因为它没有提供任何解决方案(还!),但它太大了,无法容纳评论......


我在 Windows 上做了一些测试(关于文件名) 7 和 Ubuntu 12.04,我发现:

1. PHP 无法处理非 ASCII 文件名

虽然 Windows 和 Ubuntu 都可以处理 Unicode 文件名(甚至看起来是 RTL 文件名),但 PHP 5.3 甚至需要一些技巧来处理普通的旧版 ISO-8859-1,所以它更好仅出于安全考虑才将其保留为 ASCII。

2.文件名的长度很重要(特别是在 Windows 上)

在 Ubuntu 上,文件名的最大长度(包括扩展名)为 255(不包括路径):

/var/www/uploads/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345/

但是,在 Windows 7 (NTFS) 上文件名的最大长度取决于它的绝对路径:

(0 + 0 + 244 + 11 chars) C:\1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234\1234567.txt
(0 + 3 + 240 + 11 chars) C:\123\123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\1234567.txt
(3 + 3 + 236 + 11 chars) C:\123\456\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456\1234567.txt

维基百科 说:

NTFS 允许每个路径组件(目录或文件名)为 255
字符长。

据我所知(和测试),这是错误的。

总共(算上斜杠)所有这些示例都有 259 个字符,如果您去掉 C:\给出 256 个字符(不是 255?!)。使用资源管理器创建的目录,您会注意到它限制自己使用目录名称的所有可用空间。这样做的原因是允许使用 8.3 文件命名约定 创建文件。其他分区也会发生同样的情况。

文件当然不需要保留 8.3 长度要求:

(255 chars) E:\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901.txt

如果父目录的绝对路径超过 242 个字符,则不能再创建任何子目录,因为 256 = 242 + 1 + \ + 8+. + 3。使用 Windows 资源管理器,如果父目录超过 233 个字符(取决于系统区域设置),则无法创建另一个目录,因为 256 = 233 + 10 + \ + 8 + 。 + 3;这里的10是字符串Newfolder的长度。

如果您想确保文件系统之间的互操作性,Windows 文件系统会带来一个棘手的问题。

3.谨防保留字符和关键字

除了删除非 ASCII、不可打印和控制字符之外,您还需要重新(放置/移动):

"*/:<>?\|

仅仅删除这些字符可能并不最好的主意,因为文件名可能会失去一些含义。我认为,至少,这些字符的多次出现应该替换为单个下划线(_),或者也许更具代表性的东西(这只是一个想法):

  • "*? -> _
  • /\| -> -
  • : -> [ ]-[ ]
  • < -> ( >
  • -> )

还有特殊关键字避免(如NUL),尽管我不确定如何克服这个问题,也许带有随机名称后备的黑名单是解决这个问题的好方法

4 。大小写敏感

这应该是不言而喻的,但如果您想确保文件在不同操作系统之间的唯一性,您应该将文件名转换为规范化的大小写,这样 my_file.txt Linux 上的 >My_File.txt 在 Windows 上不会成为相同的 my_file.txt 文件。

5.确保它是唯一的

如果文件名已经存在,则唯一标识符应该是附加到它的基本文件名。

常见的唯一标识符包括 UNIX 时间戳、文件内容的摘要或随机字符串。

6.隐藏文件

仅仅因为它可以命名并不意味着它应该...

点通常在文件名中列入白名单,但在 Linux 中,隐藏文件由前导点表示。

7.其他注意事项

如果您必须删除文件名中的某些字符,则扩展名通常比文件的基本名称更重要。允许文件扩展名的最大字符数 (8-16)从基本名称中删除字符。还需要注意的是,万一出现多个长扩展名(例如 _.graphmlz.tag.gz - _.graphmlz.tag)时,仅<在这种情况下,code>_ 应被视为文件基本名称。

8.资源

Calibre 相当不错地处理文件名修改:

有关文件名修改的维基百科页面和链接的使用 Samba 的章节


例如,如果您尝试创建一个违反任何规则 1/2/3 的文件,您将收到一个非常有用的错误:

Warning: touch(): Unable to create file ... because No error in ... on line ...

This isn't exactly an answer as it doesn't provide any solutions (yet!), but it's too big to fit on a comment...


I did some testing (regarding file names) on Windows 7 and Ubuntu 12.04 and what I found out was that:

1. PHP Can't Handle non-ASCII Filenames

Although both Windows and Ubuntu can handle Unicode filenames (even RTL ones as it seems) PHP 5.3 requires hacks to deal even with the plain old ISO-8859-1, so it's better to keep it ASCII only for safety.

2. The Lenght of the Filename Matters (Specially on Windows)

On Ubuntu, the maximum length a filename can have (incluinding extension) is 255 (excluding path):

/var/www/uploads/123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345/

However, on Windows 7 (NTFS) the maximum lenght a filename can have depends on it's absolute path:

(0 + 0 + 244 + 11 chars) C:\1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234\1234567.txt
(0 + 3 + 240 + 11 chars) C:\123\123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\1234567.txt
(3 + 3 + 236 + 11 chars) C:\123\456\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456\1234567.txt

Wikipedia says that:

NTFS allows each path component (directory or filename) to be 255
characters long.

To the best of my knowledge (and testing), this is wrong.

In total (counting slashes) all these examples have 259 chars, if you strip the C:\ that gives 256 characters (not 255?!). The directories where created using the Explorer and you'll notice that it restrains itself from using all the available space for the directory name. The reason for this is to allow the creation of files using the 8.3 file naming convention. The same thing happens for other partitions.

Files don't need to reserve the 8.3 lenght requirements of course:

(255 chars) E:\12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901.txt

You can't create any more sub-directories if the absolute path of the parent directory has more than 242 characters, because 256 = 242 + 1 + \ + 8 + . + 3. Using Windows Explorer, you can't create another directory if the parent directory has more than 233 characters (depending on the system locale), because 256 = 233 + 10 + \ + 8 + . + 3; the 10 here is the length of the string New folder.

Windows file system poses a nasty problem if you want to assure inter-operability between file systems.

3. Beware of Reserved Characters and Keywords

Aside from removing non-ASCII, non-printable and control characters, you also need to re(place/move):

"*/:<>?\|

Just removing these characters might not be the best idea because the filename might lose some of it's meaning. I think that, at the very least, multiple occurences of these characters should be replaced by a single underscore (_), or perhaps something more representative (this is just an idea):

  • "*? -> _
  • /\| -> -
  • : -> [ ]-[ ]
  • < -> (
  • > -> )

There are also special keywords that should be avoided (like NUL), although I'm not sure how to overcome that. Perhaps a black list with a random name fallback would be a good approach to solve it.

4. Case Sensitiveness

This should go without saying, but if you want so ensure file uniqueness across different operating systems you should transform file names to a normalized case, that way my_file.txt and My_File.txt on Linux won't both become the same my_file.txt file on Windows.

5. Make Sure It's Unique

If the file name already exists, a unique identifier should be appended to it's base file name.

Common unique identifiers include the UNIX timestamp, a digest of the file contents or a random string.

6. Hidden Files

Just because it can be named doesn't mean it should...

Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot.

7. Other Considerations

If you have to strip some chars of the file name, the extension is usually more important than the base name of the file. Allowing a considerable maximum number of characters for the file extension (8-16) one should strip the characters from the base name. It's also important to note that in the unlikely event of having a more than one long extension - such as _.graphmlz.tag.gz - _.graphmlz.tag only _ should be considered as the file base name in this case.

8. Resources

Calibre handles file name mangling pretty decently:

Wikipedia page on file name mangling and linked chapter from Using Samba.


If for instance, you try to create a file that violates any of the rules 1/2/3, you'll get a very useful error:

Warning: touch(): Unable to create file ... because No error in ... on line ...
浅语花开 2024-09-05 13:56:08

我一直认为 Kohana 做得很好

public static function title($title, $separator = '-', $ascii_only = FALSE)
{
if ($ascii_only === TRUE)
{
// Transliterate non-ASCII characters
$title = UTF8::transliterate_to_ascii($title);

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
}
else
{
// Remove all characters that are not the separator, letters, numbers, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
}

// Replace all separator characters and whitespace by a single separator
$title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);

// Trim separators from the beginning and end
return trim($title, $separator);
}

方便的 UTF8::transliterate_to_ascii() 会将 ñ => 之类的内容转换为 ñ => 。名词

当然,您可以用 mb_* 函数替换其他 UTF8::* 内容。

I've always thought Kohana did a pretty good job of it.

public static function title($title, $separator = '-', $ascii_only = FALSE)
{
if ($ascii_only === TRUE)
{
// Transliterate non-ASCII characters
$title = UTF8::transliterate_to_ascii($title);

// Remove all characters that are not the separator, a-z, 0-9, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'a-z0-9\s]+!', '', strtolower($title));
}
else
{
// Remove all characters that are not the separator, letters, numbers, or whitespace
$title = preg_replace('![^'.preg_quote($separator).'\pL\pN\s]+!u', '', UTF8::strtolower($title));
}

// Replace all separator characters and whitespace by a single separator
$title = preg_replace('!['.preg_quote($separator).'\s]+!u', $separator, $title);

// Trim separators from the beginning and end
return trim($title, $separator);
}

The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n.

Of course, you could replace the other UTF8::* stuff with mb_* functions.

七婞 2024-09-05 13:56:08

我从另一个来源改编而来,并添加了一些额外的内容,也许有点矫枉过正

/**
 * Convert a string into a url safe address.
 *
 * @param string $unformatted
 * @return string
 */
public function formatURL($unformatted) {

    $url = strtolower(trim($unformatted));

    //replace accent characters, forien languages
    $search = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'IJ', 'ij', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ʼn', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ'); 
    $replace = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o'); 
    $url = str_replace($search, $replace, $url);

    //replace common characters
    $search = array('&', '£', '
); 
    $replace = array('and', 'pounds', 'dollars'); 
    $url= str_replace($search, $replace, $url);

    // remove - for spaces and union characters
    $find = array(' ', '&', '\r\n', '\n', '+', ',', '//');
    $url = str_replace($find, '-', $url);

    //delete and replace rest of special chars
    $find = array('/[^a-z0-9\-<>]/', '/[\-]+/', '/<[^>]*>/');
    $replace = array('', '-', '');
    $uri = preg_replace($find, $replace, $url);

    return $uri;
}

I have adapted from another source and added a couple extra, maybe a little overkill

/**
 * Convert a string into a url safe address.
 *
 * @param string $unformatted
 * @return string
 */
public function formatURL($unformatted) {

    $url = strtolower(trim($unformatted));

    //replace accent characters, forien languages
    $search = array('À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ç', 'È', 'É', 'Ê', 'Ë', 'Ì', 'Í', 'Î', 'Ï', 'Ð', 'Ñ', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ù', 'Ú', 'Û', 'Ü', 'Ý', 'ß', 'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ç', 'è', 'é', 'ê', 'ë', 'ì', 'í', 'î', 'ï', 'ñ', 'ò', 'ó', 'ô', 'õ', 'ö', 'ø', 'ù', 'ú', 'û', 'ü', 'ý', 'ÿ', 'Ā', 'ā', 'Ă', 'ă', 'Ą', 'ą', 'Ć', 'ć', 'Ĉ', 'ĉ', 'Ċ', 'ċ', 'Č', 'č', 'Ď', 'ď', 'Đ', 'đ', 'Ē', 'ē', 'Ĕ', 'ĕ', 'Ė', 'ė', 'Ę', 'ę', 'Ě', 'ě', 'Ĝ', 'ĝ', 'Ğ', 'ğ', 'Ġ', 'ġ', 'Ģ', 'ģ', 'Ĥ', 'ĥ', 'Ħ', 'ħ', 'Ĩ', 'ĩ', 'Ī', 'ī', 'Ĭ', 'ĭ', 'Į', 'į', 'İ', 'ı', 'IJ', 'ij', 'Ĵ', 'ĵ', 'Ķ', 'ķ', 'Ĺ', 'ĺ', 'Ļ', 'ļ', 'Ľ', 'ľ', 'Ŀ', 'ŀ', 'Ł', 'ł', 'Ń', 'ń', 'Ņ', 'ņ', 'Ň', 'ň', 'ʼn', 'Ō', 'ō', 'Ŏ', 'ŏ', 'Ő', 'ő', 'Œ', 'œ', 'Ŕ', 'ŕ', 'Ŗ', 'ŗ', 'Ř', 'ř', 'Ś', 'ś', 'Ŝ', 'ŝ', 'Ş', 'ş', 'Š', 'š', 'Ţ', 'ţ', 'Ť', 'ť', 'Ŧ', 'ŧ', 'Ũ', 'ũ', 'Ū', 'ū', 'Ŭ', 'ŭ', 'Ů', 'ů', 'Ű', 'ű', 'Ų', 'ų', 'Ŵ', 'ŵ', 'Ŷ', 'ŷ', 'Ÿ', 'Ź', 'ź', 'Ż', 'ż', 'Ž', 'ž', 'ſ', 'ƒ', 'Ơ', 'ơ', 'Ư', 'ư', 'Ǎ', 'ǎ', 'Ǐ', 'ǐ', 'Ǒ', 'ǒ', 'Ǔ', 'ǔ', 'Ǖ', 'ǖ', 'Ǘ', 'ǘ', 'Ǚ', 'ǚ', 'Ǜ', 'ǜ', 'Ǻ', 'ǻ', 'Ǽ', 'ǽ', 'Ǿ', 'ǿ'); 
    $replace = array('A', 'A', 'A', 'A', 'A', 'A', 'AE', 'C', 'E', 'E', 'E', 'E', 'I', 'I', 'I', 'I', 'D', 'N', 'O', 'O', 'O', 'O', 'O', 'O', 'U', 'U', 'U', 'U', 'Y', 's', 'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'c', 'e', 'e', 'e', 'e', 'i', 'i', 'i', 'i', 'n', 'o', 'o', 'o', 'o', 'o', 'o', 'u', 'u', 'u', 'u', 'y', 'y', 'A', 'a', 'A', 'a', 'A', 'a', 'C', 'c', 'C', 'c', 'C', 'c', 'C', 'c', 'D', 'd', 'D', 'd', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'E', 'e', 'G', 'g', 'G', 'g', 'G', 'g', 'G', 'g', 'H', 'h', 'H', 'h', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'I', 'i', 'IJ', 'ij', 'J', 'j', 'K', 'k', 'L', 'l', 'L', 'l', 'L', 'l', 'L', 'l', 'l', 'l', 'N', 'n', 'N', 'n', 'N', 'n', 'n', 'O', 'o', 'O', 'o', 'O', 'o', 'OE', 'oe', 'R', 'r', 'R', 'r', 'R', 'r', 'S', 's', 'S', 's', 'S', 's', 'S', 's', 'T', 't', 'T', 't', 'T', 't', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'W', 'w', 'Y', 'y', 'Y', 'Z', 'z', 'Z', 'z', 'Z', 'z', 's', 'f', 'O', 'o', 'U', 'u', 'A', 'a', 'I', 'i', 'O', 'o', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'U', 'u', 'A', 'a', 'AE', 'ae', 'O', 'o'); 
    $url = str_replace($search, $replace, $url);

    //replace common characters
    $search = array('&', '£', '
); 
    $replace = array('and', 'pounds', 'dollars'); 
    $url= str_replace($search, $replace, $url);

    // remove - for spaces and union characters
    $find = array(' ', '&', '\r\n', '\n', '+', ',', '//');
    $url = str_replace($find, '-', $url);

    //delete and replace rest of special chars
    $find = array('/[^a-z0-9\-<>]/', '/[\-]+/', '/<[^>]*>/');
    $replace = array('', '-', '');
    $uri = preg_replace($find, $replace, $url);

    return $uri;
}
后来的我们 2024-09-05 13:56:08

就文件上传而言,防止用户控制文件名是最安全的。正如已经暗示的那样,将规范化的文件名与随机选择的唯一名称一起存储在数据库中,您将使用该名称作为实际文件名。

使用 OWASP ESAPI,可以这样生成这些名称:

$userFilename   = ESAPI::getEncoder()->canonicalize($input_string);
$safeFilename   = ESAPI::getRandomizer()->getRandomFilename();

您可以将时间戳附加到 $safeFilename,以帮助确保随机生成的文件名是唯一的,甚至无需检查现有文件。

在 URL 编码方面,再次使用 ESAPI:

$safeForURL     = ESAPI::getEncoder()->encodeForURL($input_string);

此方法在对字符串进行编码之前执行规范化,并将处理所有字符编码。

In terms of file uploads, you would be safest to prevent the user from controlling the file name. As has already been hinted at, store the canonicalised filename in a database along with a randomly chosen and unique name which you'll use as the actual filename.

Using OWASP ESAPI, these names could be generated thus:

$userFilename   = ESAPI::getEncoder()->canonicalize($input_string);
$safeFilename   = ESAPI::getRandomizer()->getRandomFilename();

You could append a timestamp to the $safeFilename to help ensure that the randomly generated filename is unique without even checking for an existing file.

In terms of encoding for URL, and again using ESAPI:

$safeForURL     = ESAPI::getEncoder()->encodeForURL($input_string);

This method performs canonicalisation before encoding the string and will handle all character encodings.

水波映月 2024-09-05 13:56:08

我推荐* URLify for PHP(Github 上有 480+ 颗星) - “来自 URLify.js 的 PHP 端口Django 项目。音译非 ASCII 字符以在 URL 中使用”。

基本用法:

为 URL 生成 slugs:

<?php

echo URLify::filter (' J\'étudie le français ');
// "jetudie-le-francais"

echo URLify::filter ('Lo siento, no hablo español.');
// "lo-siento-no-hablo-espanol"

?>

为文件名生成 slugs:

<?php

echo URLify::filter ('фото.jpg', 60, "", true);
// "foto.jpg"

?>

*其他建议都不符合我的标准:

  • 应该可以通过 Composer 安装
  • 不应该依赖 iconv因为它在不同系统上的行为不同
  • 应该可扩展以允许覆盖和自定义字符替换
  • 流行(例如 Github 上的许多星星)
  • 有测试

作为奖励,URLify 还会删除某些单词并删除所有未音译的字符。

这是一个使用 URLify 正确音译大量外来字符的测试用例: https://gist.github.com /motin/a65e6c1cc303e46900d10894bf2da87f

I recommend* URLify for PHP (480+ stars on Github) - "the PHP port of URLify.js from the Django project. Transliterates non-ascii characters for use in URLs".

Basic usage:

To generate slugs for URLs:

<?php

echo URLify::filter (' J\'étudie le français ');
// "jetudie-le-francais"

echo URLify::filter ('Lo siento, no hablo español.');
// "lo-siento-no-hablo-espanol"

?>

To generate slugs for file names:

<?php

echo URLify::filter ('фото.jpg', 60, "", true);
// "foto.jpg"

?>

*None of the other suggestions matched my criteria:

  • Should be installable via composer
  • Should not depend on iconv since it behaves differently on different systems
  • Should be extendable to allow overrides and custom character replacements
  • Popular (for instance many stars on Github)
  • Has tests

As a bonus, URLify also removes certain words and strips away all characters not transliterated.

Here is a test case with tons of foreign characters being transliterated properly using URLify: https://gist.github.com/motin/a65e6c1cc303e46900d10894bf2da87f

夏至、离别 2024-09-05 13:56:08

这是来自 JFile::makeSafe 的 Joomla 3.3.2 版本($文件)

public static function makeSafe($file)
{
    // Remove any trailing dots, as those aren't ever valid file names.
    $file = rtrim($file, '.');

    $regex = array('#(\.){2,}#', '#[^A-Za-z0-9\.\_\- ]#', '#^\.#');

    return trim(preg_replace($regex, '', $file));
}

and this is Joomla 3.3.2 version from JFile::makeSafe($file)

public static function makeSafe($file)
{
    // Remove any trailing dots, as those aren't ever valid file names.
    $file = rtrim($file, '.');

    $regex = array('#(\.){2,}#', '#[^A-Za-z0-9\.\_\- ]#', '#^\.#');

    return trim(preg_replace($regex, '', $file));
}
冰葑 2024-09-05 13:56:08

根据您使用它的方式,您可能需要添加长度限制以防止缓冲区溢出。

Depending on how you will use it, you might want to add a length limit to protect against buffer overflows.

任谁 2024-09-05 13:56:08

我认为拥有要删除的字符列表并不安全。我宁愿使用以下内容:

对于文件名:使用内部 ID 或文件内容的哈希值。将文档名称保存在数据库中。这样您可以保留原始文件名并仍然可以找到该文件。

对于 url 参数:使用 urlencode() 对任何特殊字符进行编码。

I don't think having a list of chars to remove is safe. I would rather use the following:

For filenames: Use an internal ID or a hash of the filecontent. Save the document name in a database. This way you can keep the original filename and still find the file.

For url parameters: Use urlencode() to encode any special characters.

╭ゆ眷念 2024-09-05 13:56:08

这是保护上传文件名的好方法:

$file_name = trim(basename(stripslashes($name)), ".\x00..\x20");

This is a nice way to secure an upload filename:

$file_name = trim(basename(stripslashes($name)), ".\x00..\x20");
子栖 2024-09-05 13:56:08

这是 CodeIgniter 的实现。

/**
 * Sanitize Filename
 *
 * @param   string  $str        Input file name
 * @param   bool    $relative_path  Whether to preserve paths
 * @return  string
 */
public function sanitize_filename($str, $relative_path = FALSE)
{
    $bad = array(
        '../', '<!--', '-->', '<', '>',
        "'", '"', '&', '

以及 remove_invisible_characters 依赖项。

function remove_invisible_characters($str, $url_encoded = TRUE)
{
    $non_displayables = array();

    // every control character except newline (dec 10),
    // carriage return (dec 13) and horizontal tab (dec 09)
    if ($url_encoded)
    {
        $non_displayables[] = '/%0[0-8bcef]/';  // url encoded 00-08, 11, 12, 14, 15
        $non_displayables[] = '/%1[0-9a-f]/';   // url encoded 16-31
    }

    $non_displayables[] = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S';   // 00-08, 11, 12, 14-31, 127

    do
    {
        $str = preg_replace($non_displayables, '', $str, -1, $count);
    }
    while ($count);

    return $str;
}
, '#', '{', '}', '[', ']', '=', ';', '?', '%20', '%22', '%3c', // < '%253c', // < '%3e', // > '%0e', // > '%28', // ( '%29', // ) '%2528', // ( '%26', // & '%24', // $ '%3f', // ? '%3b', // ; '%3d' // = ); if ( ! $relative_path) { $bad[] = './'; $bad[] = '/'; } $str = remove_invisible_characters($str, FALSE); return stripslashes(str_replace($bad, '', $str)); }

以及 remove_invisible_characters 依赖项。


Here's CodeIgniter's implementation.

/**
 * Sanitize Filename
 *
 * @param   string  $str        Input file name
 * @param   bool    $relative_path  Whether to preserve paths
 * @return  string
 */
public function sanitize_filename($str, $relative_path = FALSE)
{
    $bad = array(
        '../', '<!--', '-->', '<', '>',
        "'", '"', '&', '

And the remove_invisible_characters dependency.

function remove_invisible_characters($str, $url_encoded = TRUE)
{
    $non_displayables = array();

    // every control character except newline (dec 10),
    // carriage return (dec 13) and horizontal tab (dec 09)
    if ($url_encoded)
    {
        $non_displayables[] = '/%0[0-8bcef]/';  // url encoded 00-08, 11, 12, 14, 15
        $non_displayables[] = '/%1[0-9a-f]/';   // url encoded 16-31
    }

    $non_displayables[] = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]+/S';   // 00-08, 11, 12, 14-31, 127

    do
    {
        $str = preg_replace($non_displayables, '', $str, -1, $count);
    }
    while ($count);

    return $str;
}
, '#', '{', '}', '[', ']', '=', ';', '?', '%20', '%22', '%3c', // < '%253c', // < '%3e', // > '%0e', // > '%28', // ( '%29', // ) '%2528', // ( '%26', // & '%24', // $ '%3f', // ? '%3b', // ; '%3d' // = ); if ( ! $relative_path) { $bad[] = './'; $bad[] = '/'; } $str = remove_invisible_characters($str, FALSE); return stripslashes(str_replace($bad, '', $str)); }

And the remove_invisible_characters dependency.



    
-柠檬树下少年和吉他 2024-09-05 13:56:08

我的条目标题包含各种奇怪的拉丁字符以及一些 HTML 标签,我需要将它们转换为有用的破折号分隔的文件名格式。我将 @SoLoGHoST 的答案与 @Xeoncross 的答案中的几个项目结合起来,并进行了一些定制。

    function sanitize($string,$force_lowercase=true) {
    //Clean up titles for filenames
    $clean = strip_tags($string);
    $clean = strtr($clean, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
    $clean = strtr($clean, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u','—' => '-'));
    $clean = str_replace("--", "-", preg_replace("/[^a-z0-9-]/i", "", preg_replace(array('/\s/', '/[^\w-\.\-]/'), array('-', ''), $clean)));

    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

我需要手动将长破折号字符 (—) 添加到翻译数组中。可能还有其他的,但到目前为止我的文件名看起来不错。

所以:

第 1 部分:我爸爸的“Žurburts”?——他们(不是)最好的!

变成:

第 1 部分-my-dads-zurburts-theyre-not-the-best

我只是将“.html”添加到返回的字符串中。

I have entry titles with all kinds of weird latin characters as well as some HTML tags that I needed to translate into a useful dash-delimited filename format. I combined @SoLoGHoST's answer with a couple of items from @Xeoncross's answer and customized a bit.

    function sanitize($string,$force_lowercase=true) {
    //Clean up titles for filenames
    $clean = strip_tags($string);
    $clean = strtr($clean, array('Š' => 'S','Ž' => 'Z','š' => 's','ž' => 'z','Ÿ' => 'Y','À' => 'A','Á' => 'A','Â' => 'A','Ã' => 'A','Ä' => 'A','Å' => 'A','Ç' => 'C','È' => 'E','É' => 'E','Ê' => 'E','Ë' => 'E','Ì' => 'I','Í' => 'I','Î' => 'I','Ï' => 'I','Ñ' => 'N','Ò' => 'O','Ó' => 'O','Ô' => 'O','Õ' => 'O','Ö' => 'O','Ø' => 'O','Ù' => 'U','Ú' => 'U','Û' => 'U','Ü' => 'U','Ý' => 'Y','à' => 'a','á' => 'a','â' => 'a','ã' => 'a','ä' => 'a','å' => 'a','ç' => 'c','è' => 'e','é' => 'e','ê' => 'e','ë' => 'e','ì' => 'i','í' => 'i','î' => 'i','ï' => 'i','ñ' => 'n','ò' => 'o','ó' => 'o','ô' => 'o','õ' => 'o','ö' => 'o','ø' => 'o','ù' => 'u','ú' => 'u','û' => 'u','ü' => 'u','ý' => 'y','ÿ' => 'y'));
    $clean = strtr($clean, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u','—' => '-'));
    $clean = str_replace("--", "-", preg_replace("/[^a-z0-9-]/i", "", preg_replace(array('/\s/', '/[^\w-\.\-]/'), array('-', ''), $clean)));

    return ($force_lowercase) ?
        (function_exists('mb_strtolower')) ?
            mb_strtolower($clean, 'UTF-8') :
            strtolower($clean) :
        $clean;
}

I needed to manually add the em dash character (—) to the translation array. There may be others but so far my file names are looking good.

So:

Part 1: My dad’s “Žurburts”?—they’re (not) the best!

becomes:

part-1-my-dads-zurburts-theyre-not-the-best

I just add ".html" to the returned string.

心房的律动 2024-09-05 13:56:08

为什么不简单地使用 php 的 urlencode 呢?它将“危险”字符替换为 url 的十六进制表示形式(即 %20 表示空格)

why not simply use php's urlencode? it replaces "dangerous" characters with their hex representation for urls (i.e. %20 for a space)

囍笑 2024-09-05 13:56:08

已经为这个问题提供了几个解决方案,但我已经阅读并测试了这里的大部分代码,我最终得到了这个解决方案,它是我在这里学到的内容的混合:

该函数

该函数被捆绑在 Symfony2 中 捆绑包,但可以将其提取出来用作纯 PHP,它只依赖于必须启用的 iconv 函数:

Filesystem. php:

<?php

namespace COil\Bundle\COilCoreBundle\Component\HttpKernel\Util;

use Symfony\Component\HttpKernel\Util\Filesystem as BaseFilesystem;

/**
 * Extends the Symfony filesystem object.
 */
class Filesystem extends BaseFilesystem
{
    /**
     * Make a filename safe to use in any function. (Accents, spaces, special chars...)
     * The iconv function must be activated.
     *
     * @param string  $fileName       The filename to sanitize (with or without extension)
     * @param string  $defaultIfEmpty The default string returned for a non valid filename (only special chars or separators)
     * @param string  $separator      The default separator
     * @param boolean $lowerCase      Tells if the string must converted to lower case
     *
     * @author COil <https://github.com/COil>
     * @see    http://stackoverflow.com/questions/2668854/sanitizing-strings-to-make-them-url-and-filename-safe
     *
     * @return string
     */
    public function sanitizeFilename($fileName, $defaultIfEmpty = 'default', $separator = '_', $lowerCase = true)
    {
    // Gather file informations and store its extension
    $fileInfos = pathinfo($fileName);
    $fileExt   = array_key_exists('extension', $fileInfos) ? '.'. strtolower($fileInfos['extension']) : '';

    // Removes accents
    $fileName = @iconv('UTF-8', 'us-ascii//TRANSLIT', $fileInfos['filename']);

    // Removes all characters that are not separators, letters, numbers, dots or whitespaces
    $fileName = preg_replace("/[^ a-zA-Z". preg_quote($separator). "\d\.\s]/", '', $lowerCase ? strtolower($fileName) : $fileName);

    // Replaces all successive separators into a single one
    $fileName = preg_replace('!['. preg_quote($separator).'\s]+!u', $separator, $fileName);

    // Trim beginning and ending seperators
    $fileName = trim($fileName, $separator);

    // If empty use the default string
    if (empty($fileName)) {
        $fileName = $defaultIfEmpty;
    }

    return $fileName. $fileExt;
    }
}

单元测试

有趣的是,我创建了 PHPUnit 测试,首先是测试边缘情况,这样您就可以检查它是否满足您的需求:
(如果您发现错误,请随时添加测试用例)

FilesystemTest.php

<?php

namespace COil\Bundle\COilCoreBundle\Tests\Unit\Helper;

use COil\Bundle\COilCoreBundle\Component\HttpKernel\Util\Filesystem;

/**
 * Test the Filesystem custom class.
 */
class FilesystemTest extends \PHPUnit_Framework_TestCase
{
    /**
     * test sanitizeFilename()
     */
    public function testFilesystem()
    {
    $fs = new Filesystem();

    $this->assertEquals('logo_orange.gif', $fs->sanitizeFilename('--logö  _  __   ___   ora@@ñ--~gé--.gif'), '::sanitizeFilename() handles complex filename with specials chars');
    $this->assertEquals('coilstack', $fs->sanitizeFilename('cOiLsTaCk'), '::sanitizeFilename() converts all characters to lower case');
    $this->assertEquals('cOiLsTaCk', $fs->sanitizeFilename('cOiLsTaCk', 'default', '_', false), '::sanitizeFilename() lower case can be desactivated, passing false as the 4th argument');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil stack'), '::sanitizeFilename() convert a white space to a separator');
    $this->assertEquals('coil-stack', $fs->sanitizeFilename('coil stack', 'default', '-'), '::sanitizeFilename() can use a different separator as the 3rd argument');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil          stack'), '::sanitizeFilename() removes successive white spaces to a single separator');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('       coil stack'), '::sanitizeFilename() removes spaces at the beginning of the string');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil   stack         '), '::sanitizeFilename() removes spaces at the end of the string');
    $this->assertEquals('coilstack', $fs->sanitizeFilename('coil,,,,,,stack'), '::sanitizeFilename() removes non-ASCII characters');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil_stack  '), '::sanitizeFilename() keeps separators');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename(' coil________stack'), '::sanitizeFilename() converts successive separators into a single one');
    $this->assertEquals('coil_stack.gif', $fs->sanitizeFilename('cOil Stack.GiF'), '::sanitizeFilename() lower case filename and extension');
    $this->assertEquals('copy_of_coil.stack.exe', $fs->sanitizeFilename('Copy of coil.stack.exe'), '::sanitizeFilename() keeps dots before the extension');
    $this->assertEquals('default.doc', $fs->sanitizeFilename('____________.doc'), '::sanitizeFilename() returns a default file name if filename only contains special chars');
    $this->assertEquals('default.docx', $fs->sanitizeFilename('     ___ -  --_     __%%%%__¨¨¨***____      .docx'), '::sanitizeFilename() returns a default file name if filename only contains special chars');
    $this->assertEquals('logo_edition_1314352521.jpg', $fs->sanitizeFilename('logo_edition_1314352521.jpg'), '::sanitizeFilename() returns the filename untouched if it does not need to be modified');
    $userId = rand(1, 10);
    $this->assertEquals('user_doc_'. $userId. '.doc', $fs->sanitizeFilename('亐亐亐亐亐.doc', 'user_doc_'. $userId), '::sanitizeFilename() returns the default string (the 2nd argument) if it can\'t be sanitized');
    }
}

测试结果:(在使用 PHP 5.3.2 和 Ubuntu 上检查MacOsX 与 PHP 5.3.17:

All tests pass:

phpunit -c app/ src/COil/Bundle/COilCoreBundle/Tests/Unit/Helper/FilesystemTest.php
PHPUnit 3.6.10 by Sebastian Bergmann.

Configuration read from /var/www/strangebuzz.com/app/phpunit.xml.dist

.

Time: 0 seconds, Memory: 5.75Mb

OK (1 test, 17 assertions)

There are already several solutions provided for this question but I have read and tested most of the code here and I ended up with this solution which is a mix of what I learned here:

The function

The function is bundled here in a Symfony2 bundle but it can be extracted to be used as plain PHP, it only has a dependency with the iconv function that must be enabled:

Filesystem.php:

<?php

namespace COil\Bundle\COilCoreBundle\Component\HttpKernel\Util;

use Symfony\Component\HttpKernel\Util\Filesystem as BaseFilesystem;

/**
 * Extends the Symfony filesystem object.
 */
class Filesystem extends BaseFilesystem
{
    /**
     * Make a filename safe to use in any function. (Accents, spaces, special chars...)
     * The iconv function must be activated.
     *
     * @param string  $fileName       The filename to sanitize (with or without extension)
     * @param string  $defaultIfEmpty The default string returned for a non valid filename (only special chars or separators)
     * @param string  $separator      The default separator
     * @param boolean $lowerCase      Tells if the string must converted to lower case
     *
     * @author COil <https://github.com/COil>
     * @see    http://stackoverflow.com/questions/2668854/sanitizing-strings-to-make-them-url-and-filename-safe
     *
     * @return string
     */
    public function sanitizeFilename($fileName, $defaultIfEmpty = 'default', $separator = '_', $lowerCase = true)
    {
    // Gather file informations and store its extension
    $fileInfos = pathinfo($fileName);
    $fileExt   = array_key_exists('extension', $fileInfos) ? '.'. strtolower($fileInfos['extension']) : '';

    // Removes accents
    $fileName = @iconv('UTF-8', 'us-ascii//TRANSLIT', $fileInfos['filename']);

    // Removes all characters that are not separators, letters, numbers, dots or whitespaces
    $fileName = preg_replace("/[^ a-zA-Z". preg_quote($separator). "\d\.\s]/", '', $lowerCase ? strtolower($fileName) : $fileName);

    // Replaces all successive separators into a single one
    $fileName = preg_replace('!['. preg_quote($separator).'\s]+!u', $separator, $fileName);

    // Trim beginning and ending seperators
    $fileName = trim($fileName, $separator);

    // If empty use the default string
    if (empty($fileName)) {
        $fileName = $defaultIfEmpty;
    }

    return $fileName. $fileExt;
    }
}

The unit tests

What is interesting is that I have created PHPUnit tests, first to test edge cases and so you can check if it fits your needs:
(If you find a bug, feel free to add a test case)

FilesystemTest.php:

<?php

namespace COil\Bundle\COilCoreBundle\Tests\Unit\Helper;

use COil\Bundle\COilCoreBundle\Component\HttpKernel\Util\Filesystem;

/**
 * Test the Filesystem custom class.
 */
class FilesystemTest extends \PHPUnit_Framework_TestCase
{
    /**
     * test sanitizeFilename()
     */
    public function testFilesystem()
    {
    $fs = new Filesystem();

    $this->assertEquals('logo_orange.gif', $fs->sanitizeFilename('--logö  _  __   ___   ora@@ñ--~gé--.gif'), '::sanitizeFilename() handles complex filename with specials chars');
    $this->assertEquals('coilstack', $fs->sanitizeFilename('cOiLsTaCk'), '::sanitizeFilename() converts all characters to lower case');
    $this->assertEquals('cOiLsTaCk', $fs->sanitizeFilename('cOiLsTaCk', 'default', '_', false), '::sanitizeFilename() lower case can be desactivated, passing false as the 4th argument');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil stack'), '::sanitizeFilename() convert a white space to a separator');
    $this->assertEquals('coil-stack', $fs->sanitizeFilename('coil stack', 'default', '-'), '::sanitizeFilename() can use a different separator as the 3rd argument');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil          stack'), '::sanitizeFilename() removes successive white spaces to a single separator');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('       coil stack'), '::sanitizeFilename() removes spaces at the beginning of the string');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil   stack         '), '::sanitizeFilename() removes spaces at the end of the string');
    $this->assertEquals('coilstack', $fs->sanitizeFilename('coil,,,,,,stack'), '::sanitizeFilename() removes non-ASCII characters');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename('coil_stack  '), '::sanitizeFilename() keeps separators');
    $this->assertEquals('coil_stack', $fs->sanitizeFilename(' coil________stack'), '::sanitizeFilename() converts successive separators into a single one');
    $this->assertEquals('coil_stack.gif', $fs->sanitizeFilename('cOil Stack.GiF'), '::sanitizeFilename() lower case filename and extension');
    $this->assertEquals('copy_of_coil.stack.exe', $fs->sanitizeFilename('Copy of coil.stack.exe'), '::sanitizeFilename() keeps dots before the extension');
    $this->assertEquals('default.doc', $fs->sanitizeFilename('____________.doc'), '::sanitizeFilename() returns a default file name if filename only contains special chars');
    $this->assertEquals('default.docx', $fs->sanitizeFilename('     ___ -  --_     __%%%%__¨¨¨***____      .docx'), '::sanitizeFilename() returns a default file name if filename only contains special chars');
    $this->assertEquals('logo_edition_1314352521.jpg', $fs->sanitizeFilename('logo_edition_1314352521.jpg'), '::sanitizeFilename() returns the filename untouched if it does not need to be modified');
    $userId = rand(1, 10);
    $this->assertEquals('user_doc_'. $userId. '.doc', $fs->sanitizeFilename('亐亐亐亐亐.doc', 'user_doc_'. $userId), '::sanitizeFilename() returns the default string (the 2nd argument) if it can\'t be sanitized');
    }
}

The test results: (checked on Ubuntu with PHP 5.3.2 and MacOsX with PHP 5.3.17:

All tests pass:

phpunit -c app/ src/COil/Bundle/COilCoreBundle/Tests/Unit/Helper/FilesystemTest.php
PHPUnit 3.6.10 by Sebastian Bergmann.

Configuration read from /var/www/strangebuzz.com/app/phpunit.xml.dist

.

Time: 0 seconds, Memory: 5.75Mb

OK (1 test, 17 assertions)
只有一腔孤勇 2024-09-05 13:56:08

解决方案#1:您可以在服务器(托管)上安装 PHP 扩展,

以便将“地球上几乎所有语言”音译为 ASCII 字符。

  1. 首先安装PHP Intl扩展。这是 Debian (Ubuntu) 的命令: sudo aptitude install php5-intl

  2. 这是我的 fileName 函数(创建 test.php 并粘贴以下代码):

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Test</title>
</head>
<body>
<?php

function pr($string) {
  print '<hr>';
  print '"' . fileName($string) . '"';
  print '<br>';
  print '"' . $string . '"';
}

function fileName($string) {
  // remove html tags
  $clean = strip_tags($string);
  // transliterate
  $clean = transliterator_transliterate('Any-Latin;Latin-ASCII;', $clean);
  // remove non-number and non-letter characters
  $clean = str_replace('--', '-', preg_replace('/[^a-z0-9-\_]/i', '', preg_replace(array(
    '/\s/', 
    '/[^\w-\.\-]/'
  ), array(
    '_', 
    ''
  ), $clean)));
  // replace '-' for '_'
  $clean = strtr($clean, array(
    '-' => '_'
  ));
  // remove double '__'
  $positionInString = stripos($clean, '__');
  while ($positionInString !== false) {
    $clean = str_replace('__', '_', $clean);
    $positionInString = stripos($clean, '__');
  }
  // remove '_' from the end and beginning of the string
  $clean = rtrim(ltrim($clean, '_'), '_');
  // lowercase the string
  return strtolower($clean);
}
pr('_replace(\'~&([a-z]{1,2})(ac134/56f4315981743 8765475[]lt7ňl2ú5äňú138yé73ťž7ýľute|');
pr(htmlspecialchars('<script>alert(\'hacked\')</script>'));
pr('Álix----_Ãxel!?!?');
pr('áéíóúÁÉÍÓÚ');
pr('üÿÄËÏÖÜ.ŸåÅ');
pr('nie4č a a§ôňäääaš');
pr('Мао Цзэдун');
pr('毛泽东');
pr('ماو تسي تونغ');
pr('مائو تسه‌تونگ');
pr('מאו דזה-דונג');
pr('მაო ძედუნი');
pr('Mao Trạch Đông');
pr('毛澤東');
pr('เหมา เจ๋อตง');
?>
</body>
</html>

此行是核心:

  // transliterate
  $clean = transliterator_transliterate('Any-Latin;Latin-ASCII;', $clean);

答案基于 这篇文章

解决方案 #2:您无法在服务器(托管)上安装 PHP 扩展

在此处输入图像描述

音译 CMS Drupal 模块。它支持地球上几乎所有语言。如果您想拥有,我建议检查插件 存储库真正完整的解决方案消毒字符串。

Solution #1: You have ability to install PHP extensions on server (hosting)

For transliteration of "almost every single language on the planet Earth" to ASCII characters.

  1. Install PHP Intl extension first. This is command for Debian (Ubuntu): sudo aptitude install php5-intl

  2. This is my fileName function (create test.php and paste there following code):

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Test</title>
</head>
<body>
<?php

function pr($string) {
  print '<hr>';
  print '"' . fileName($string) . '"';
  print '<br>';
  print '"' . $string . '"';
}

function fileName($string) {
  // remove html tags
  $clean = strip_tags($string);
  // transliterate
  $clean = transliterator_transliterate('Any-Latin;Latin-ASCII;', $clean);
  // remove non-number and non-letter characters
  $clean = str_replace('--', '-', preg_replace('/[^a-z0-9-\_]/i', '', preg_replace(array(
    '/\s/', 
    '/[^\w-\.\-]/'
  ), array(
    '_', 
    ''
  ), $clean)));
  // replace '-' for '_'
  $clean = strtr($clean, array(
    '-' => '_'
  ));
  // remove double '__'
  $positionInString = stripos($clean, '__');
  while ($positionInString !== false) {
    $clean = str_replace('__', '_', $clean);
    $positionInString = stripos($clean, '__');
  }
  // remove '_' from the end and beginning of the string
  $clean = rtrim(ltrim($clean, '_'), '_');
  // lowercase the string
  return strtolower($clean);
}
pr('_replace(\'~&([a-z]{1,2})(ac134/56f4315981743 8765475[]lt7ňl2ú5äňú138yé73ťž7ýľute|');
pr(htmlspecialchars('<script>alert(\'hacked\')</script>'));
pr('Álix----_Ãxel!?!?');
pr('áéíóúÁÉÍÓÚ');
pr('üÿÄËÏÖÜ.ŸåÅ');
pr('nie4č a a§ôňäääaš');
pr('Мао Цзэдун');
pr('毛泽东');
pr('ماو تسي تونغ');
pr('مائو تسه‌تونگ');
pr('מאו דזה-דונג');
pr('მაო ძედუნი');
pr('Mao Trạch Đông');
pr('毛澤東');
pr('เหมา เจ๋อตง');
?>
</body>
</html>

This line is core:

  // transliterate
  $clean = transliterator_transliterate('Any-Latin;Latin-ASCII;', $clean);

Answer based on this post.

Solution #2: You don't have ability to install PHP extensions on server (hosting)

enter image description here

Pretty good job is done in transliteration module for CMS Drupal. It supports almost every single language on the planet Earth. I suggest to check plugin repository if you want to have really complete solution sanitizing strings.

寄意 2024-09-05 13:56:08

这篇文章似乎是我所关联的所有文章中效果最好的。 http://gsynuh.com/php-string-filename-url-safe/205

This post seems to work the best among all that I have tied. http://gsynuh.com/php-string-filename-url-safe/205

寄居者 2024-09-05 13:56:08

这是 Prestashop 用于清理 url 的代码:

replaceAccentedChars

用于

str2url

删除变音符号

function replaceAccentedChars($str)
{
    $patterns = array(
        /* Lowercase */
        '/[\x{0105}\x{00E0}\x{00E1}\x{00E2}\x{00E3}\x{00E4}\x{00E5}]/u',
        '/[\x{00E7}\x{010D}\x{0107}]/u',
        '/[\x{010F}]/u',
        '/[\x{00E8}\x{00E9}\x{00EA}\x{00EB}\x{011B}\x{0119}]/u',
        '/[\x{00EC}\x{00ED}\x{00EE}\x{00EF}]/u',
        '/[\x{0142}\x{013E}\x{013A}]/u',
        '/[\x{00F1}\x{0148}]/u',
        '/[\x{00F2}\x{00F3}\x{00F4}\x{00F5}\x{00F6}\x{00F8}]/u',
        '/[\x{0159}\x{0155}]/u',
        '/[\x{015B}\x{0161}]/u',
        '/[\x{00DF}]/u',
        '/[\x{0165}]/u',
        '/[\x{00F9}\x{00FA}\x{00FB}\x{00FC}\x{016F}]/u',
        '/[\x{00FD}\x{00FF}]/u',
        '/[\x{017C}\x{017A}\x{017E}]/u',
        '/[\x{00E6}]/u',
        '/[\x{0153}]/u',

        /* Uppercase */
        '/[\x{0104}\x{00C0}\x{00C1}\x{00C2}\x{00C3}\x{00C4}\x{00C5}]/u',
        '/[\x{00C7}\x{010C}\x{0106}]/u',
        '/[\x{010E}]/u',
        '/[\x{00C8}\x{00C9}\x{00CA}\x{00CB}\x{011A}\x{0118}]/u',
        '/[\x{0141}\x{013D}\x{0139}]/u',
        '/[\x{00D1}\x{0147}]/u',
        '/[\x{00D3}]/u',
        '/[\x{0158}\x{0154}]/u',
        '/[\x{015A}\x{0160}]/u',
        '/[\x{0164}]/u',
        '/[\x{00D9}\x{00DA}\x{00DB}\x{00DC}\x{016E}]/u',
        '/[\x{017B}\x{0179}\x{017D}]/u',
        '/[\x{00C6}]/u',
        '/[\x{0152}]/u');

    $replacements = array(
            'a', 'c', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 'ss', 't', 'u', 'y', 'z', 'ae', 'oe',
            'A', 'C', 'D', 'E', 'L', 'N', 'O', 'R', 'S', 'T', 'U', 'Z', 'AE', 'OE'
        );

    return preg_replace($patterns, $replacements, $str);
}

function str2url($str)
{
    if (function_exists('mb_strtolower'))
        $str = mb_strtolower($str, 'utf-8');

    $str = trim($str);
    if (!function_exists('mb_strtolower'))
        $str = replaceAccentedChars($str);

    // Remove all non-whitelist chars.
    $str = preg_replace('/[^a-zA-Z0-9\s\'\:\/\[\]-\pL]/u', '', $str);
    $str = preg_replace('/[\s\'\:\/\[\]-]+/', ' ', $str);
    $str = str_replace(array(' ', '/'), '-', $str);

    // If it was not possible to lowercase the string with mb_strtolower, we do it after the transformations.
    // This way we lose fewer special chars.
    if (!function_exists('mb_strtolower'))
        $str = strtolower($str);

    return $str;
}

This is the code used by Prestashop to sanitize urls :

replaceAccentedChars

is used by

str2url

to remove diacritics

function replaceAccentedChars($str)
{
    $patterns = array(
        /* Lowercase */
        '/[\x{0105}\x{00E0}\x{00E1}\x{00E2}\x{00E3}\x{00E4}\x{00E5}]/u',
        '/[\x{00E7}\x{010D}\x{0107}]/u',
        '/[\x{010F}]/u',
        '/[\x{00E8}\x{00E9}\x{00EA}\x{00EB}\x{011B}\x{0119}]/u',
        '/[\x{00EC}\x{00ED}\x{00EE}\x{00EF}]/u',
        '/[\x{0142}\x{013E}\x{013A}]/u',
        '/[\x{00F1}\x{0148}]/u',
        '/[\x{00F2}\x{00F3}\x{00F4}\x{00F5}\x{00F6}\x{00F8}]/u',
        '/[\x{0159}\x{0155}]/u',
        '/[\x{015B}\x{0161}]/u',
        '/[\x{00DF}]/u',
        '/[\x{0165}]/u',
        '/[\x{00F9}\x{00FA}\x{00FB}\x{00FC}\x{016F}]/u',
        '/[\x{00FD}\x{00FF}]/u',
        '/[\x{017C}\x{017A}\x{017E}]/u',
        '/[\x{00E6}]/u',
        '/[\x{0153}]/u',

        /* Uppercase */
        '/[\x{0104}\x{00C0}\x{00C1}\x{00C2}\x{00C3}\x{00C4}\x{00C5}]/u',
        '/[\x{00C7}\x{010C}\x{0106}]/u',
        '/[\x{010E}]/u',
        '/[\x{00C8}\x{00C9}\x{00CA}\x{00CB}\x{011A}\x{0118}]/u',
        '/[\x{0141}\x{013D}\x{0139}]/u',
        '/[\x{00D1}\x{0147}]/u',
        '/[\x{00D3}]/u',
        '/[\x{0158}\x{0154}]/u',
        '/[\x{015A}\x{0160}]/u',
        '/[\x{0164}]/u',
        '/[\x{00D9}\x{00DA}\x{00DB}\x{00DC}\x{016E}]/u',
        '/[\x{017B}\x{0179}\x{017D}]/u',
        '/[\x{00C6}]/u',
        '/[\x{0152}]/u');

    $replacements = array(
            'a', 'c', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 'ss', 't', 'u', 'y', 'z', 'ae', 'oe',
            'A', 'C', 'D', 'E', 'L', 'N', 'O', 'R', 'S', 'T', 'U', 'Z', 'AE', 'OE'
        );

    return preg_replace($patterns, $replacements, $str);
}

function str2url($str)
{
    if (function_exists('mb_strtolower'))
        $str = mb_strtolower($str, 'utf-8');

    $str = trim($str);
    if (!function_exists('mb_strtolower'))
        $str = replaceAccentedChars($str);

    // Remove all non-whitelist chars.
    $str = preg_replace('/[^a-zA-Z0-9\s\'\:\/\[\]-\pL]/u', '', $str);
    $str = preg_replace('/[\s\'\:\/\[\]-]+/', ' ', $str);
    $str = str_replace(array(' ', '/'), '-', $str);

    // If it was not possible to lowercase the string with mb_strtolower, we do it after the transformations.
    // This way we lose fewer special chars.
    if (!function_exists('mb_strtolower'))
        $str = strtolower($str);

    return $str;
}
带刺的爱情 2024-09-05 13:56:08

这是一个很好的功能:

public function getFriendlyURL($string) {
    setlocale(LC_CTYPE, 'en_US.UTF8');
    $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $string = preg_replace('~[^\-\pL\pN\s]+~u', '-', $string);
    $string = str_replace(' ', '-', $string);
    $string = trim($string, "-");
    $string = strtolower($string);
    return $string;
} 

This is a good function:

public function getFriendlyURL($string) {
    setlocale(LC_CTYPE, 'en_US.UTF8');
    $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $string = preg_replace('~[^\-\pL\pN\s]+~u', '-', $string);
    $string = str_replace(' ', '-', $string);
    $string = trim($string, "-");
    $string = strtolower($string);
    return $string;
} 
空宴 2024-09-05 13:56:08

有两个很好的答案可以解决您的数据问题,请使用它 https://stackoverflow.com/a/3987966/971619 或它https://stackoverflow.com/a/7610586/971619

There is 2 good answers to slugfy your data, use it https://stackoverflow.com/a/3987966/971619 or it https://stackoverflow.com/a/7610586/971619

流云如水 2024-09-05 13:56:08
// CLEAN ILLEGAL CHARACTERS
function clean_filename($source_file)
{
    $search[] = " ";
    $search[] = "&";
    $search[] = "$";
    $search[] = ",";
    $search[] = "!";
    $search[] = "@";
    $search[] = "#";
    $search[] = "^";
    $search[] = "(";
    $search[] = ")";
    $search[] = "+";
    $search[] = "=";
    $search[] = "[";
    $search[] = "]";

    $replace[] = "_";
    $replace[] = "and";
    $replace[] = "S";
    $replace[] = "_";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";

    return str_replace($search,$replace,$source_file);

} 
// CLEAN ILLEGAL CHARACTERS
function clean_filename($source_file)
{
    $search[] = " ";
    $search[] = "&";
    $search[] = "$";
    $search[] = ",";
    $search[] = "!";
    $search[] = "@";
    $search[] = "#";
    $search[] = "^";
    $search[] = "(";
    $search[] = ")";
    $search[] = "+";
    $search[] = "=";
    $search[] = "[";
    $search[] = "]";

    $replace[] = "_";
    $replace[] = "and";
    $replace[] = "S";
    $replace[] = "_";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";
    $replace[] = "";

    return str_replace($search,$replace,$source_file);

} 
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文