I am trying to come up with a function that does a good job of sanitizing certain strings so that they are safe to use in the URL (like a post slug) and also safe to use as file names. For example, when someone uploads a file I want to make sure that I remove all dangerous characters from the name.
So far I have come up with the following function which I hope solves this problem and allows foreign UTF-8 data also.
/**
 * Convert a string to the file/URL safe "slug" form
 *
 * @param string $string the string to clean
 * @param bool $is_filename TRUE will allow additional filename characters
 * @return string
 */
function sanitize($string = '', $is_filename = FALSE)
{
    // Replace all weird characters with dashes
    $string = preg_replace('/[^\w\-'. ($is_filename ? '~_\.' : ''). ']+/u', '-', $string);
    // Only allow one dash separator at a time (and make string lowercase)
    return mb_strtolower(preg_replace('/--+/u', '-', $string), 'UTF-8');
}
Does anyone have any tricky sample data I can run against this - or know of a better way to safeguard our apps from bad names?
$is_filename allows some additional characters, such as those used by temporary vim files
update: removed the star character since I could not think of a valid use
I found this larger function in the Chyrp code:
and this one in the WordPress code:
Update Sept 2012
Alix Axel has done some incredible work in this area. His phunction framework includes several great text filters and transformations.
Some observations on your solution:
Creating the slug
You probably shouldn't include accented or other non-ASCII characters in your post slug since, technically, they should be percent-encoded (per URL encoding rules), so you'll end up with ugly-looking URLs.
So, if I were you, after lowercasing, I'd convert any 'special' characters to their equivalent (e.g. é -> e) and replace non [a-z] characters with '-', limiting to runs of a single '-' as you've done. There's an implementation of converting special characters here: https://web.archive.org/web/20130208144021/http://neo22s.com/slug
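The steps described above (lowercase, map special characters to plain equivalents, collapse everything else into single dashes) might look roughly like the sketch below. The function name make_slug and the small SLUG_CHAR_MAP table are made up for illustration; the table covers only a handful of characters, and digits are kept in addition to a-z, which is an assumption on my part.

```php
<?php
// Illustrative map only (an assumption): a real application would need a far
// larger table or a transliteration library.
const SLUG_CHAR_MAP = [
    'é' => 'e', 'è' => 'e', 'ê' => 'e', 'á' => 'a', 'à' => 'a',
    'ñ' => 'n', 'ö' => 'o', 'ü' => 'u', 'ç' => 'c', 'ß' => 'ss',
];

// Hypothetical helper: turns a title into a lowercase ASCII slug.
function make_slug(string $title): string
{
    // Lowercase first so the map only needs lowercase keys.
    $slug = mb_strtolower($title, 'UTF-8');
    // Swap known accented characters for their ASCII equivalents.
    $slug = strtr($slug, SLUG_CHAR_MAP);
    // Replace every run of characters outside a-z and 0-9 with a single dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Drop dashes left over at either end.
    return trim($slug, '-');
}

echo make_slug('Café Señor -- déjà vu'), "\n"; // cafe-senor-deja-vu
```

Characters missing from the map simply become dashes, so the result stays URL-safe even when the transliteration is incomplete.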
Sanitization in general
OWASP have a PHP implementation of their Enterprise Security API which among other things includes methods for safe encoding and decoding input and output in your application.
The Encoder interface provides:
https://github.com/OWASP/PHP-ESAPI
https://www.owasp.org/index.php/Category:OWASP_Enterprise_Security_API
This should make your filenames safe...
and a deeper solution to this is:
This assumes that you want a dot in the filename.
If you want it converted to lowercase, just use
for the last line.
Try this:
Based on the selected answer in this thread: URL Friendly Username in PHP?
This isn't exactly an answer as it doesn't provide any solutions (yet!), but it's too big to fit on a comment...
I did some testing (regarding file names) on Windows 7 and Ubuntu 12.04 and what I found out was that:
1. PHP Can't Handle non-ASCII Filenames
Although both Windows and Ubuntu can handle Unicode filenames (even RTL ones as it seems) PHP 5.3 requires hacks to deal even with the plain old ISO-8859-1, so it's better to keep it ASCII only for safety.
2. The Length of the Filename Matters (Especially on Windows)
On Ubuntu, the maximum length a filename can have (including extension) is 255 (excluding path):
However, on Windows 7 (NTFS) the maximum length a filename can have depends on its absolute path:
Wikipedia says that:
To the best of my knowledge (and testing), this is wrong.
In total (counting slashes), all these examples have 259 chars; if you strip the C:\ prefix, that gives 256 characters (not 255?!). The directories were created using Explorer, and you'll notice that it restrains itself from using all the available space for the directory name. The reason for this is to allow the creation of files using the 8.3 file naming convention. The same thing happens for other partitions.
Files don't need to reserve the 8.3 length requirements, of course:
You can't create any more sub-directories if the absolute path of the parent directory has more than 242 characters, because 256 = 242 + 1 + \ + 8 + . + 3. Using Windows Explorer, you can't create another directory if the parent directory has more than 233 characters (depending on the system locale), because 256 = 233 + 10 + \ + 8 + . + 3; the 10 here is the length of the string "New folder".
The Windows file system poses a nasty problem if you want to assure inter-operability between file systems.
3. Beware of Reserved Characters and Keywords
Aside from removing non-ASCII, non-printable and control characters, you also need to re(place/move):
Just removing these characters might not be the best idea because the filename might lose some of its meaning. I think that, at the very least, multiple occurrences of these characters should be replaced by a single underscore (_), or perhaps something more representative (this is just an idea):
"*? -> _
/\| -> -
: -> [ ]-[ ]
< -> (
> -> )
There are also special keywords that should be avoided (like NUL), although I'm not sure how to overcome that. Perhaps a blacklist with a random name fallback would be a good approach to solve it.
4. Case Sensitiveness
This should go without saying, but if you want to ensure file uniqueness across different operating systems you should transform file names to a normalized case; that way my_file.txt and My_File.txt on Linux won't both become the same my_file.txt file on Windows.
5. Make Sure It's Unique
If the file name already exists, a unique identifier should be appended to its base file name.
Common unique identifiers include the UNIX timestamp, a digest of the file contents or a random string.
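As a rough sketch of the random-string variant, assuming a hypothetical helper called unique_filename and PHP 7+ for random_bytes():

```php
<?php
// Hypothetical helper: returns a file name that does not yet exist in $dir.
function unique_filename(string $dir, string $filename): string
{
    $info = pathinfo($filename);
    $base = $info['filename'];
    $ext  = isset($info['extension']) ? '.' . $info['extension'] : '';

    $candidate = $filename;
    while (file_exists($dir . DIRECTORY_SEPARATOR . $candidate)) {
        // Append a short random hex string to the base name, before the extension.
        $candidate = $base . '_' . bin2hex(random_bytes(4)) . $ext;
    }
    return $candidate;
}
```

Note that checking and then creating the file is not atomic, so two simultaneous uploads could still pick the same name; opening the file with fopen() in 'x' mode is one way to close that gap.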
6. Hidden Files
Just because it can be named doesn't mean it should...
Dots are usually white-listed in file names but in Linux a hidden file is represented by a leading dot.
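Points 3, 4 and 6 could be combined along these lines. This is only a sketch: clean_filename is a made-up name, every reserved character is collapsed to an underscore rather than using the more descriptive replacements suggested earlier, and the random fallback drops the extension.

```php
<?php
// Device names that Windows reserves regardless of extension.
const WINDOWS_RESERVED = [
    'con', 'prn', 'aux', 'nul',
    'com1', 'com2', 'com3', 'com4', 'com5', 'com6', 'com7', 'com8', 'com9',
    'lpt1', 'lpt2', 'lpt3', 'lpt4', 'lpt5', 'lpt6', 'lpt7', 'lpt8', 'lpt9',
];

// Hypothetical helper: ASCII-only, lowercase, no reserved characters or names.
function clean_filename(string $name): string
{
    // Drop control characters and anything outside printable ASCII.
    $name = preg_replace('/[^\x20-\x7E]/', '', $name);
    // Collapse runs of characters reserved on Windows into one underscore.
    $name = preg_replace('~[<>:"/\\\\|?*]+~', '_', $name);
    // Strip leading dots (hidden files on Linux) plus trailing dots and spaces.
    $name = rtrim(ltrim($name, '. '), '. ');
    // Normalise case so names cannot collide on case-insensitive file systems.
    $name = strtolower($name);
    // Fall back to a random name if nothing is left or the base name is reserved.
    $base = explode('.', $name)[0];
    if ($name === '' || in_array($base, WINDOWS_RESERVED, true)) {
        $name = 'file_' . bin2hex(random_bytes(4));
    }
    return $name;
}

echo clean_filename('My:File?.txt'), "\n"; // my_file_.txt
echo clean_filename('.htaccess'), "\n";    // htaccess
```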
7. Other Considerations
If you have to strip some chars from the file name, the extension is usually more important than the base name of the file. Allowing a generous maximum number of characters for the file extension (8-16), one should strip the characters from the base name. It's also important to note that in the unlikely event of having more than one long extension (such as _.graphmlz.tag.gz or _.graphmlz.tag), only _ should be considered as the file base name in this case.
8. Resources
Calibre handles file name mangling pretty decently:
Wikipedia page on file name mangling and linked chapter from Using Samba.
If for instance, you try to create a file that violates any of the rules 1/2/3, you'll get a very useful error:
I've always thought Kohana did a pretty good job of it.
The handy UTF8::transliterate_to_ascii() will turn stuff like ñ => n. Of course, you could replace the other UTF8::* stuff with mb_* functions.
I have adapted from another source and added a couple extra, maybe a little overkill
In terms of file uploads, you would be safest to prevent the user from controlling the file name. As has already been hinted at, store the canonicalised filename in a database along with a randomly chosen and unique name which you'll use as the actual filename.
Using OWASP ESAPI, these names could be generated thus:
You could append a timestamp to the $safeFilename to help ensure that the randomly generated filename is unique without even checking for an existing file.
In terms of encoding for URL, and again using ESAPI:
This method performs canonicalisation before encoding the string and will handle all character encodings.
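If ESAPI is not available, the same idea of never using the user's name on disk can be sketched with PHP's built-in random_bytes() (PHP 7+). The helper random_storage_name below is hypothetical and is not part of ESAPI.

```php
<?php
// Hypothetical helper, not part of ESAPI: builds the name used on disk.
function random_storage_name(string $extension = ''): string
{
    // 16 random bytes -> 32 hex characters; collisions are very unlikely.
    $name = bin2hex(random_bytes(16));
    // Keep only a short alphanumeric extension, if one was supplied.
    if (preg_match('/^[a-z0-9]{1,16}$/i', $extension) === 1) {
        $name .= '.' . strtolower($extension);
    }
    return $name;
}

// The user-supplied name is treated as data only.
$original = 'Quarterly report (final).PDF';
$stored   = random_storage_name(pathinfo($original, PATHINFO_EXTENSION));
// Persist the pair ($original, $stored) in the database; use only $stored on disk.
```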
I recommend* URLify for PHP (480+ stars on Github) - "the PHP port of URLify.js from the Django project. Transliterates non-ascii characters for use in URLs".
Basic usage:
To generate slugs for URLs:
To generate slugs for file names:
*None of the other suggestions matched my criteria:
As a bonus, URLify also removes certain words and strips away all characters not transliterated.
Here is a test case with tons of foreign characters being transliterated properly using URLify: https://gist.github.com/motin/a65e6c1cc303e46900d10894bf2da87f
And this is the Joomla 3.3.2 version of JFile::makeSafe($file):
Depending on how you will use it, you might want to add a length limit to protect against buffer overflows.
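One possible way to add such a limit while keeping the extension intact is sketched below. The name limit_filename_length and the 255-byte default are assumptions, and the sketch expects the extension itself to be much shorter than the limit.

```php
<?php
// Hypothetical helper: trims the base name so the whole name fits in $maxBytes.
function limit_filename_length(string $filename, int $maxBytes = 255): string
{
    if (strlen($filename) <= $maxBytes) {
        return $filename;
    }
    // Treat everything from the last dot onwards as the extension.
    $dot  = strrpos($filename, '.');
    $ext  = ($dot === false || $dot === 0) ? '' : substr($filename, $dot);
    $base = substr($filename, 0, strlen($filename) - strlen($ext));
    // Bytes left for the base name once the extension is accounted for.
    $room = max(0, $maxBytes - strlen($ext));
    // mb_strcut cuts on a character boundary, so UTF-8 sequences are never split.
    return mb_strcut($base, 0, $room, 'UTF-8') . $ext;
}
```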
I don't think having a list of chars to remove is safe. I would rather use the following:
For filenames: Use an internal ID or a hash of the filecontent. Save the document name in a database. This way you can keep the original filename and still find the file.
For URL parameters: Use urlencode() to encode any special characters.
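A minimal sketch of both suggestions, assuming a hypothetical helper content_hash_name and a made-up download.php endpoint:

```php
<?php
// Hypothetical helper: name a stored upload after a digest of its contents.
function content_hash_name(string $path): string
{
    $hash = hash_file('sha256', $path);
    if ($hash === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $hash;
}

// The original name is kept as data and only encoded when placed in a URL.
$original = 'My résumé (v2).pdf';
$query    = 'download.php?name=' . urlencode($original);
echo $query, "\n"; // download.php?name=My+r%C3%A9sum%C3%A9+%28v2%29.pdf
```

Naming by content hash also means that two uploads with identical contents map to the same stored file.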
This is a nice way to secure an upload filename:
Here's CodeIgniter's implementation.
And the remove_invisible_characters dependency.
I have entry titles with all kinds of weird latin characters as well as some HTML tags that I needed to translate into a useful dash-delimited filename format. I combined @SoLoGHoST's answer with a couple of items from @Xeoncross's answer and customized a bit.
I needed to manually add the em dash character (—) to the translation array. There may be others but so far my file names are looking good.
So:
Part 1: My dad’s “Žurburts”?—they’re (not) the best!
becomes:
part-1-my-dads-zurburts-theyre-not-the-best
I just add ".html" to the returned string.
Why not simply use PHP's urlencode? It replaces "dangerous" characters with their hex representation for URLs (e.g. %20 for a space; note that urlencode itself encodes a space as +, while rawurlencode produces %20).
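For reference, a quick comparison of the two built-in encoders on a made-up input:

```php
<?php
$name = 'my file/../name.txt';
echo urlencode($name), "\n";    // my+file%2F..%2Fname.txt
echo rawurlencode($name), "\n"; // my%20file%2F..%2Fname.txt
```

Both outputs are safe inside a URL, but they still contain percent signs and the ".." sequence, so this addresses the URL half of the question rather than the file name half.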
There are already several solutions provided for this question but I have read and tested most of the code here and I ended up with this solution which is a mix of what I learned here:
The function
The function is bundled here in a Symfony2 bundle, but it can be extracted and used as plain PHP; its only dependency is the iconv function, which must be enabled:
Filesystem.php:
The unit tests
I have also created PHPUnit tests, both to cover edge cases and so that you can check whether the function fits your needs:
(If you find a bug, feel free to add a test case)
FilesystemTest.php:
The test results (checked on Ubuntu with PHP 5.3.2 and Mac OS X with PHP 5.3.17):
Solution #1: You are able to install PHP extensions on the server (hosting)
This allows transliteration of "almost every single language on the planet Earth" to ASCII characters.
Install the PHP Intl extension first. This is the command for Debian (Ubuntu):
sudo aptitude install php5-intl
This is my fileName function (create test.php and paste the following code there):
This line is the core:
Answer based on this post.
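For illustration, a transliteration call with the Intl extension might look like the sketch below. The wrapper to_ascii and the rule string 'Any-Latin; Latin-ASCII' are my own choices here, not necessarily what the function above uses.

```php
<?php
// Sketch only: requires the intl extension; returns null when it is missing.
function to_ascii(string $text): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null;
    }
    // Convert any script to Latin, then Latin to plain ASCII.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    return $result === false ? null : $result;
}

var_dump(to_ascii('ñandú'));
```

The output for non-Latin scripts depends on the ICU version installed, so it is worth testing against the languages you actually expect.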
Solution #2: You are not able to install PHP extensions on the server (hosting)
The transliteration module for the Drupal CMS does a pretty good job. It supports almost every single language on the planet Earth. I suggest checking the plugin repository if you want a really complete solution for sanitizing strings.
This post seems to work the best among all that I have tried. http://gsynuh.com/php-string-filename-url-safe/205
This is the code used by Prestashop to sanitize URLs:
is used by
to remove diacritics
This is a good function:
There are two good answers for slugifying your data: use https://stackoverflow.com/a/3987966/971619 or https://stackoverflow.com/a/7610586/971619