Preparing a PHP application to use UTF-8
UTF-8 is the de facto standard for web applications now, but it is not the default encoding for PHP (until 6.0). Most servers are set up for the ISO-8859-1 encoding by default.
How can I override the default settings in .htaccess to be sure that everything works for UTF-8, locale, etc.? Are there any options for the web server or the Unix OS?
Is there a comprehensive list of those settings, e.g. the mbstring options, iconv settings, locale, etc. that I should set up for each multi-language project? Is there a predefined .htaccess to use as an example?
(In my particular case I need a setup for the languages English, Dutch and Russian. The server is in Ukraine.)

Some useful options to have in .htaccess:

You're right,
UTF-8 is a good choice for web applications.
Encoding is meta-information about the data being processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.
As a rule of thumb, PHP is binary; it's the context, or you, that specifies the encoding (e.g. how you save your PHP source files).
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File systems, for example, have their own encoding for the names of files and directories. I'm not very firm on this subject; normally we try to name our files in English so as to use only characters in the US-ASCII range, which is safe for Latin extended charsets like ISO-8859-1 (your case) as well as for UTF-8.
Keep this in mind when you save files your users upload: filter filenames down to basic letters and punctuation (a-z, A-Z, 0-9, ., -, _) and you'll have nearly no hassles; you can even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 covers, you can fall back to a simple encoding like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name on disk.
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
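As a rough sketch of the two filename strategies described above (not code from the original answer): sanitizeFilename() and encodeFilename() are hypothetical helper names, and the allowed character set is the one listed in the text.

```php
<?php
declare(strict_types=1);

/**
 * Reduce an uploaded file name to a-z, 0-9, '.', '-' and '_'.
 * sanitizeFilename() is a hypothetical helper, not a PHP built-in.
 */
function sanitizeFilename(string $name): string
{
    // Drop any directory part the client may have sent along.
    $name = str_replace('\\', '/', $name);
    $pos = strrpos($name, '/');
    if ($pos !== false) {
        $name = substr($name, $pos + 1);
    }
    // Lowercase the ASCII letters; other bytes are handled in the next step.
    $name = strtolower($name);
    // Replace every run of disallowed bytes with a single underscore.
    $name = (string) preg_replace('/[^a-z0-9._-]+/', '_', $name);
    // Avoid names that are empty or consist of dots only.
    $name = trim($name, '.');
    return $name === '' ? 'file' : $name;
}

/** Alternative: keep the original name, stored percent-encoded. */
function encodeFilename(string $name): string
{
    return rawurlencode($name);
}
```

The first function loses information (a Russian file name collapses to underscores), while the second keeps the name recoverable with rawurldecode() at the cost of less readable names on disk.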
HTML
This is largely independent of PHP; it's about the output your scripts produce, and so part of your field of work.
The rule of thumb is: specify it. If you didn't specify the encoding (for HTML files, CSS files, JavaScript files), don't expect it to work precisely, so just do it. Encoding is a chain: if there are many components, ensure that each one knows its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care to make this precise and well defined.
PHP Settings
As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty (Source). For general information see Setting the HTTP charset parameter (W3C). If you want to improve your site's output, e.g. to preserve the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well: <meta http-equiv="Content-type" content="text/html;charset=UTF-8">
output_handler - This setting is worth a look, as it specifies the output handler (see the Output Buffering Control docs), and each handler (mb, iconv) can have its own encoding settings (see Strings).
Strings
Strings can be marked as binary explicitly: $binary = (binary) $string; or $binary = b"binary string";.
mb_internal_encoding() (docs) - Gets or sets it; see also the mbstring.internal_encoding INI setting. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
iconv_set_encoding() (docs) - The comparable function for the iconv extension. See also the iconv configuration settings.
Some string functions, such as htmlspecialchars (docs), take a charset parameter. Make use of these parameters and check the docs for their default value. In older PHP versions it is often ISO-8859-1 (since PHP 5.4 the default is UTF-8), but you're looking for UTF-8. Other functions like html_entity_decode (docs) use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
To answer your question: which settings and parameters you need always depends on the components you use. For the general ones like the browser or the web server, it's possible to give recommended settings for configuring them for UTF-8. For everything else, it depends. The most important thing is to look for the setting and to ensure that you know the encoding and can configure or specify it. Often it's documented. As long as you don't need to deal with portable code, this is much simpler, since you either control the environment or need to deal with one specific environment only. Write code defensively with encoding in mind and you should be fine.
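For illustration, the settings named in this answer could be applied at runtime roughly as follows. This is a sketch under assumptions: the mbstring extension may or may not be installed, and e() is a hypothetical helper name, not part of PHP.

```php
<?php
declare(strict_types=1);

// Charset that PHP appends to the Content-Type header it sends by default.
ini_set('default_charset', 'UTF-8');

// Internal encoding for the mbstring functions (only if the extension is loaded).
if (function_exists('mb_internal_encoding')) {
    mb_internal_encoding('UTF-8');
}

/**
 * Hypothetical helper: escape a UTF-8 string for HTML output.
 * The charset is passed explicitly instead of relying on the default.
 */
function e(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
```

With these flags htmlspecialchars() returns an empty string for input that is not valid UTF-8, which is one place where a broken encoding chain becomes visible.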

The web server may be configured to send inappropriate headers, so it is recommended to override them at the application level. For instance:
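A minimal sketch of such an override, assuming the script produces HTML and nothing has been output yet; contentTypeHeader() is a hypothetical helper introduced only for this example.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: build a Content-Type header line with an explicit charset. */
function contentTypeHeader(string $mime = 'text/html', string $charset = 'UTF-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// header() only works before any output has been sent.
// It replaces a Content-Type header that PHP set earlier.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```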
Add the HTML meta content-type tag, for example <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
Use htmlspecialchars() instead of htmlentities(), because the former is sufficient in UTF-8 and the latter is incompatible with UTF-8 by default (in PHP versions before 5.4).
For regular expressions, use the u modifier. For example:
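The following sketch shows the difference the u modifier makes. It assumes the source file itself is saved as UTF-8; the sample word is arbitrary.

```php
<?php
declare(strict_types=1);

$word = 'Привет';

// Without /u the pattern works on bytes: each Cyrillic letter is 2 bytes in UTF-8.
preg_match_all('/./', $word, $bytes);

// With /u the pattern works on Unicode code points.
preg_match_all('/./u', $word, $chars);

$byteCount = count($bytes[0]); // 12
$charCount = count($chars[0]); // 6

// Unicode properties such as \p{Cyrillic} are meant to be used together with /u.
$isCyrillic = preg_match('/^\p{Cyrillic}+$/u', $word) === 1;
```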
This is the most reliable way to check whether a given string is a valid UTF-8 string:
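One possible form of such a check is sketched below; the original snippet may have differed. isValidUtf8() is a hypothetical helper. It prefers mb_check_encoding() when mbstring is available and otherwise relies on the fact that preg_match() with the u modifier fails on a subject that is not valid UTF-8.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: true if $s is well-formed UTF-8. */
function isValidUtf8(string $s): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($s, 'UTF-8');
    }
    // Fallback: with /u, preg_match() returns false for invalid UTF-8 input,
    // while the empty pattern matches any valid string (including '').
    return preg_match('//u', $s) === 1;
}
```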
If you use a database, always set the appropriate connection encoding right after the connection is made. Example for MySQL:
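A minimal sketch, assuming PDO with the MySQL driver and a server that supports the utf8mb4 charset; mysqlDsn() is a hypothetical helper, and the host, database name and credentials are placeholders.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: build a PDO MySQL DSN that fixes the connection charset. */
function mysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Usage (placeholders, not real credentials):
// $pdo = new PDO(mysqlDsn('localhost', 'app'), 'user', 'secret');
//
// With mysqli, the equivalent step right after connecting is:
// $mysqli->set_charset('utf8mb4');
```

Setting the charset in the DSN or through set_charset() makes the client library aware of the encoding as well, which a plain SET NAMES query does not.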
Also check whether the columns in the database are in UTF-8. It's not always needed, but it is recommended.

Basically I do three things to work correctly with the Czech language:
1) Define the locale in PHP:
so you would use something like:
based on the language that is currently selected.
2) Define the charset for the database:
3) Define the charset of the PHP/HTML code:
I don't use any .htaccess settings. You can modify this for your case: for the locale, use something like en_US.utf8 (based on the language that is currently selected), and for the charset use utf-8 instead of latin2/iso-8859-2, and it should work well.
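As a sketch of the locale step for the languages in the question (English, Dutch, Russian): the language-to-locale map and switchLocale() are hypothetical, and the locale names are assumptions that only work if those locales are installed on the server.

```php
<?php
declare(strict_types=1);

/** Hypothetical map from the site's language codes to candidate locale names. */
const LOCALES = [
    'en' => ['en_US.UTF-8', 'en_US.utf8'],
    'nl' => ['nl_NL.UTF-8', 'nl_NL.utf8'],
    'ru' => ['ru_RU.UTF-8', 'ru_RU.utf8'],
];

/**
 * Switch the process locale for the given language code.
 * Returns the locale name that was set, or false if none is installed.
 */
function switchLocale(string $lang)
{
    $candidates = LOCALES[$lang] ?? LOCALES['en'];
    // setlocale() tries each name in turn and uses the first one that works.
    return setlocale(LC_ALL, $candidates);
}
```

Checking the return value matters here, because a locale that is not installed on the server fails silently otherwise.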
Try one of the following: