How to handle invalid UTF-8 characters in user input

Posted 2024-09-19 07:38:54 · 1063 characters · 6 views · 0 comments

I'm looking for a general strategy/advice on how to handle invalid UTF-8 input from users.

Even though my web application uses UTF-8, somehow some users enter invalid characters. This causes errors in PHP's json_encode() and overall seems like a bad idea to have around.

W3C I18N FAQ: Multilingual Forms says "If non-UTF-8 data is received, an error message should be sent back.".

  • How exactly should this be practically done, throughout a site with dozens of different places where data can be input?
  • How do you present the error in a helpful way to the user?
  • How do you temporarily store and display bad form data so the user doesn't lose all their text? Strip bad characters? Use a replacement character, and how?
  • For existing data in the database, when invalid UTF-8 data is detected, should I try to convert it and save it back (how? utf8_encode()? mb_convert_encoding()?), or leave it as-is in the database but do something (what?) before json_encode()?

I'm very familiar with the mbstring extension and am not asking "how does UTF-8 work in PHP?". I'd like advice from people with experience in real-world situations how they've handled this.

As part of the solution, I'd really like to see a fast method to convert invalid characters to U+FFFD.

Comments (9)

比忠 2024-09-26 07:39:10

Set UTF-8 as the character set for all headers output by your PHP code.

In every PHP output header, specify UTF-8 as the encoding:

header('Content-Type: text/html; charset=utf-8');
生生不灭 2024-09-26 07:39:09

Try doing what Ruby on Rails does to force all browsers always to post UTF-8 data:

<form accept-charset="UTF-8" action="#{action}" method="post"><div
    style="margin:0;padding:0;display:inline">
    <input name="utf8" type="hidden" value="✓" />
  </div>
  <!-- form fields -->
</form>

See railssnowman.info or the initial patch for an explanation.

  1. To have the browser send form-submission data in the UTF-8 encoding, just render the page with a Content-Type header of "text/html; charset=utf-8" (or use a meta http-equiv tag).

  2. To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding (browsers let users do that), use accept-charset="UTF-8" in the form.

  3. To have the browser send form-submission data in the UTF-8 encoding even if the user fiddles with the page encoding, and even if the browser is Internet Explorer and the user switched the page encoding to Korean and entered Korean characters in the form fields, add a hidden input to the form with a value such as ✓, which can only come from the Unicode character set (and, in this example, not from the Korean one).
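
On the PHP side, the hidden field from the form above can double as a canary. The following is only a sketch: the function name is hypothetical, and it assumes the field is called utf8 and carries U+2713 (bytes E2 9C 93), as in the form shown earlier. If the marker arrives damaged, the rest of the submission was probably not sent as UTF-8 either.

```php
<?php
// Hypothetical helper: returns true when the hidden "utf8" field arrived
// as the UTF-8 bytes of U+2713 (CHECK MARK), i.e. E2 9C 93.
function form_was_posted_as_utf8(array $post): bool
{
    return isset($post['utf8']) && $post['utf8'] === "\xE2\x9C\x93";
}

// Usage sketch: the first call sees an intact marker, the second a damaged one.
$intact  = form_was_posted_as_utf8(['utf8' => "\xE2\x9C\x93", 'name' => 'Zoe']);
$damaged = form_was_posted_as_utf8(['utf8' => '?', 'name' => 'Zoe']);
```

A failed check is a reasonable point to re-display the form with an error instead of storing the data.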

删除会话 2024-09-26 07:39:08

Strip all characters outside your given subset. At least in some parts of my application I would not allow characters outside the [a-zA-Z] and [0-9] sets, for example in usernames.

You can build a filter function that silently strips all characters outside this range, or that returns an error if it detects them and pushes the decision to the user.
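
Both variants of such a filter could look roughly like this. It is a minimal sketch with hypothetical function names, assuming the allowed set is exactly A-Z, a-z and 0-9. The patterns deliberately omit the /u modifier so they work byte by byte and also drop bytes that are not valid UTF-8.

```php
<?php
// Hypothetical helper: silently drop every byte outside A-Z, a-z and 0-9.
function strip_to_alnum(string $input): string
{
    return preg_replace('/[^A-Za-z0-9]/', '', $input);
}

// Hypothetical helper: report instead of stripping, so the caller can
// show an error and push the decision to the user.
function is_alnum_only(string $input): bool
{
    return preg_match('/\A[A-Za-z0-9]+\z/', $input) === 1;
}

// Usage sketch
$username = strip_to_alnum("user\xFF_01!");
$accepted = is_alnum_only('user01');
```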

冰雪之触 2024-09-26 07:39:06

I recommend merely not allowing garbage to get in. Don't rely on custom functions, which can bog your system down.

Simply walk the submitted data against an alphabet you design. Create an acceptable alphabet string and walk the submitted data, byte by byte, as if it were an array. Push acceptable characters to a new string, and omit unacceptable characters.

The data you store in your database then is data triggered by the user, but not actually user-supplied data.

<?php
    // Build alphabet
    // Optionally, you can remove characters from this array

    $alpha[] = chr(0); // null
    $alpha[] = chr(9); // tab
    $alpha[] = chr(10); // new line
    $alpha[] = chr(11); // vertical tab
    $alpha[] = chr(13); // carriage return

    for ($i = 32; $i <= 126; $i++) {
        $alpha[] = chr($i);
    }

    /* Remove comment to check ASCII ordinals */

    // /*
    // foreach ($alpha as $key => $val) {
    //     print ord($val);
    //     print '<br/>';
    // }
    // print '<hr/>';
    //*/
    //
    // // Test case #1
    //
    // $str = 'afsjdfhasjhdgljhasdlfy42we875y342q8957y2wkjrgSAHKDJgfcv kzXnxbnSXbcv   ' . chr(160) . chr(127) . chr(126);
    //
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';
    //
    // // Test case #2
    //
    // $str = '' . '©?™???';
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';
    //
    // $str = '©';
    // $string = teststr($alpha, $str);
    // print $string;
    // print '<hr/>';

    $file = 'http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt';
    $testfile = implode(chr(10), file($file));

    $string = teststr($alpha, $testfile);
    print $string;
    print '<hr/>';


    function teststr(&$alpha, &$str) {
        $strlen = strlen($str);
        $newstr = ''; // start empty so the result is not prefixed with a NUL byte
        $x = 0;

        if($strlen >= 2) {

            for ($i = 0; $i < $strlen; $i++) {
                $x++;
                if(in_array($str[$i], $alpha)) {
                    // Passed
                    $newstr .= $str[$i];
                }
                else {
                    // Failed
                    print 'Found out of scope character. (ASCII: ' . ord($str[$i]). ')';
                    print '<br/>';
                    $newstr .= '�';
                }
            }
        }
        elseif($strlen <= 0) {
            // Failed to qualify for test
            print 'Non-existent.';
        }
        elseif($strlen === 1) {
            $x++;
            if(in_array($str, $alpha)) {
                // Passed

                $newstr = $str;
            }
            else {
                // Failed
                print 'Total character failed to qualify.';
                $newstr = '�';
            }
        }
        else {
            print 'Non-existent (scope).';
        }

        if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
            // Skip
        }
        else {
            $newstr = utf8_encode($newstr);
        }

        // Test encoding:
        if(mb_detect_encoding($newstr, "UTF-8") == "UTF-8") {
            print 'UTF-8 :D<br/>';
        }
        else {
            print 'ENCODED: ' . mb_detect_encoding($newstr, "UTF-8") . '<br/>';
        }

        return $newstr . ' (scope: ' . $x . ', ' . $strlen . ')';
    }
轮廓§ 2024-09-26 07:39:04

There is a multibyte extension for PHP. See Multibyte String

You should try the mb_check_encoding() function.
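
As a rough illustration of how mb_check_encoding() could be combined with the U+FFFD replacement asked for in the question, here is a sketch. The function name is hypothetical, and it assumes the mbstring extension is loaded. It relies on mb_convert_encoding() from UTF-8 to UTF-8 substituting invalid sequences with whatever mb_substitute_character() is set to.

```php
<?php
// Hypothetical helper: returns the input unchanged when it is valid UTF-8,
// otherwise replaces invalid byte sequences with U+FFFD.
function utf8_replace_invalid(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character(0xFFFD);         // substitute with U+FFFD
    $fixed = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the setting
    return $fixed;
}

// Usage sketch: 0xFF can never occur in valid UTF-8.
$repaired = utf8_replace_invalid("abc\xFFdef");
```

Checking first keeps the common case (already valid input) cheap, and restoring the previous substitute character avoids side effects on other mbstring calls.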

倾听心声的旋律 2024-09-26 07:39:03

For completeness to this question (not necessarily the best answer)...

function as_utf8($s) {
    return mb_convert_encoding($s, "UTF-8", mb_detect_encoding($s));
}
温柔戏命师 2024-09-26 07:39:02

I put together a fairly simple class to check whether input is UTF-8 and to run it through utf8_encode() as needed:

class utf8
{

    /**
     * @param array $data
     * @return array
     */
    public static function encode(array $data)
    {
        foreach ($data as $key=>$val) {
            if (is_array($val)) {
                $data[$key] = self::encode($val);
            } else {
                if (false === self::check($val)) {
                    $data[$key] = utf8_encode($val);
                }
            }
        }

        return $data;
    }

    /**
     * Regular expression to test a string is UTF8 encoded
     * 
     * RFC3629
     * 
     * @param string $string The string to be tested
     * @return bool
     * 
     * @link http://www.w3.org/International/questions/qa-forms-utf-8.en.php
     */
    public static function check($string)
    {
        // preg_match() returns 1, 0 or false; normalize to a strict bool
        return 1 === preg_match('%^(?:
            [\x09\x0A\x0D\x20-\x7E]              # ASCII
            | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
            |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
            | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
            |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
            |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
            | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
            |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
            )*$%xs',
            $string);
    }
}

// For example
$data = utf8::encode($_POST);
堇年纸鸢 2024-09-26 07:39:00

Receiving invalid characters from your web application might have to do with the character sets assumed for HTML forms. You can specify which character set to use for forms with the accept-charset attribute:

<form action="..." accept-charset="UTF-8">

You might also want to look at similar questions on Stack Overflow for pointers on how to handle invalid characters, e.g., those in the column to the right. Still, I think that signaling an error to the user is better than trying to clean up the invalid characters, which can cause unexpected loss of significant data or unexpected changes to your user's input.
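
Signaling an error first requires knowing which fields are affected. A minimal sketch follows, assuming the mbstring extension is available; the function name and the bracketed naming of nested fields are illustrative choices, not part of any library.

```php
<?php
// Hypothetical helper: collect the names of all fields (including nested
// ones) whose string values are not valid UTF-8.
function find_invalid_utf8_fields(array $input, string $prefix = ''): array
{
    $invalid = [];
    foreach ($input as $key => $value) {
        $name = $prefix === '' ? (string) $key : $prefix . '[' . $key . ']';
        if (is_array($value)) {
            $invalid = array_merge($invalid, find_invalid_utf8_fields($value, $name));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $invalid[] = $name;
        }
    }
    return $invalid;
}

// Usage sketch with hand-made data standing in for $_POST.
$bad_fields = find_invalid_utf8_fields([
    'title' => 'ok',
    'body'  => "bad\xFF",
    'meta'  => ['tag' => "\xC3\x28"],
]);
```

The returned names can be used to highlight the offending inputs when the form is shown again, so the user keeps the rest of their text.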

梦断已成空 2024-09-26 07:38:59

The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Crappy form submission bots are a good example...

I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions. If you use iconv, you also have the option to transliterate bad characters.

Here is an example using iconv():

$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);

If you want to display an error message to your users I'd probably do this in a global way instead of a per value received basis. Something like this would probably do just fine:

function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!
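
Note that array_map() as used above only handles flat arrays, and iconv() with //IGNORE may return false on invalid input on some platforms. A hedged variant for nested input is sketched below. It swaps in mbstring instead of iconv, and both function names are hypothetical.

```php
<?php
// Hypothetical helper: strip invalid UTF-8 from one string using mbstring
// (a swapped-in technique, not the iconv() call used above).
function utf8_strip_invalid(string $s): string
{
    $previous = mb_substitute_character();
    mb_substitute_character('none');          // drop invalid sequences
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);       // restore the setting
    return $clean;
}

// Hypothetical helper: clean every string in a possibly nested array and
// report whether anything had to be changed.
function utf8_clean_array(array $input): array
{
    $changed = false;
    array_walk_recursive($input, function (&$value) use (&$changed) {
        if (is_string($value)) {
            $clean = utf8_strip_invalid($value);
            if ($clean !== $value) {
                $changed = true;
                $value = $clean;
            }
        }
    });
    return ['data' => $input, 'changed' => $changed];
}

// Usage sketch with hand-made data standing in for $_GET.
$result = utf8_clean_array(['q' => 'plain', 'tags' => ["x\xFFy"]]);
```

The changed flag plays the same role as the serialize() comparison above: it tells the caller when to set an error message.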

You may also want to normalize new lines and strip (non-)visible control chars, like this:

function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        return preg_replace('~\p{C}+~u', '', $string);
    }

    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}

Code to convert from UTF-8 to Unicode code points:

function Codepoint($char)
{
    $result = null;
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072

It is probably faster than any other alternative, but I haven't tested it extensively.


Example:

$string = 'hello world�';

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}

This may be what you were looking for.
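
For the json_encode() failures described in the question, PHP 7.2 and later also offer the JSON_INVALID_UTF8_SUBSTITUTE flag, which replaces invalid bytes with U+FFFD at encoding time. This is a short sketch with made-up sample data, assuming the stored data is left as-is and only the JSON output needs to be safe.

```php
<?php
// Sample data with an invalid byte: 0xFF can never occur in valid UTF-8.
$value = ['comment' => "caf\xFF"];

// Without flags, json_encode() fails on invalid UTF-8 and returns false.
$plain = json_encode($value);
$is_utf8_error = json_last_error() === JSON_ERROR_UTF8;

// With JSON_INVALID_UTF8_SUBSTITUTE, invalid bytes are encoded as U+FFFD.
$json = json_encode($value, JSON_INVALID_UTF8_SUBSTITUTE);
$decoded = json_decode($json, true);
```

JSON_INVALID_UTF8_IGNORE is the companion flag that drops the invalid bytes instead of substituting them.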
