Detecting the unknown Unicode character "�" in PHP

Posted on 2024-10-09 14:21:54

Is there any way in PHP of detecting the following character: �?

I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect whether � is present in a string. How do I do so with strpos?

Simply pasting the character into my codebase does not seem to work.

if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)

Comments (4)

一梦等七年七年为一梦 2024-10-16 14:21:54

Converting a string from UTF-8 to UTF-8 with iconv() and the //IGNORE flag produces a result in which invalid UTF-8 byte sequences are dropped.

You can therefore detect a broken character by comparing the length of the string before and after the iconv() call. If the lengths differ, the string contained at least one broken character.

Test case (make sure you save the file as UTF-8):

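<?php

header("Content-type: text/html; charset=utf-8");

$teststring = "Düsseldorf";

// Deliberately create a broken string by re-encoding the original
// as ISO-8859-1 (utf8_decode() did this, but is deprecated as of PHP 8.2)
$teststring_broken = mb_convert_encoding($teststring, "ISO-8859-1", "UTF-8");

echo "Broken string: " . $teststring_broken;

echo "<br>";

// //IGNORE asks iconv to drop byte sequences that are not valid UTF-8
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken);

echo $teststring_converted;

echo "<br>";

// iconv() returns false on failure; treat that as invalid input too
if ($teststring_converted === false
    || strlen($teststring_converted) != strlen($teststring_broken)) {
    echo "The string contained an invalid character";
}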

In theory, you could drop //IGNORE and simply test for a failed iconv() call (one that returns false), but iconv might fail for reasons other than invalid characters; I don't know. I would use the comparison method.
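One caveat with the comparison approach: it flags invalid byte sequences, whereas a string that already contains the replacement character U+FFFD is valid UTF-8 and would pass unchanged. Below is a minimal sketch that checks for both cases, assuming PHP 7 or later with the mbstring extension. It swaps the iconv comparison for mb_check_encoding(), and the function names are hypothetical, not part of the original answer.

```php
<?php

// Hypothetical helper: true if the string already contains U+FFFD,
// the Unicode replacement character. Spelling it as raw bytes avoids
// depending on how the source file itself is encoded.
function containsReplacementChar(string $s): bool
{
    return strpos($s, "\xEF\xBF\xBD") !== false; // UTF-8 bytes of U+FFFD
}

// Hypothetical helper: true if the string is not well-formed UTF-8.
// Uses mb_check_encoding() instead of the iconv length comparison.
function hasInvalidUtf8(string $s): bool
{
    return !mb_check_encoding($s, 'UTF-8');
}

$valid  = "D\u{FC}sseldorf";   // "Düsseldorf" encoded as UTF-8
$broken = "D\xFCsseldorf";     // the same name as raw ISO-8859-1 bytes
$marked = "D\u{FFFD}sseldorf"; // already damaged: contains U+FFFD

var_dump(hasInvalidUtf8($valid));           // bool(false)
var_dump(hasInvalidUtf8($broken));          // bool(true)
var_dump(hasInvalidUtf8($marked));          // bool(false): U+FFFD is valid UTF-8
var_dump(containsReplacementChar($marked)); // bool(true)
```

The two checks answer different questions: the first finds bytes that were never valid UTF-8, the second finds text that some earlier decoding step has already replaced with �.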

成熟稳重的好男人 2024-10-16 14:21:54

Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:

    // mb_detect_encoding() returns false if no candidate encoding matches
    $encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
    if ($encoding !== false && strcasecmp($encoding, 'UTF-8') !== 0) {
      $str = iconv($encoding, 'utf-8', $str);
    }
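
A variation on the same idea, as a sketch only: instead of detecting the encoding, validate the string as UTF-8 first and convert only when that fails. It assumes the mbstring extension. The helper name `toUtf8` and the ISO-8859-1 fallback are assumptions for illustration; the fallback is a guess about the data, not something the code can verify.

```php
<?php

// Hypothetical helper: return $str as UTF-8. Well-formed UTF-8 is
// returned unchanged; anything else is assumed (not detected) to be
// in $fallback and is converted from that encoding.
function toUtf8(string $str, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

$latin1 = "D\xFCsseldorf"; // "Düsseldorf" as ISO-8859-1 bytes
$utf8   = toUtf8($latin1);

var_dump($utf8 === "D\u{FC}sseldorf"); // bool(true)
var_dump(toUtf8($utf8) === $utf8);     // bool(true): valid UTF-8 is left alone
```

Checking validity first means a string that is already UTF-8 is never converted a second time, which is one common source of doubly encoded text.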
森末i 2024-10-16 14:21:54

As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.

菊凝晚露 2024-10-16 14:21:54

I use a custom method (based on str_replace) to sanitize undefined characters:

    $text='a³';

    // Protect line breaks with placeholders so the filter does not strip them
    $text=str_replace("\n\n",  "sample000"        ,$text);
    $text=str_replace("\n",    "sample111"        ,$text);

    // Encode HTML special characters and strip characters below ASCII 32
    $text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);

    // Restore the line breaks as HTML
    $text=str_replace("sample000",  "<br/><br/>"  ,$text);
    $text=str_replace("sample111",  "<br/>"       ,$text);

    echo $text; //outputs ------------>   a3
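
For comparison, here is a sketch of a similar clean-up without placeholder strings. It is a different technique from the one above, not a rewrite of it: it assumes PHP 7.2 or later with mbstring (for mb_scrub()), and the function name `sanitizeForHtml` is hypothetical.

```php
<?php

// Hypothetical helper: make text safe to print in an HTML page while
// keeping line breaks, without placeholder substitution.
function sanitizeForHtml(string $text): string
{
    // Replace ill-formed UTF-8 sequences with the substitute character.
    $text = mb_scrub($text, 'UTF-8');

    // Drop control characters below 0x20, keeping tab and newline
    // (carriage returns are removed as well).
    $text = preg_replace('/[\x00-\x08\x0B-\x1F]/u', '', $text);

    // Escape HTML metacharacters, then mark newlines with <br /> tags.
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

echo sanitizeForHtml("first line\nsecond <b>line</b>");
```

Scrubbing before escaping matters here, because htmlspecialchars() returns an empty string by default when its input is not valid in the given encoding.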