I'm using Google Translate with Perl to convert some error codes into Farsi. Farsi is just one example; I've found this issue with other languages too, but for this discussion I'll stick to a single example:
The translated text for "Geometry data card error" works fine (Example 1), but translating "Appending a default 111 card" (Example 2) gives the "Wide character" error.
Both examples can be run from the terminal; they are just print statements.
I've tried the usual things like these, but to no avail:
use utf8;
use open ':std', ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
Example 1: This works
perl -Mutf8 -le 'print "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}\x{db}\x{8c} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d8}\x{af}\x{d8}\x{a7}\x{d8}\x{af}\x{d9}\x{87} \x{d9}\x{87}\x{d9}\x{86}\x{d8}\x{af}\x{d8}\x{b3}\x{db}\x{8c}"'
خطای کارت داده هندسی
Example 2: This produces Wide char warnings and prints noise
perl -Mutf8 -le 'print "\x{d8}\x{a7}\x{d9}\x{81}\x{d8}\x{b2}\x{d9}\x{88}\x{d8}\x{af}\x{d9}\x{86} \x{db}\x{8c}\x{da}\x{a9} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}\x{200c}\x{d9}\x{81}\x{d8}\x{b1}\x{d8}\x{b6} 111"'
Wide character in print at -e line 1.
# <terminal noise, not Farsi text>
Using curl
If I do the same request with curl, I get this:
curl 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=fa&hl=fa&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&tk=xxxx&dt=dj&q=%41%70%70%65%6E%64%69%6E%67%20%61%20%64%65%66%61%75%6C%74%20%31%31%31%20%63%61%72%64'
[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card",null,null,3,null,null,[[]],[[["982c75c78c6c8e6005ec3a4021a7f785","tea_GrecoIndoEuropeA_en2elfahykakumksq_2021q3.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]
Notice the \u200c in the JSON output above, which is the "Zero Width Non-Joiner" Unicode character. When JSON::from_json parses the \u200c, it blows up:
perl -Mutf8 -MJSON -e 'print from_json("[\"\\u200c\"]")->[0];'
Wide character in print at -e line 1.
I can "fix" it like this:
my $c = $res->content;           # raw HTTP response body
$c =~ s/\\u[0-9a-f]{4}//;        # strip the first \uXXXX escape from the JSON text
my $json = from_json($c);
and then the output text is correct (right-to-left):
افزودن یک کارت پیشفرض 111
Question: What is going on here?
- Is this a bug in Perl or in JSON?
- Should \u200c be parsed properly in some other way?
2 Answers
There's a lot of stuff going on here. I think a lot of it, especially in the first two examples, stems from not understanding the difference between perl's two string modes (byte oriented and Unicode codepoint oriented).
Example 1 is a raw byte string holding bytes that happen to be UTF-8 encoded, and they are passed through unchanged; as long as the terminal that's displaying the output is expecting UTF-8, they'll be rendered correctly. Example 2 has a 'wide' character (with a value greater than 255), making it a Unicode string, where each character represented by an \x{NN} number greater than 127 is a Unicode codepoint that is encoded as multiple bytes in UTF-8. Printing this causes mojibake and a warning because standard output is byte oriented, without a translation layer. As I suggested in a comment, reading perluniintro (and the other Unicode-related documentation) is a good start for learning how things work.
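For instance, Example 2 prints cleanly once the text is written entirely as Unicode codepoints and STDOUT gets a UTF-8 translation layer. A minimal sketch, assuming a UTF-8 terminal (the codepoints spell the same پیش‌فرض 111 fragment as the byte escapes above):
perl -CO -Mutf8 -le 'print "\x{67e}\x{6cc}\x{634}\x{200c}\x{641}\x{631}\x{636} 111"'
پیش‌فرض 111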
But on to the actual task, extracting text from the JSON returned by your curl command... I'd use jq instead if this is for a shell script, or the equivalent perl one-liner; a sketch of both follows.
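Something like this, assuming the array-shaped response shown in the curl output above (the translated string is the innermost array's first element) and with the long translate URL abbreviated:
curl -s 'https://translate.googleapis.com/translate_a/single?...&q=...' | jq -r '.[0][0][0]'
And the perl counterpart, where -CS means from_json() receives an already-decoded character string and the result prints through a UTF-8 layer:
curl -s 'https://translate.googleapis.com/translate_a/single?...&q=...' | perl -CS -MJSON -0777 -nE 'say from_json($_)->[0][0][0]'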
The -CS argument tells perl that standard input, output, and error are all UTF-8 encoded. You could also use -CO to make that just standard output, and use decode_json() instead, which expects raw UTF-8 encoded bytes instead of a Unicode string. And in a script instead of a one-liner, using the OO interface to JSON and tuning how input strings should be encoded using its methods, plus the open pragma (or binmode or an encoding layer for open) instead of the -C option, is the way to go.
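For a script, a minimal sketch along those lines (reading the response from STDIN and the ->[0][0][0] path are assumptions based on the curl output above):
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # decode STDIN and encode STDOUT/STDERR as UTF-8
use JSON;

# STDIN now yields decoded character strings, so leave the JSON object's utf8
# option at its default (off) and hand decode() characters, not bytes. If you
# feed it raw UTF-8 bytes instead (e.g. an HTTP response body), enable ->utf8 first.
my $json_text = do { local $/; <STDIN> };      # e.g. piped in from the curl command
my $data      = JSON->new->decode($json_text);

print $data->[0][0][0], "\n";                  # the translated string, ZWNJ intact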
The JSON object needs to have utf8 enabled and it will fix the \u200c. Thanks to @Shawn for pointing me in the right direction:
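A minimal sketch of what that looks like (the $res response object and the ->[0][0][0] path are assumptions carried over from the snippets above):
binmode STDOUT, ':encoding(UTF-8)';        # encode wide characters on output

my $json = JSON->new->utf8;                # utf8 enabled: decode() expects raw UTF-8 bytes
my $data = $json->decode($res->content);   # $res->content is the undecoded HTTP body
print $data->[0][0][0], "\n";              # \u200c comes back as a real ZERO WIDTH NON-JOINER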
Now the JSON-formatted text content like \u200c is correctly transliterated to \xe2\x80\x8c when returning the JSON hash.