How to correctly handle escaped Unicode characters in R, e.g. the em dash (—)
I'm having trouble handling escaped unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I would find a JSON string like
{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}
Which should be perfectly valid, but when read in through fromJSON() I get:
snip...
[1] "Banach\023Tarski paradox"
Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.
I can't actually give a completely reproducible example using only R because if I send "em\u2013dash" to a file through write() (or some equivalent function) R will automatically convert the em dash. So here goes. Create a text file named test1 with the following:
"em\u2013dash" "em–dash" " em \u2013 dash"
Then load up R (for whatever the file path is):
> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash" "em–dash" " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""
The added escape character is what causes my problems with fromJSON(). I could just strip them out, but I'd probably break something else in the process, and I imagine there is an easier solution. Thanks.
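As an aside, the test file can probably be produced from R itself by doubling the backslash in the string literal, so that a literal backslash followed by u2013 is written rather than the dash. This is only a sketch: the temporary path and the variable names are illustrative and not part of the original setup.

```r
# Hypothetical location for the test file; any writable path would do.
path <- tempfile("test1")

# In R source, "\\u2013" is a literal backslash followed by "u2013",
# while "\u2013" is the dash character itself.
contents <- "\"em\\u2013dash\" \"em\u2013dash\" \" em \\u2013 dash\""

# Write the UTF-8 bytes unchanged, whatever the current locale is.
writeLines(enc2utf8(contents), path, useBytes = TRUE)

# Read the file back the same way as in the transcript above.
items <- scan(path, what = "character", encoding = "UTF-8", quiet = TRUE)
print(items)
```

With the file written this way, the scan() call should return the same three items as in the transcript.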
Here's the session info:
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] C/en_US.UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RJSONIO_0.98-0
loaded via a namespace (and not attached):
[1] tools_2.14.1
Comments (3)
This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and which has the non-ASCII characters already processed. When one passes it a string with \u, that has not been processed but escaped.
On my machine with a locale set to en_US.UTF-8, the command
produces
Note that the character is prefixed by \u, not \\u. See how it appears in R when you simply enter that string.
So the issue is upstream of fromJSON() as to why the string contains \u.
I may add support in RJSONIO for handling such unprocessed strings.
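Until something like that exists, one possible workaround is to decode the \uXXXX sequences before handing the string on. The following is a minimal base R sketch, not anything RJSONIO provides: unescape_unicode is a hypothetical helper name, and it assumes plain four-digit escapes only. It does not handle surrogate pairs, and it cannot tell a genuine escape from an escaped backslash followed by the letter u.

```r
# Hypothetical helper: replace each literal \uXXXX sequence with the
# character it names. Assumes four hex digits and no surrogate pairs.
unescape_unicode <- function(x) {
  # Find every backslash + "u" + four hex digits in each element of x.
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    # Drop the leading backslash and "u", then parse the hex digits.
    codes <- strtoi(substring(esc, 3), base = 16L)
    # Convert each code point to a one-character UTF-8 string.
    vapply(codes, intToUtf8, character(1))
  })
  x
}

decoded <- unescape_unicode("em\\u2013dash")
print(decoded)
nchar(decoded)
```

Strings without any escape pass through unchanged, so the helper can be applied to a whole character vector returned by readLines().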
It is a bug in RJSONIO, as you can clearly see:
It works just fine in rjson:
and to prove it is the correct value:
In your analysis you got confused by printed strings vs. actual strings. print quotes its content for printing; if you want to see the actual string, you can use cat or charToRaw. Also, scan doesn't interpret any escapes, so you get what you give it.
I think the underlying problem is that the libjson option JSON_UNICODE is not enabled in RJSONIO. However, the problem does not seem to manifest itself when the input is UTF-8 encoded:
The problem only appears when the input uses JSON escaped characters. In these cases, RJSONIO seems to generate latin1 output, but doesn't set the encoding mark correctly:
For this simple example we can fix it by manually setting the encoding to latin1:
However, this of course won't work for characters outside the latin1 set:
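The effect of a missing encoding mark can be illustrated without RJSONIO. This is a sketch under assumptions: the byte 0xE9 (é in latin1) is an example chosen here, not taken from the answer. The last lines show that keeping only the low byte of 0x2013 gives the byte that R prints as "\023", which is at least consistent with the output in the question, though it does not prove that this is what the library does.

```r
# One byte, 0xE9, with no encoding mark: R treats it as native encoding.
x <- rawToChar(as.raw(0xe9))
Encoding(x)                 # "unknown"

# Declare the byte to be latin1, then convert to UTF-8.
y <- x
Encoding(y) <- "latin1"
code_point <- utf8ToInt(enc2utf8(y))
print(code_point)           # the code point of U+00E9

# U+2013 lies above 0xFF, so no single latin1 byte can represent it.
utf8ToInt(enc2utf8("\u2013")) > 0xFF

# Keeping only the low byte of 0x2013 leaves 0x13, written "\023" in octal.
low_byte <- bitwAnd(0x2013L, 0xFFL)
print(as.octmode(low_byte))
```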