How to correctly handle escaped Unicode characters in R, e.g. the em dash (—)

Posted on 2025-01-04 01:42:54

I'm having trouble handling escaped Unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I get a JSON string like

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}

This should be perfectly valid, but when I read it in through fromJSON() I get:

snip...
[1] "Banach\023Tarski paradox"

Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.

I can't actually give a completely reproducible example using only R because if I send "em\u2013dash" to a file through write() (or some equivalent function) R will automatically convert the em dash. So here goes. Create a text file named test1 with the following:

"em\u2013dash" "em–dash" " em \u2013 dash"

Then load up R (for whatever the file path is):

> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash"    "em–dash"          " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""

The added escape character is what causes my problems with fromJSON(). I could just strip the extra backslashes out, but I'd probably break something else in the process, and I imagine there is an easier solution. Thanks.

Here's the session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RJSONIO_0.98-0

loaded via a namespace (and not attached):
[1] tools_2.14.1

Comments (3)

撩起发的微风 2025-01-11 01:42:54

This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and in which the non-ASCII characters have already been processed. When you pass it a string containing a literal \u, the character has not been processed; it is still escaped.
On my machine with a locale set to en_US.UTF-8, the command

fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')

produces

$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0

$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"

Note that the character is prefixed by \u not \\u.
See how it appears in R when you simply enter that string.

So the issue is upstream of fromJSON(): the question is why the string still contains a literal \u by the time it gets there.
I may add support in RJSONIO for handling such unprocessed strings.
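
The difference between a processed and an escaped string can be checked in base R. The following is a minimal sketch, not part of the original answer; the variable names are illustrative only. Note that U+2013 is, strictly speaking, the en dash.

```r
# "\u2013" in R source is parsed by R into a single character (U+2013).
processed <- "Banach\u2013Tarski"

# "\\u2013" is a literal backslash followed by the five characters u2013.
escaped <- "Banach\\u2013Tarski"

# The processed string is 6 + 1 + 6 characters long,
# the escaped one is 6 + 6 + 6 characters long.
n_processed <- nchar(processed, type = "chars")
n_escaped   <- nchar(escaped, type = "chars")

# Code point of the 7th character of the processed string (0x2013 = 8211).
dash_codepoint <- utf8ToInt(processed)[7]

# The same position in the escaped string starts a six-character escape sequence.
escape_sequence <- substr(escaped, 7, 12)

cat(n_processed, n_escaped, dash_codepoint, escape_sequence, "\n")
```

A string typed at the R prompt with a single backslash is of the first kind; a string read from a file or a network response that contains the text \u2013 is of the second kind.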

城歌 2025-01-11 01:42:54

It is a bug in RJSONIO as you can clearly see:

> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
           x 
"foo\023bar" 

It works just fine in rjson:

> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"

and to prove it is the correct value:

> Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"

In your analysis you confused printed strings with actual strings. print() quotes and escapes its content for display, so a single backslash in the string is shown as \\. If you want to see the actual string, you can use cat or charToRaw. Also, scan doesn't interpret any escapes, so you get exactly what you give it.
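
The point about printed versus actual strings can be reproduced using only R. The following is a minimal sketch under the assumption that a temporary file stands in for the test1 file from the question; it is not part of the original answer.

```r
# Write the literal text em\u2013dash (with a real backslash) to a temporary file.
path <- tempfile(fileext = ".txt")
writeLines("em\\u2013dash", path)

# readLines() does not interpret the escape; it returns the characters as they are.
x <- readLines(path, warn = FALSE, encoding = "UTF-8")

print(x)       # displays "em\\u2013dash": print() escapes the backslash for display
cat(x, "\n")   # displays em\u2013dash: the characters actually in the string

# The string holds 12 characters: e, m, \, u, 2, 0, 1, 3, d, a, s, h.
n_chars <- nchar(x)

# The third byte is a single backslash (0x5c), not two.
third_byte <- charToRaw(x)[3]

unlink(path)
```

So the doubled backslash seen in the question's scan() and readLines() output is only how print() displays one backslash; nothing was added to the data.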

深居我梦 2025-01-11 01:42:54

I think the underlying problem is that the libjson option JSON_UNICODE is not enabled in RJSONIO. However, the problem does not seem to manifest itself when the input is already UTF-8 encoded:

library(RJSONIO)
x = "北京填鴨们"
identical(x, fromJSON(toJSON(x)))
# [1] TRUE

The problem only appears when the input uses JSON escaped characters. In these cases, RJSONIO seems to generate latin1 output, but doesn't mark the encoding correctly:

x <- fromJSON('["Z\\u00FCrich"]')
print(x)
# [1] "Z\xfcrich"

nchar(x)
#Error in nchar(x) : invalid multibyte string 1

For this simple example we can fix it by manually setting the encoding to latin1:

#Set the correct encoding
Encoding(x) <- "latin1"
print(x)
#[1] "Zürich" 

However, this of course won't work for characters outside the latin1 set:

#This should be: "填"
fromJSON('["\\u586B"]')
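
One possible workaround, sketched here in base R only, is to decode the \uXXXX escapes yourself. The helper name unescape_unicode is hypothetical and not part of RJSONIO or any other package. It is a rough sketch with known limits: it does not handle surrogate pairs, it would wrongly decode an escape that follows an escaped backslash, and running it on a whole JSON document before parsing could break the document if an escape decodes to a quote or a backslash.

```r
# Hypothetical helper: replace each \uXXXX escape in a single string
# with the character it encodes. Assumes escapes in the Basic Multilingual
# Plane only (no surrogate pairs).
unescape_unicode <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)

  # A literal backslash, the letter u, then exactly four hex digits.
  pattern <- "\\\\u[0-9A-Fa-f]{4}"
  m <- gregexpr(pattern, x)

  # Convert every matched escape to its UTF-8 character.
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    vapply(hits, function(h) {
      intToUtf8(strtoi(substring(h, 3), base = 16L))
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

title <- unescape_unicode("Banach\\u2013Tarski paradox")
cjk   <- unescape_unicode("\\u586B")
plain <- unescape_unicode("no escapes here")
```

Unlike the latin1 fix, this also covers characters such as U+586B, since intToUtf8() returns UTF-8 encoded strings.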