How to correctly handle escaped Unicode characters in R, e.g. the em dash (—)

Posted on 2025-01-04 01:42:54

I'm having trouble handling escaped Unicode characters in R, specifically those encountered when grabbing information from the MediaWiki API. I get a JSON string like

{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}

which should be perfectly valid, but when I read it in through fromJSON() I get:

snip...
[1] "Banach\023Tarski paradox"

Initially I thought this was just a problem with RJSONIO, but I encounter similar problems with scan() and readLines(). My guess is that I am missing something very basic.

I can't actually give a completely reproducible example using only R, because if I send "em\u2013dash" to a file through write() (or some equivalent function), R will automatically convert the escape into the dash character (strictly speaking, U+2013 is an en dash, not an em dash). So here goes. Create a text file named test1 with the following:

"em\u2013dash" "em–dash" " em \u2013 dash"

Then load up R (for whatever the file path is):

> scan( file = "~/R/test1", what = "character", encoding = "UTF-8")
Read 3 items
[1] "em\\u2013dash"    "em–dash"          " em \\u2013 dash"
> readLines("~/R/test1", warn = FALSE, encoding = "UTF-8")
[1] "\"em\\u2013dash\" \"em–dash\" \" em \\u2013 dash\""

The added escape character is what causes my problems with fromJSON(). I could just strip them out but I'd probably break something else in the process and I imagine there is an easier solution. Thanks.
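For what it's worth, the test file can probably be produced from R itself by escaping the backslash in the string literal, so that the six characters \u2013 are written rather than the dash. A minimal sketch, assuming a temporary file in place of ~/R/test1 (the variable names are made up for illustration):

```r
# Assumption: a temp file stands in for ~/R/test1.
path <- tempfile()

# "\\u2013" in R source is a backslash followed by u2013, not the dash itself.
writeLines('"em\\u2013dash" " em \\u2013 dash"', path)

# scan() and readLines() do not interpret \u escapes; they return the raw text.
items <- scan(path, what = "character", encoding = "UTF-8", quiet = TRUE)
line  <- readLines(path, warn = FALSE, encoding = "UTF-8")

print(items)       # print() shows each backslash doubled
nchar(items[1])    # the string itself holds only one backslash
```

If this behaves as expected, the doubled backslash in the output above is only how print() displays the string, not something scan() added.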

Here's the session info:

R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] C/en_US.UTF-8/C/C/C/C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RJSONIO_0.98-0

loaded via a namespace (and not attached):
[1] tools_2.14.1

撩起发的微风 2025-01-11 01:42:54

This is not in fact a bug in RJSONIO. It is designed to expect a string that has been read by R and in which the non-ASCII characters have already been processed. When one passes it a string containing a literal \u, the escape has not been processed; the backslash itself has been escaped.
On my machine with a locale set to en_US.UTF-8, the command

fromJSON('{"query":{"categorymembers":[{"ns":0,"title":"Banach\u2013Tarski paradox"}]}}')

produces

$query
$query$categorymembers
$query$categorymembers[[1]]
$query$categorymembers[[1]]$ns
[1] 0

$query$categorymembers[[1]]$title
[1] "Banach–Tarski paradox"

Note that the character is prefixed by \u not \\u.
See how it appears in R when you simply enter that string.

So the issue is upstream of fromJSON(): why does the string still contain a literal \u?
I may add support in RJSONIO for handling such unprocessed strings.
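The distinction between a processed and an escaped string can be checked directly at the console. A small sketch (the variable names are only illustrative):

```r
processed <- "Banach\u2013Tarski"    # R's parser turns \u2013 into one character
escaped   <- "Banach\\u2013Tarski"   # backslash + "u2013": six separate characters

nchar(processed)                      # the dash counts as a single character
nchar(escaped)                        # five characters longer
utf8ToInt(substr(processed, 7, 7))    # code point of the dash (0x2013)
grepl("\\u", escaped, fixed = TRUE)   # a literal backslash-u is still present
grepl("\\u", processed, fixed = TRUE) # no literal backslash-u here
```

A string read from a file or an HTTP response is in the second form, since nothing has parsed the escape yet.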

城歌 2025-01-11 01:42:54

It is a bug in RJSONIO as you can clearly see:

> RJSONIO::fromJSON('{"x":"foo\\u2013bar"}')
           x 
"foo\023bar"

It works just fine in rjson:

> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo–bar"

and to prove it is the correct value:

 > Sys.setlocale("LC_ALL", "C")
[1] "C/C/C/C/C/en_US.UTF-8"
> rjson::fromJSON('{"x":"foo\\u2013bar"}')
$x
[1] "foo<U+2013>bar"

In your analysis you confused the printed representation of a string with the string itself. print() quotes and escapes its contents for display; to see the actual string, use cat() or charToRaw(). Also, scan() doesn't interpret any escapes, so you get back exactly what you give it.
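To illustrate the difference between the printed and the actual string, here is a short sketch using a string like the one scan() returns:

```r
s <- "foo\\u2013bar"   # what scan()/readLines() hand back for the text foo\u2013bar

print(s)        # [1] "foo\\u2013bar"  (escaped for display)
cat(s, "\n")    # foo\u2013bar         (the actual characters)
charToRaw(s)    # contains a single 5c byte for the backslash
nchar(s)        # 12
```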

深居我梦 2025-01-11 01:42:54

I think the underlying problem is that the libjson option JSON_UNICODE is not enabled in RJSONIO. However it seems like the problem does not manifest itself when the input is UTF-8 encoded:

library(RJSONIO)
x = "北京填鴨们"
identical(x, fromJSON(toJSON(x)))
# [1] TRUE

The problem only appears when the input uses JSON-escaped characters. In these cases, RJSONIO seems to generate latin1 output but doesn't mark the encoding correctly:

x <- fromJSON('["Z\\u00FCrich"]')
print(x)
# [1] "Z\xfcrich"

nchar(x)
#Error in nchar(x) : invalid multibyte string 1

For this simple example we can fix it by manually setting the encoding to latin1:

#Set the correct encoding
Encoding(x) <- "latin1"
print(x)
#[1] "Zürich"

However, this of course won't work for characters outside the latin1 set:

#This should be: "填"
fromJSON('["\\u586B"]')
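One possible workaround is to decode the \uXXXX sequences in the raw JSON text before handing it to the parser. The helper below is a hypothetical sketch in base R, not part of RJSONIO. It assumes four-digit escapes, and it deliberately leaves ASCII escapes (such as \u0022) and surrogate halves untouched, because decoding them here could break the JSON or produce invalid characters:

```r
# Hypothetical helper: replace \uXXXX escapes with real UTF-8 characters.
unescape_unicode <- function(x) {
  m <- gregexpr("\\\\u[0-9A-Fa-f]{4}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    code <- strtoi(substring(esc, 3), base = 16L)   # drop the leading \u
    out  <- intToUtf8(code, multiple = TRUE)
    # Keep ASCII escapes and surrogate halves as-is for the JSON parser.
    keep <- code < 128L | (code >= 0xD800 & code <= 0xDFFF)
    out[keep] <- esc[keep]
    out
  })
  x
}

json <- '{"title":"Banach\\u2013Tarski paradox","char":"\\u586B"}'
decoded <- unescape_unicode(json)
cat(decoded, "\n")
```

The decoded text is UTF-8, which is the case reported above as round-tripping correctly through RJSONIO; whether that holds for every input has not been verified here.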
