Turkish characters display incorrectly when plotting a graph with igraph

Posted on 01-19 15:26

I have a dataset of tweets in Turkish. I am trying to do text mining with the tm package and plot the resulting term network with the igraph package in R.

    library(tm)
    library(igraph)

    # Build the corpus ("utf-8-mac" is a macOS-specific encoding name)
    corpus <- iconv(deneme$text, to = "utf-8-mac")
    corpus <- Corpus(VectorSource(corpus))

    # Remove URLs (everything from "http" up to the next whitespace)
    removeURL <- function(x) gsub('http[^[:space:]]*', '', x)
    cleanset <- tm_map(corpus, content_transformer(removeURL))
    cleanset <- tm_map(cleanset, stripWhitespace)

    # Term-document matrix: keep frequent terms and reduce counts to 0/1
    tdm <- TermDocumentMatrix(cleanset)
    tdm <- as.matrix(tdm)
    tdm <- tdm[rowSums(tdm) > 30, ]
    tdm[tdm > 1] <- 1

    # Term-by-term co-occurrence matrix
    termM <- tdm %*% t(tdm)

    # Network
    g <- graph_from_adjacency_matrix(termM, weighted = TRUE, mode = 'undirected')
    g <- simplify(g)
    V(g)$label <- V(g)$name
    V(g)$degree <- degree(g)

    # Plot
    plot(g,
         vertex.color = 'green',
         vertex.size = 3,
         vertex.label.dist = 1.5)
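To make the `tdm %*% t(tdm)` step concrete, here is a minimal base R sketch on a toy matrix. The term names `t1` to `t3`, the documents `d1` to `d3`, and the counts are all made up for illustration; they are not taken from the tweet data.

```r
# Toy binary term-document matrix: rows are terms, columns are documents.
# All names and values are hypothetical.
tdm <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    0, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("t1", "t2", "t3"), c("d1", "d2", "d3"))
)

# Entry [i, j] counts the documents that contain both term i and term j.
# The diagonal holds each term's own document frequency.
termM <- tdm %*% t(tdm)
print(termM)
```

The off-diagonal entries become edge weights when the matrix is turned into a weighted graph, and the diagonal entries would become self-loops, which `simplify()` removes by default.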

Output plot

Turkish characters such as "ş ğ ü" do not appear correctly. What might be the problem?

Here are my RStudio locale settings:

    Sys.getlocale()
    [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"



Answer by 不及他, 2025-01-26 15:26:24

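I tried applying iconv() to the characters ş, ğ and ü with every encoding listed by iconvlist(), but none of them made the characters print correctly both on the R console and on the plot. This is the code I used: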

    encodings <- iconvlist()
    encoded_text <- list()
    for (i in seq_along(encodings)) {
      tryCatch(
        {
          # Convert each character to the i-th encoding and print the result
          encoded_text[[i]] <- unlist(lapply(c("ş", "ğ", "ü"), iconv,
                                             to = encodings[i]))
          print(encoded_text[[i]])
        },
        # Some encodings cannot be used; report the error and carry on
        error = function(e) message(conditionMessage(e))
      )
    }

    # Show all the results
    encoded_text

I also tried utf8_print("ş,ğ,ü") from the utf8 package, but that failed as well.
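The conversion attempts above do not show which code points a string actually contains. The following is a hedged diagnostic sketch, not part of the original answer. A letter such as ş can be stored either as one precomposed code point or as a base letter followed by a combining mark, and some fonts or graphics devices may draw the second form badly. The `utf-8-mac` encoding name used in the question is understood to be a decomposing variant of UTF-8 on macOS, so this may be relevant here, but that is an assumption rather than a confirmed diagnosis. `has_combining_marks` is a hypothetical helper written only for this illustration.

```r
# Two spellings of the same visible letter:
precomposed <- "\u015f"    # single code point U+015F
decomposed  <- "s\u0327"   # "s" followed by combining cedilla U+0327

# utf8ToInt() returns the Unicode code points of one UTF-8 string.
print(utf8ToInt(precomposed))
print(utf8ToInt(decomposed))

# Hypothetical helper: TRUE if the string contains any code point from the
# Combining Diacritical Marks block (U+0300 to U+036F).
has_combining_marks <- function(x) {
  cp <- utf8ToInt(enc2utf8(x))
  any(cp >= 0x0300 & cp <= 0x036F)
}

print(has_combining_marks(precomposed))
print(has_combining_marks(decomposed))
```

If decomposed strings turn out to be the issue, normalizing the text to NFC before building the corpus, for example with `stringi::stri_trans_nfc()`, is one option worth trying.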

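Finally, I found the readtext package. On my computer it prints these characters correctly both on the console and on the plot. However, the current version of the package (v0.81) can only read files, not character vectors. So, to use it, I typed the characters into Notepad, separated by commas, and saved the file with a .txt extension.

Then I used this code to extract the characters: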

    library(readtext)
    mytext <- readtext('turkish_text.txt', encoding = 'utf-8')
    mytext <- unlist(strsplit(mytext$text, ","))
    mytext
    #[1] "ş" "ğ" "ü"

They print correctly on the console. Next, I tried to show them on the plot of an igraph object.

    library(igraph)

    # Small 3-vertex graph whose vertex names are the Turkish characters
    adjm <- matrix(1:9, ncol = 3)
    g1 <- graph_from_adjacency_matrix(adjm)
    g1 <- g1 %>% set_vertex_attr("name", value = mytext)
    plot(g1)

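Here is the resulting plot:

The characters are drawn correctly on the plot.

Of course, there is no guarantee that this approach works for other Turkish characters, but I think it is worth trying.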
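As a variation on the file-based workaround above, the same round trip can be sketched in base R without Notepad or readtext. This is an illustration under assumptions, not a confirmed fix: the letters are written as Unicode escapes so that the encoding of the script file itself does not matter, and the temporary file path is created only for this example.

```r
# Turkish letters as Unicode escapes: U+015F, U+011F, U+00FC.
labels <- c("\u015f", "\u011f", "\u00fc")

# Write them to a temporary text file as raw UTF-8 bytes.
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(paste(labels, collapse = ",")), path, useBytes = TRUE)

# Read the file back, declaring the input as UTF-8, and split on commas.
line <- readLines(path, encoding = "UTF-8")
mytext <- unlist(strsplit(line, ",", fixed = TRUE))
print(mytext)
```

The resulting `mytext` vector can then be passed to `set_vertex_attr("name", value = mytext)` in the same way as the readtext output. Whether the labels render correctly still depends on the font and graphics device in use.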

