Turkish character problem when plotting graphs

Posted 01-19 15:26

I have a dataset of tweets in Turkish. I'm trying to do text mining with the tm package and plot the resulting network with the igraph package in R.

library(tm)
library(igraph)

# Build the corpus
corpus <- iconv(deneme$text, to = "utf-8-mac")
corpus <- Corpus(VectorSource(corpus))

# Remove URLs and extra whitespace
removeURL <- function(x) gsub('http[[:alnum:]]*', '', x)
cleanset <- tm_map(corpus, content_transformer(removeURL))
cleanset <- tm_map(cleanset, stripWhitespace)

# Term-document matrix, keeping terms that occur more than 30 times
tdm <- TermDocumentMatrix(cleanset)
tdm <- as.matrix(tdm)
tdm <- tdm[rowSums(tdm) > 30, ]
tdm[tdm > 1] <- 1
termM <- tdm %*% t(tdm)

# Network
g <- graph.adjacency(termM, weighted = TRUE, mode = 'undirected')
g <- simplify(g)
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

# Plot
plot(g,
     vertex.color = 'green',
     vertex.size = 3,
     vertex.label.dist = 1.5)
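
For readers following the matrix step, here is a minimal sketch of what `tdm %*% t(tdm)` produces after binarizing. The three terms and their counts are invented for illustration and are not taken from my data.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# The terms and counts are hypothetical, for illustration only.
tdm <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("kitap", "okul", "ev"), c("d1", "d2", "d3"))
)

# Binarize: record only whether a term occurs in a document
tdm[tdm > 1] <- 1

# Term-by-term co-occurrence: entry [i, j] counts the documents
# that contain both term i and term j
termM <- tdm %*% t(tdm)
print(termM)
```

The diagonal holds the number of documents each term appears in, and the off-diagonal entries become the edge weights of the undirected graph.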

Output plot

Turkish characters such as "ş ğ ü" do not appear correctly. What might be the problem?

These are my RStudio locale settings:

Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
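
Besides the locale, it may help to inspect how R has marked the strings themselves. The following is only a diagnostic sketch using base R; the vector `x` is a stand-in I made up, and in practice the same checks would be applied to `deneme$text`. It is not a confirmed explanation of the problem.

```r
# Turkish characters written as Unicode escapes, so the example does not
# depend on the encoding of the script file itself
x <- c("\u015f", "\u011f", "\u00fc")  # ş, ğ, ü

# Declared encoding of each string: "UTF-8", "latin1", "bytes" or "unknown"
enc <- Encoding(x)

# TRUE where the underlying bytes form valid UTF-8
valid <- validUTF8(x)

# Unicode code point of each character
code_points <- vapply(x, utf8ToInt, integer(1), USE.NAMES = FALSE)

print(data.frame(enc, valid, code_points))
```

If the real data turned out to be marked "unknown" or failed the `validUTF8()` check, that would suggest the text was read with the wrong encoding before it ever reached `iconv()`.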

Comments (1)

不及他 2025-01-26 15:26:24

I tried applying iconv() to the characters ş, ğ and ü with every encoding available in iconvlist(), but none of them made the characters print correctly on both the R console and the plot. This is the code I used:

encoded_text <- list()
encodings <- iconvlist()
for (i in seq_along(encodings)) {
  tryCatch(
    {
      # Convert each character to the i-th encoding and print the result
      encoded_text[[i]] <- unlist(lapply(c("ş", "ğ", "ü"), iconv,
        to = encodings[i]
      ))
      print(encoded_text[[i]])
    },
    # Report encodings that iconv() cannot convert to, then carry on
    error = function(e) message(conditionMessage(e))
  )
}

# Show all the results
encoded_text

I also tried utf8_print("ş,ğ,ü") from the utf8 package, but that failed as well.

Finally, I found the readtext package. With it, these characters print properly on both the console and the plot on my computer. However, the current version of the package (v0.81) can only read from a file, not from a character vector. To use it, I typed the characters in Notepad, separated by commas, and saved the file with a .txt extension.

Then, I used this code to extract these characters:

library(readtext)
mytext <- readtext('turkish_text.txt', encoding = 'utf-8')
mytext <- unlist(strsplit(mytext$text, ","))
mytext
#[1] "ş" "ğ" "ü"
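
If the text already lives in a character vector, it might be possible to skip Notepad by writing the vector to a temporary UTF-8 file first. The sketch below is an untested-on-Windows assumption on my part: it uses only base R, with readLines() standing in for readtext(), and the names `chars` and `path` are hypothetical.

```r
# Characters given as Unicode escapes: ş, ğ, ü
chars <- c("\u015f", "\u011f", "\u00fc")

# Write them, comma-separated, to a temporary file as raw UTF-8 bytes
path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(paste(chars, collapse = ",")), path, useBytes = TRUE)

# Read the file back, marking the text as UTF-8, and split on commas
line <- readLines(path, encoding = "UTF-8")
mytext <- unlist(strsplit(line, ",", fixed = TRUE))
print(mytext)
```

The same temporary file could then be passed to readtext() with encoding = 'utf-8', as in the code above.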

They print properly on the console. I then tried to use them as labels in the plot of an igraph object.

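library(igraph)

# 3 x 3 adjacency matrix, one vertex per character
adjm <- matrix(1:9, ncol = 3)
g1 <- graph_from_adjacency_matrix(adjm)
# Use the Turkish characters as vertex names, which plot() shows as labels
g1 <- g1 %>% set_vertex_attr("name", value = mytext)
plot(g1)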

In the resulting plot, the characters are printed correctly.

Of course, there is no guarantee that this approach will work for other Turkish characters, but I think it is worth trying.
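
To check the remaining Turkish letters, a test vector can be built from Unicode code points rather than typed by hand. This is a rough sketch rather than something I ran as part of the answer above; the igraph part is optional and only runs if the package is installed, and `turkish` is a name I chose for illustration.

```r
# Code points of the Turkish letters outside ASCII:
# ç Ç ğ Ğ ı İ ö Ö ş Ş ü Ü
code_points <- c(0xE7, 0xC7, 0x11F, 0x11E, 0x131, 0x130,
                 0xF6, 0xD6, 0x15F, 0x15E, 0xFC, 0xDC)

# One UTF-8 string per code point
turkish <- intToUtf8(code_points, multiple = TRUE)
print(turkish)

# Optional: use them as vertex names if igraph is installed
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::make_ring(length(turkish))
  g <- igraph::set_vertex_attr(g, "name", value = turkish)
  # plot(g) would then show whether the graphics device renders each label
  print(igraph::V(g)$name)
}
```

Any label that shows up as a box or a question mark in the plot would point to the graphics device or font rather than to the strings themselves, since the strings are valid UTF-8 by construction.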
