Replacing words in R data.frames (text mining)

Posted 2024-11-26 17:41:07

I'm working on a text mining solution with SQL and R.

First I import data into R from my SQL selection, and then I do data mining on it.

Here is what I've got:

rawData = sqlQuery(dwhConnect,sqlString) 
a = data.frame(rawData$ENNOTE_NEU)

If I run

a[[1]][1:3]

you see the structure:

[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help

Now I want to do some data cleaning with my own dictionary.
For example, I want to replace li with lorem ipsum, and both kd and kdin with kunde.

My problem is how to do that for the whole data frame.

for (i in 1:nrow(a)) {
    a[[1]][i] <- gsub(" kd | kdin ", " kunde ", a[[1]][i])
    a[[1]][i] <- gsub(" li ", " lorem ipsum ", a[[1]][i])
    ...
}

This works, but it is slow for a lot of data.

Is there a better way to do this?

Cheers, The Captain


情痴 2024-12-03 17:41:07

gsub is vectorised, so you don't need the loop.

a[[1]] <- gsub( " kd | kdin " , " kunde " , a[[1]])

is quicker.
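As a rough sketch of how this scales to several dictionary entries, the loop can run over the patterns instead of over the rows, so each gsub call handles the whole column at once. The column name note and the named vector dict are made up for illustration; the patterns keep the surrounding spaces from the question.

```r
# Sample data standing in for the column read from SQL (hypothetical name "note")
a <- data.frame(
  note = c("lorem ipsum li ld ee wö wo di dd",
           "la kdin di da dogs chicken",
           "kd good i need some help"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: names are regex patterns, values are replacements
dict <- c(" kd | kdin " = " kunde ",
          " li "        = " lorem ipsum ")

# One vectorised gsub call per dictionary entry, applied to the whole column
for (pat in names(dict)) {
  a[[1]] <- gsub(pat, dict[[pat]], a[[1]])
}

a[[1]]
```

With these patterns the kd at the start of the third string is left alone, because there is no space in front of it.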

Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.
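One possible way around that, sketched here on the sample strings from the question, is to use the word-boundary anchor \b instead of literal spaces. This assumes that words are delimited only by spaces or string ends; \b also treats punctuation such as hyphens as a boundary, so check it against your real data.

```r
a1 <- c("lorem ipsum li ld ee wö wo di dd",
        "la kdin di da dogs chicken",
        "kd good i need some help")

# \\b matches at word boundaries, including the start and end of a string
cleaned <- gsub("\\b(kd|kdin)\\b", "kunde", a1)
cleaned <- gsub("\\bli\\b", "lorem ipsum", cleaned)

cleaned
```

Here the kd at the start of the third string is replaced as well, which the space-delimited pattern would miss.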


幸福丶如此 2024-12-03 17:41:07

Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to search for, because the text is split into words only once and every replacement after that is a plain vector comparison.

a1 <- c("lorem ipsum li ld ee wö wo di dd",
        "la kdin di da dogs chicken",
        "kd good i need some help")
# Split each string into words; fixed = TRUE avoids regexes, which would be slower
x <- strsplit(a1, " ", fixed = TRUE)

# Replace every element of vec that appears in word.in with word.out
replfxn <- function(vec, word.in, word.out) {
  vec[vec %in% word.in] <- word.out
  vec
}

word.in <- "kdin"
word.out <- "kunde"

replfxn(x[[2]], word.in, word.out)

lapply(x, replfxn, word.in = word.in, word.out = word.out)
[[1]]
[1] "lorem" "ipsum" "li"    "ld"    "ee"    "wö"    "wo"    "di"    "dd"   

[[2]]
[1] "la"      "kunde"   "di"      "da"      "dogs"    "chicken"

[[3]]
[1] "kd"   "good" "i"    "need" "some" "help"

For a large number of words to search over, I'd guess this is faster than regexes. It's also more amenable to data-code separation, since it lends itself to writing a merge or similar function to read in the dictionary from a file rather than embedding it in code.
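A minimal sketch of that dictionary idea, using match() as the "merge or similar" step. The data frame dict and the helper replace_from_dict are hypothetical names; the dictionary is built inline here, but it could just as well be read from a file.

```r
a1 <- c("lorem ipsum li ld ee wö wo di dd",
        "la kdin di da dogs chicken",
        "kd good i need some help")
x <- strsplit(a1, " ", fixed = TRUE)

# Hypothetical dictionary table; in practice it might be loaded from a file
dict <- data.frame(
  word.in  = c("li", "kd", "kdin"),
  word.out = c("lorem ipsum", "kunde", "kunde"),
  stringsAsFactors = FALSE
)

# Look up every word in the dictionary and swap in the replacement where found
replace_from_dict <- function(vec, dict) {
  idx <- match(vec, dict$word.in)  # NA where the word is not in the dictionary
  hit <- !is.na(idx)
  vec[hit] <- dict$word.out[idx[hit]]
  vec
}

x_clean <- lapply(x, replace_from_dict, dict = dict)
x_clean[[3]]
```

All dictionary entries are handled in a single pass per string, so adding words to the dictionary does not add more passes over the text.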

If you really need it back in the original format (as a space-separated character vector), you can apply a paste to the result.
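For instance, assuming the replaced words are held in a list of character vectors like the lapply result above, paste with collapse joins each element back into one string:

```r
# Example list of already-replaced word vectors (stand-in for the lapply result)
x_replaced <- list(
  c("la", "kunde", "di", "da", "dogs", "chicken"),
  c("kunde", "good", "i", "need", "some", "help")
)

# Join the words of each element with single spaces; vapply guarantees one string each
a1_clean <- vapply(x_replaced, paste, character(1), collapse = " ")
a1_clean
```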

And here are timing results. I stand corrected: looks like gsub is faster!

library(microbenchmark)
microbenchmark(
  gsub( word.in , word.out , a1) ,
  lapply(x,replfxn,word.in=word.in,word.out=word.out) ,
  times = 1000
  )

                                                        expr    min     lq median       uq    max
1                                gsub(word.in, word.out, a1)  42772  44484  47905  48761.0 691193
2 lapply(x, replfxn, word.in = word.in, word.out = word.out) 102653 106075 109496 111635.5 970065
