Replacing words in R data.frames (text mining)
I'm working on a text mining solution with SQL and R.
First I import data into R from my SQL selection, and then I do data mining on it.
Here is what I have:
rawData = sqlQuery(dwhConnect,sqlString)
a = data.frame(rawData$ENNOTE_NEU)
If I run
a[[1]][1:3]
you can see the structure:
[1] lorem ipsum li ld ee wö wo di dd
[2] la kdin di da dogs chicken
[3] kd good i need some help
Now I want to do some data cleaning with my own dictionary.
An example would be to replace li with lorem ipsum, and both kd and kdin with kunde.
My problem is how to do that for the whole data frame.
for (i in seq_len(nrow(a))) {
  a[[1]][i] <- gsub(" kd | kdin ", " kunde ", a[[1]][i])
  a[[1]][i] <- gsub(" li ", " lorem ipsum ", a[[1]][i])
  ...
}
This works, but it is slow for a lot of data.
Is there a better way to do that?
Cheers, The Captain
gsub is vectorised, so you don't need the loop; calling it once on the whole column is quicker.
Also, are you sure you want spaces inside your regexes? That way you won't match words at the start or end of lines.
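A minimal sketch of that advice, assuming the notes sit in a character column: the column name note and the toy rows are illustrative stand-ins for the data in the question, and \\b (word boundary) is one possible replacement for the literal spaces, not necessarily what the answerer had in mind.

```r
# Toy data mirroring the three rows shown in the question.
# "\u00f6" is the letter o-umlaut, written as an escape so the script
# does not depend on the file encoding.
a <- data.frame(
  note = c("lorem ipsum li ld ee w\u00f6 wo di dd",
           "la kdin di da dogs chicken",
           "kd good i need some help"),
  stringsAsFactors = FALSE
)

# One gsub call per dictionary entry, applied to the whole column at once.
# \\b matches a word boundary, so words at the start or end of a line
# are replaced as well, which the space-padded patterns would miss.
a$note <- gsub("\\b(kd|kdin)\\b", "kunde", a$note)
a$note <- gsub("\\bli\\b", "lorem ipsum", a$note)

print(a$note)
```

Because the surrounding spaces are no longer part of the match, the replacement strings do not need padding either.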
Alternative approach: avoid regexes altogether. This works best when you have a lot of different words to search for, because after the first pass over the text you avoid string manipulation entirely.
For a large number of words to search for, I'd guess this is faster than regexes. It's also more amenable to separating data from code, since it lends itself to writing a merge or similar function that reads the dictionary from a file rather than embedding it in code.
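The answer's original code did not survive on this page, so the following is only one way such a regex-free approach could look: split each note into words once, then replace words by exact lookup in a named vector. The names notes, dict and replace_tokens are hypothetical, and the dictionary is hard-coded here for brevity although it could equally be read from a file.

```r
# Toy data mirroring the rows in the question ("\u00f6" is o-umlaut).
notes <- c("lorem ipsum li ld ee w\u00f6 wo di dd",
           "la kdin di da dogs chicken",
           "kd good i need some help")

# Hypothetical dictionary: names are the words to replace,
# values are their replacements.
dict <- c(li = "lorem ipsum", kd = "kunde", kdin = "kunde")

# Split every note into words once; fixed = TRUE avoids the regex engine.
tokens <- strsplit(notes, " ", fixed = TRUE)

# Replace the words that appear in the dictionary by exact name lookup.
replace_tokens <- function(words, dict) {
  hit <- words %in% names(dict)
  words[hit] <- unname(dict[words[hit]])
  words
}

# Result: a list with one character vector of words per note.
cleaned <- lapply(tokens, replace_tokens, dict = dict)
print(cleaned[[2]])
```

Since matching is by whole word, short entries such as kd cannot accidentally match inside longer words, and adding entries to dict does not add passes over the text.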
If you really need it back in the original format (as a space-separated character vector), you can apply paste to the result.
As for timing, I stand corrected: it looks like gsub is faster!
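As a rough illustration of both points, assuming the word-level result is a list of character vectors (the variable names and toy data below are made up, and no timing figures from the original answer are reproduced), the strings can be rebuilt with paste and a gsub run can be measured with system.time:

```r
# Word vectors as they might look after a dictionary lookup (toy data).
cleaned <- list(c("la", "kunde", "di", "da", "dogs", "chicken"),
                c("kunde", "good", "i", "need", "some", "help"))

# Collapse each word vector back into one space-separated string.
restored <- vapply(cleaned, paste, character(1), collapse = " ")
print(restored)

# Timing sketch: repeat a sample note many times and time the vectorised
# gsub. The same wrapper can be put around a word-lookup implementation
# to compare the two on your own data.
big <- rep("la kdin di da dogs chicken", 1e4)
timing <- system.time(
  out <- gsub("\\b(kd|kdin)\\b", "kunde", big)
)
print(timing["elapsed"])
```

Elapsed times depend heavily on the machine, the data size and the number of dictionary entries, so it is worth measuring with a realistic sample before choosing an approach.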