In R, how do I replace a string that contains a certain pattern with another string?

Posted 2024-10-21 22:03:23

I'm working on a project that involves cleaning a list of college majors. Many entries are misspelled, so I was hoping to use gsub() to replace each misspelling with the correct spelling. For example, say 'biolgy' appears in a list of majors called Major. How can I get R to detect the misspelling and replace it with the correct spelling? I've tried gsub('biol', 'Biology', Major), but that only replaces the first four letters of 'biolgy'. gsub('biolgy', 'Biology', Major) works for that one case, but it doesn't catch other misspellings of 'biology'.

Thank you!

Comments (5)

儭儭莪哋寶赑 2024-10-28 22:03:23

You should either define a suitable regular expression or use agrep from the base package. The stringr package is another option; I know people use it, but I'm a big fan of plain regular expressions, so it's a no-no for me.

Anyway, agrep should do the trick:

agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1

EDIT:

You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
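
As a rough sketch of how the agrep call above could drive the actual replacement: the pattern and the data can be swapped, so the correct spelling is the pattern and the column of majors is searched. The sample values in `Major` and the choice of `max.distance = 1` are assumptions for illustration, not part of the original answer.

```r
# Hypothetical sample data; `Major` is the vector name used in the question.
Major <- c("biolgy", "Biology", "BIOLOGY", "physics", "bioloy")

# agrep() returns the indices of the elements that approximately match the
# pattern. max.distance = 1 (an assumed tolerance) allows one insertion,
# deletion, or substitution; ignore.case = TRUE ignores capitalization.
hits <- agrep("biology", Major, max.distance = 1, ignore.case = TRUE)

# Overwrite the whole element instead of a substring, which is what
# gsub('biol', ...) could not do.
Major[hits] <- "Biology"
print(Major)
```

Note that agrep matches approximate substrings, so a short pattern or a large max.distance can hit unrelated entries; this is the kind of bookkeeping that still has to be checked by hand.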

眼前雾蒙蒙 2024-10-28 22:03:23

You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:

biologySp = c("biolgy","biologee","bologee","bugs")

for(sp in biologySp){
  Major = gsub(sp,"Biology",Major)
}
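
One thing to watch with the loop above: gsub() rewrites matching substrings, so an entry that merely contains a listed misspelling is changed as well. A minimal sketch of the difference, using made-up sample data (the entry "debugs" is only there to show the effect) and whole-element comparison with `%in%` as one possible alternative:

```r
# Hypothetical sample data; "debugs" is included to show the substring pitfall.
Major <- c("biolgy", "bugs", "debugs", "physics")
biologySp <- c("biolgy", "biologee", "bologee", "bugs")

# gsub() replaces matching substrings, so "debugs" is altered too.
looped <- Major
for (sp in biologySp) {
  looped <- gsub(sp, "Biology", looped)
}
print(looped)

# %in% compares whole elements, so only exact misspellings are replaced.
exact <- Major
exact[exact %in% biologySp] <- "Biology"
print(exact)
```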

If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching.

The Wikipedia page on approximate string matching might be useful, and try searching R-help for some of the key terms.

http://en.wikipedia.org/wiki/Approximate_string_matching

紫﹏色ふ单纯 2024-10-28 22:03:23

You could first match the majors against a list of available majors; any that don't match are the likely misspellings. Then use the agrep function to match those against the known majors again (agrep does approximate matching, so a value that is similar to a correct one will produce a match).
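
A minimal sketch of that two-step idea, assuming a hypothetical reference vector `known` and sample data in `Major`; the tolerance `max.distance = 1` is also an assumption and would need tuning on real data:

```r
# Hypothetical reference list and data, for illustration only.
known <- c("Biology", "Physics", "Geography")
Major <- c("Biology", "biolgy", "Physcs", "Geography", "Art History")

# Step 1: entries that are not exact matches are the likely misspellings.
unmatched <- which(!(Major %in% known))

# Step 2: approximate-match each of those against the known majors and
# replace it only when exactly one known major is close enough.
for (i in unmatched) {
  hit <- agrep(Major[i], known, max.distance = 1, ignore.case = TRUE)
  if (length(hit) == 1) {
    Major[i] <- known[hit]
  }
}
print(Major)
```

Entries with no close match, or with several, are left unchanged so they can be reviewed manually.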

归属感 2024-10-28 22:03:23

The vwr package has methods for string matching:

http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html

so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:

> levenshtein.distance("physcs",c("biology","physics","geography"))
  biology   physics geography 
        7         1         9 

If you get identical minima then flip a coin:

> levenshtein.distance("biolsics",c("biology","physics","geography"))
  biology   physics geography 
        4         4         8 
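
If installing vwr is not an option, the same minimum-distance idea can be sketched with `adist()` from the utils package that ships with R, which computes a generalized Levenshtein distance. This swaps in a different function from the one used in the answer, and the helper name `closest` is made up for illustration:

```r
# Candidate subjects taken from the example above.
subjects <- c("biology", "physics", "geography")

# Hypothetical helper: return every candidate at the minimum edit distance,
# so ties are visible instead of being resolved silently.
closest <- function(x, choices) {
  d <- drop(utils::adist(x, choices))
  choices[d == min(d)]
}

print(closest("physcs", subjects))
print(closest("biolsics", subjects))
```

Returning all tied candidates leaves the coin flip, or a manual check, to the caller.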

丶视觉 2024-10-28 22:03:23

example 1a) perl/linux regex: 's/oldstring/newstring/'

example 1b) R equivalent of 1a: srcstring=sub(oldstring, newstring, srcstring)

example 2a) perl/linux regex: 's/oldstring//'

example 2b) R equivalent of 2a: srcstring=sub(oldstring, "", srcstring)
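
A short sketch of the equivalents above with made-up values, plus gsub() for comparison: sub() replaces only the first match, like s/// without the g flag, while gsub() replaces every match, like s///g.

```r
# Hypothetical input string for illustration.
srcstring <- "old and old"

first_only <- sub("old", "new", srcstring)   # like s/old/new/
all_hits   <- gsub("old", "new", srcstring)  # like s/old/new/g
removed    <- sub("old", "", srcstring)      # like s/old//

print(c(first_only, all_hits, removed))
```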
