In R, how do I replace a string that contains a certain pattern with another string?
I'm working on a project that involves cleaning a list of data on college majors. Many of them are misspelled, so I was looking to use gsub() to replace the misspelled entries with the correct spelling. For example, say 'biolgy' is misspelled in a list of majors called Major. How can I get R to detect the misspelling and replace it with the correct spelling? I've tried gsub('biol', 'Biology', Major), but that only replaces the first four letters of 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but it doesn't detect other misspellings of 'biology'.
Thank you!
Comments (5)
You should either define some nifty regular expression or use agrep from the base package. The stringr package is another option; I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me. Anyway, agrep should do the trick.
EDIT: You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
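The code that originally accompanied this answer is not preserved here, so the following is only a minimal sketch of how agrep with ignore.case = TRUE might be applied. The contents of Major and the max.distance value are assumptions chosen for illustration.

```r
# Hypothetical sample data; the real Major vector is not shown in the thread.
Major <- c("Biology", "biolgy", "Bioligy", "Chemistry", "biology ", "History")

# agrep() returns the indices of the elements that approximately match the
# pattern. max.distance = 2 (an assumed tolerance) allows up to two edits.
hits <- agrep("Biology", Major, max.distance = 2, ignore.case = TRUE)

# The "bookkeeping by hand": look at what matched before overwriting it.
print(Major[hits])

# Overwrite the matched entries on a copy, keeping the original intact.
cleaned <- Major
cleaned[hits] <- "Biology"
print(cleaned)
```

Note that agrep matches the pattern against any part of each element, so a loose max.distance can pull in unrelated entries; checking Major[hits] before replacing is the safer habit.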
You can set up a vector of all the possible misspellings and then loop over it, calling gsub once per misspelling.
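The answer's own snippet is missing from this copy, so here is a hedged sketch of what such a loop could look like. The misspellings vector and the sample Major data are made up for illustration.

```r
# Hypothetical data: neither vector comes from the original thread.
Major <- c("biolgy", "Bioligy major", "Chemistry", "boilogy")
misspellings <- c("biolgy", "bioligy", "boilogy")

# Apply one gsub() call per known misspelling, accumulating the result.
cleaned <- Major
for (wrong in misspellings) {
  cleaned <- gsub(wrong, "Biology", cleaned, ignore.case = TRUE)
}
print(cleaned)
```

This only fixes the misspellings you have listed, which is exactly the limitation raised in the question; it is simple and predictable, though, when the set of errors is small.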
If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching...
The Wikipedia page on approximate string matching might be useful, and try searching R-help for some of the key terms.
http://en.wikipedia.org/wiki/Approximate_string_matching
You could first match the majors against a list of available majors; any that do not match are the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if an entry is similar to a correct value you will get a match).
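The two steps described above could be sketched roughly as follows. The known vector, the sample data, and the max.distance = 1 tolerance are all assumptions, not part of the original answer.

```r
# Hypothetical reference list and data.
known <- c("Biology", "Chemistry", "History")
Major <- c("Biology", "biolgy", "Chemistry", "Chemstry", "Histry", "Art")

# Step 1: entries that are not exact matches are the likely misspellings.
suspect <- which(!(Major %in% known))

# Step 2: for each suspect, find the known majors that approximately match it.
cleaned <- Major
for (i in suspect) {
  is_near <- vapply(
    known,
    function(k) length(agrep(k, Major[i], max.distance = 1,
                             ignore.case = TRUE)) > 0,
    logical(1),
    USE.NAMES = FALSE
  )
  near <- known[is_near]
  # Replace only when the approximate match is unambiguous.
  if (length(near) == 1) cleaned[i] <- near
}
print(cleaned)
```

Entries with no approximate match, or with more than one, are left untouched here so they can be reviewed manually.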
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
So your best bet might be to pick, from the possible subject strings, the one with the minimum Levenshtein distance to the misspelled entry. If you get identical minima, flip a coin.
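The original vwr-based code is not preserved in this copy. As a stand-in, the sketch below uses adist() from base R's utils package, which also computes Levenshtein-style edit distances; it is not the answer author's code. The helper name closest_major and the sample data are hypothetical.

```r
# Hypothetical list of valid subject strings.
known <- c("Biology", "Chemistry", "History")

# Return the candidate with the smallest edit distance to x.
# Ties are broken at random (the "coin flip").
closest_major <- function(x, candidates) {
  d <- drop(adist(x, candidates, ignore.case = TRUE))
  best <- which(d == min(d))
  if (length(best) > 1) best <- sample(best, 1)
  candidates[best]
}

misspelled <- c("biolgy", "Chemstry")
fixed <- vapply(misspelled, closest_major, character(1),
                candidates = known, USE.NAMES = FALSE)
print(fixed)
```

As written, the helper always returns some candidate, even for an unrelated string, so in practice a maximum acceptable distance would probably be needed as well.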
Example 1a) Perl/Linux regex:
's/oldstring/newstring/'
Example 1b) R equivalent of 1a:
srcstring <- sub(oldstring, newstring, srcstring)
Example 2a) Perl/Linux regex:
's/oldstring//'
Example 2b) R equivalent of 2a:
srcstring <- sub(oldstring, "", srcstring)
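Since the question uses gsub rather than sub, a short sketch of the difference may help: sub replaces only the first match in each string (like s/.../.../), while gsub replaces every match (like s/.../.../g). The sample string below is made up.

```r
# Hypothetical input containing the same misspelling twice.
srcstring <- "biolgy and biolgy"

# First match only, comparable to s/biolgy/Biology/
first_only <- sub("biolgy", "Biology", srcstring)

# Every match, comparable to s/biolgy/Biology/g
all_matches <- gsub("biolgy", "Biology", srcstring)

print(first_only)
print(all_matches)
```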