使用 R 中的模糊/近似字符串匹配合并两个数据框
描述
我有两个数据集,其中包含需要合并的信息。我拥有的唯一公共字段是不完全匹配的字符串和可能有很大不同的数字字段
解释问题的唯一方法是向您显示数据。这是 a.csv 和 b.csv。我正在尝试将 B 合并到 A。B
中有三个字段,A 中有四个字段。公司名称(仅限文件 A)、基金名称、资产类别和资产。到目前为止,我的重点是尝试通过替换单词或部分字符串来创建精确匹配,然后使用:来匹配基金名称,
a <- read.table(file = "http://bertelsen.ca/R/a.csv",header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
b <- read.table(file = "http://bertelsen.ca/R/b.csv",header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
merge(a,b, by="Fund.Name")
但是,这只能使我达到大约 30% 的匹配率。剩下的事情我必须手工完成。
资产是一个数字字段,两者并不总是正确的,如果基金资产较低,则可能会有很大差异。资产类别是一个字符串字段,在两个文件中“通常”相同,但存在差异。
文件 B 中不同系列的基金让情况变得更加复杂。例如:
AGF 加拿大价值
AGF 加拿大价值-D
在这些情况下,我必须选择不连续的那一项,或者选择称为“A”、“-A”或“Advisor”的那一项作为比赛。
问题
您认为最好的方法是什么?这项练习是我每月必须做的事情,手动匹配它们非常耗时。代码示例将很有帮助。
IDEAS
我认为可能有效的一种方法是根据字符串中每个单词的第一个大写字母规范字符串。但我一直无法弄清楚如何使用 R 来实现这一点。
我考虑的另一种方法是根据资产、基金名称、资产类别和公司的组合创建匹配索引。但同样,我不确定如何使用 R 来做到这一点。或者,就此而言,如果可能的话。
非常感谢代码、评论、想法和方向的示例!
DESCRIPTION
I have two datasets with information that I need to merge. The only common fields that I have are strings that do not perfectly match and a numerical field that can be substantially different
The only way to explain the problem is to show you the data. Here is a.csv and b.csv. I am trying to merge B to A.
There are three fields in B and four in A. Company Name (File A Only), Fund Name, Asset Class, and Assets. So far, my focus has been on attempting to match the Fund Names by replacing words or parts of the strings to create exact matches and then using:
a <- read.table(file = "http://bertelsen.ca/R/a.csv",header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
b <- read.table(file = "http://bertelsen.ca/R/b.csv",header=TRUE, sep=",", na.strings=F, strip.white=T, blank.lines.skip=F, stringsAsFactors=T)
merge(a,b, by="Fund.Name")
However, this only brings me to about 30% matching. The rest I have to do by hand.
Assets is a numerical field that is not always correct in either and can vary wildly if the fund has low assets. Asset Class is a string field that is "generally" the same in both files, however, there are discrepancies.
Adding to the complication are the different series of funds, in File B. For example:
AGF Canadian Value
AGF Canadian Value-D
In these cases, I have to choose the one that is not seried, or choose the one that is called "A", "-A", or "Advisor" as the match.
QUESTION
What would you say is the best approach? This excercise is something that I have to do on a monthly basis and matching them manually is incredibly time consuming. Examples of code would be instrumental.
IDEAS
One method that I think may work is normalizing the strings based on the first capitalized letter of each word in the string. But I haven't been able to figure out how to pull that off using R.
Another method I considered was creating an index of matches based on a combination of assets, fund name, asset class and company. But again, I'm not sure how to do this with R. Or, for that matter, if it's even possible.
Examples of code, comments, thoughts and direction are greatly appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
强烈建议使用 dgrtwo/fuzzyjoin 包。
<代码>
stringdist_inner_join(a,b, by="基金名称")
It's highly recommended to use the dgrtwo/fuzzyjoin package.
stringdist_inner_join(a,b, by="Fund.Name")
一个快速建议:在使用合并之前尝试分别对不同字段进行一些匹配。最简单的方法是使用
pmatch
函数,尽管 R 并不缺少文本匹配函数(例如agrep
)。这是一个简单的示例:对于您的数据集,这会匹配
a
中的所有基金名称:创建匹配项后,您可以使用这些匹配项轻松地将它们合并在一起。
One quick suggestion: try to do some matching on the different fields separately before using merge. The simplest approach is with the
pmatch
function, although R has no shortage of text matching functions (e.g.agrep
). Here's a simple example:For your dataset, this matches all the fund names out of
a
:Once you create matches, you can easily merge them together using those instead.
近似字符串匹配不是一个好主意,因为不正确的匹配会使整个分析无效。如果每个来源的名称每次都相同,那么构建索引对我来说似乎也是最好的选择。这在 R 中很容易完成:
假设您有数据:
一次为每个源构建一个名称索引,也许使用 pmatch 等作为起点,然后手动验证。
然后对于每次运行合并使用:
这将为我们提供:
Approximate string matching is not a good idea since an incorrect match would invalidate the whole analysis. If the names from each source is the same each time, then building indexes seems the best option to me too. This is easily done in R:
Suppose you have the data:
Build an index of names for each source one time, perhaps using pmatch etc. as a starting point and then validating manually.
Then for each run merge using:
Which would give us:
我也是加拿大本地人,认识基金名称。
这是一项困难的工作,因为每个数据提供者都会为各个基金名称选择自己的形式。有些使用不同的结构,例如全部结束于基金或类别,其他则遍布各处。每个人似乎也选择了自己的短片,并且这些短片会定期变化。
这就是为什么像您这样的许多人定期手动执行此操作。一些咨询公司确实列出了链接各种来源的索引,不确定您是否探索过这条路线?
正如 Shane 和 Marek 指出的那样,这是一个匹配任务,而不是直接连接。许多公司都在这一问题上苦苦挣扎。我正在做这件事……
杰伊
I'm a Canada local as well, recognize the fund names.
This is a difficult one as each of the data providers picks their own form for the individual fund names. Some use different structure like all end in either Fund or Class others are all over the place. Each seems to choose their own short-forms as well and these change regularly.
That's why so many people like you are doing this by hand on a regular basis. Some of the consulting firms do list indexes to link various sources, not sure if you've explored that route?
As Shane and Marek pointed out this is a matching task more than a straight join. Many companies are struggling with this one. I'm in the middle of my work on this...
Jay