Grouping to extract common values from semi-structured data
I've got a 'somewhat' ugly field in a database which holds the names of locations. For instance, "Madison Square Gardens" has also been entered as "The Madison Square Gardens", and so on.
I'm trying to extract the data so that I can get an accurate list of all the locations.
To accomplish this, I've created a SQL query that joins the events to each location, groups by the location name, and keeps only the location groups having more than 10 entries (that filters out the somewhat unreliable entries). I still end up with some very different spellings and entries, resulting in duplicate properties/locations.
My SQL query looks like this:
SELECT location, COUNT(*) FROM locations JOIN events ON locations.lid = events.lid WHERE `long` BETWEEN -74.419382608696 AND -73.549817391304 AND lat BETWEEN 40.314017391304 AND 41.183582608696 GROUP BY location HAVING COUNT(*) > 10
Running this query returns three different entries: "Madison Square Garden", "Madison Square Gardens", and "The Madison Square Garden". Of course, this is only the Madison Square Garden case. Most entries have multiple slightly different spellings.
I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.
Is there a way, with regular expressions or something in the GROUP BY clause, to have these grouped consistently? Even just removing the trailing 's' and the leading 'the' before grouping would probably be a big benefit.
Alternatively, I was going to take each result and then do a regular expression match against all the locations within the lat/long range.
Fortunately, I have enough events linked to locations that I am somewhat able to recognize the major locations.
Any other suggestions for extracting locations from semi-structured data?
The data is scraped from a variety of sources, so I don't have control over the input.
Comments (2)
Here are some suggestions for you.
Create a normalized venue-name column in your database:
(1) Run each name through some simple transformations ...
Strip the leading article: turn "The Madison Square Garden" and "The Washington Monument" into "Madison Square Garden" and "Washington Monument".
Turn plural nouns into singular the easy way ... strip "es", then "s", from the end of each word in your name.
Downcase everything.
Eliminate any remaining short words such as "a", "it", "the", "and", "&"; you get the idea.
Sort your words into alphabetical order, getting you "garden madison square".
Store that resulting string into a new column in your table.
Match on it, while still displaying your original string.
(2) Create a lookup table with variant spellings of venues. This works well for venue names like "Boston Garden" / "Fleet Center" / "TD Banknorth Garden" / "North Station" and junk like that. Same place, different spelling. ("Penn Station" for your example).
(3) You could use the Yahoo or Google Maps geocoding services, which will take incomplete names and addresses and standardize them.
Soundex is going to get you quite a few false positive matches. It's designed as a fallback and requires human disambiguation.
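The normalization steps in (1) and the lookup table in (2) could be sketched roughly as follows. This is only an illustration in Python, not something from the original answer: the names `normalize_venue`, `venue_key`, `group_venues`, `STOP_WORDS` and `ALIASES` are made up for the example, the stop-word list is a guess, and the sketch downcases first to keep the later steps simple. In practice you would store the resulting key in the new column and group on that column.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; extend it to suit your data.
STOP_WORDS = {"a", "an", "it", "the", "and", "of"}


def singularize(word):
    """Crude plural stripping: drop a trailing 'es', otherwise a trailing 's'."""
    if len(word) > 3 and word.endswith("es"):
        return word[:-2]
    if len(word) > 3 and word.endswith("s"):
        return word[:-1]
    return word


def normalize_venue(name):
    """Build a matching key: lowercase, drop stop words, singularize, sort."""
    # Keeping only letters and digits also discards '&' and other punctuation.
    words = re.findall(r"[a-z0-9]+", name.lower())
    words = [singularize(w) for w in words if w not in STOP_WORDS]
    return " ".join(sorted(words))


# Hypothetical lookup table for names that no text transformation can match.
ALIASES = {
    normalize_venue("Fleet Center"): normalize_venue("Boston Garden"),
    normalize_venue("TD Banknorth Garden"): normalize_venue("Boston Garden"),
}


def venue_key(name):
    """Normalized key, redirected through the alias table when one exists."""
    key = normalize_venue(name)
    return ALIASES.get(key, key)


def group_venues(rows):
    """Merge (location, count) rows that share a key.

    Returns {key: (most frequent original spelling, total count)}.
    """
    groups = defaultdict(list)
    for location, count in rows:
        groups[venue_key(location)].append((location, count))
    merged = {}
    for key, variants in groups.items():
        display = max(variants, key=lambda v: v[1])[0]
        merged[key] = (display, sum(c for _, c in variants))
    return merged


print(normalize_venue("The Madison Square Gardens"))  # garden madison square
```

The key is deliberately lossy (for example, "Times" becomes "tim"), which is fine as long as it is used only for matching and the original string is what gets displayed.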
If your issue is treating "similar" strings the same, you may want to check out the SOUNDEX algorithm. I'm not sure if it will work for all of your different scenarios, but it's a start.
It's discussed in this thread:
How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?
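To get a feel for what Soundex does with these names, here is a small sketch of the classic four-character American Soundex in Python. It is an illustration only: the function name `soundex` is just a label for this example, and MySQL's built-in SOUNDEX() can return longer strings, so its output will not always match this version.

```python
# Consonant groups used by classic American Soundex.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word):
    """Return the four-character Soundex code of a word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # 'h' and 'w' do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]


for name in ("Garden", "Gardens", "Gordon"):
    print(name, soundex(name))
```

All three names come out as G635: the trailing "s" is absorbed, which is what the question wants, but "Gordon" collides as well, which is the kind of false positive that needs a human to sort out.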