Grouping to pull common values in semi-structured data

Posted 2024-08-12 16:11:19

I've got a somewhat ugly field in a database which holds the names of locations. For instance, Madison Square Gardens has also been entered as "The Madison Square Gardens", and so on.

I'm trying to extract the data so that I can get an accurate list of all the locations.
To accomplish this, I've created a SQL query that joins the events for each location, groups by the location name, and only keeps location groups with more than 10 entries (which filters out the less reliable entries). I still end up with some very different spellings and entries, resulting in duplicate properties/locations.

My SQL query looks like this:

SELECT location, COUNT(*)
FROM locations
JOIN events ON locations.lid = events.lid
WHERE `long` BETWEEN -74.419382608696 AND -73.549817391304
  AND lat BETWEEN 40.314017391304 AND 41.183582608696
GROUP BY location
HAVING COUNT(*) > 10

Running this query provides 3 different entries "Madison Square Garden", "Madison Square Gardens", "The Madison Square Garden". Of course, this is only for the Madison Square Garden entry. Most entries have multiple slightly different spellings.

I restrict my searches by lat/long so I don't get locations with the same name in different cities grouped together.

Is there a way, with regular expressions or something in the GROUP BY clause, to have these grouped consistently? Even just removing the trailing 's' and the leading 'the' before grouping would probably be a big benefit.

I was going to take each result and then do a regular expression match against all the locations within the lat/long range.

Fortunately, I have enough events linked to locations that I am somewhat able to recognize the major locations.

Any other suggestions for extracting locations from semi-structured data?
The data is scraped from a variety of sources, so I don't have control over the input.

挽容 2024-08-19 16:11:19

Here are some suggestions for you.

Create a normalized venue-name column in your database:
(1) Run each name through some simple transformations:
Turn "The Madison Square Garden" and "The Washington Monument" into "Madison Square Garden" and "Washington Monument".
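A simple way to turn plural nouns into singular ones: strip "es", and then "s", from the end of each word in the name.
Lowercase everything.
Eliminate any remaining short words: "a", "it", "the", "and", "&". You get the idea.
Sort the words alphabetically, giving "garden madison square".
Store the resulting string in a new column in your table.
Match on it, while still displaying the original strings.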
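
The transformations in step (1) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the names `normalize_venue`, `singularize`, and `STOP_WORDS` are made up here, punctuation (including "&") is dropped by the tokenizer, and the plural stripping is deliberately as crude as the suggestion describes.

```python
import re

# Assumed stop-word list taken from the examples above; extend it for your data.
STOP_WORDS = {"a", "it", "the", "and"}

def singularize(word):
    """Naive plural stripping: drop a trailing "es", otherwise a trailing "s"."""
    if word.endswith("es"):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

def normalize_venue(name):
    """Build a grouping key: lowercase, singularize, drop stop words, sort words."""
    # Lowercasing first so the suffix checks also work on upper-case input.
    words = re.findall(r"[a-z0-9]+", name.lower())
    words = [singularize(w) for w in words]
    words = [w for w in words if w and w not in STOP_WORDS]
    return " ".join(sorted(words))

spellings = [
    "Madison Square Garden",
    "Madison Square Gardens",
    "The Madison Square Garden",
]
# All three spellings collapse to the single key "garden madison square".
keys = {normalize_venue(s) for s in spellings}
```

The key would be stored in its own column and used in GROUP BY in place of the raw location, while the original spelling stays available for display. Expect some damage from the crude suffix rule (for example, "theatres" and "theatre" end up with different keys), so the output is worth spot-checking.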

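(2) Create a lookup table containing the different spellings of each venue. This works well for venue names like "Boston Garden" / "Fleet Center" / "TD Banknorth Garden" / "North Station" and similar junk names: same place, different names. (Compare "Penn Station".)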
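
A lookup table like the one in (2) might look like the sketch below, using Python's built-in `sqlite3` module so it runs without a server. The table and column names (`venue_alias`, `alias`, `canonical`), the sample rows, and the choice of which spelling counts as canonical are all assumptions for illustration; the same LEFT JOIN idea should carry over to other SQL databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE venue_alias (
        alias     TEXT PRIMARY KEY,  -- spelling as it appears in the scraped data
        canonical TEXT NOT NULL      -- the single name to report (arbitrary choice here)
    );
    CREATE TABLE events (
        eid      INTEGER PRIMARY KEY,
        location TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO venue_alias VALUES (?, ?)", [
    ("Boston Garden", "TD Banknorth Garden"),
    ("Fleet Center", "TD Banknorth Garden"),
    ("TD Banknorth Garden", "TD Banknorth Garden"),
    ("North Station", "TD Banknorth Garden"),
])
conn.executemany("INSERT INTO events (location) VALUES (?)", [
    ("Boston Garden",), ("Fleet Center",), ("TD Banknorth Garden",),
    ("Fleet Center",), ("North Station",), ("Some Unknown Hall",),
])

def counts_by_canonical(connection):
    """Count events per venue, folding known aliases into their canonical name."""
    return connection.execute("""
        SELECT COALESCE(a.canonical, e.location) AS venue, COUNT(*) AS n
        FROM events AS e
        LEFT JOIN venue_alias AS a ON a.alias = e.location
        GROUP BY COALESCE(a.canonical, e.location)
        ORDER BY n DESC, venue
    """).fetchall()
```

Locations without an alias row fall through unchanged thanks to `COALESCE`, so the table can be filled in gradually, starting with the venues that have the most events.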

(3) You can use the Yahoo or Google Maps geocoding services, which will normalize incomplete names and addresses.

Soundex will give you quite a few false-positive matches. It's designed as a fallback that needs a human to disambiguate the results.
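
To see why Soundex over-matches, here is a compact Python sketch of the common American Soundex rules (first letter plus three digits). It is written purely for illustration; the function name `soundex` is chosen here, and a database's built-in Soundex function may differ in details.

```python
# Map consonants to Soundex digits; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Pad with zeros and truncate to one letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]

# Different words, same code: both "Garden" and "Gordon" map to G635.
same_code = soundex("Garden") == soundex("Gordon")
```

Because the code ignores vowels and keeps only three consonant classes, unrelated names collide easily, which is why the matches need a manual review pass.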