Eliminating duplicate cities from a database
Background
Over 5300 duplicate rows:
"id","latitude","longitude","country","region","city"
"2143220","41.3513889","68.9444444","KZ","10","Abay"
"2143218","40.8991667","68.5433333","KZ","10","Abay"
"1919381","33.8166667","49.6333333","IR","34","Ab Barik"
"1919377","35.6833333","50.1833333","IR","19","Ab Barik"
"1919432","29.55","55.5122222","IR","29","`Abbasabad"
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919413","28.0011111","58.9005556","IR","12","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
"1919433","31.8988889","58.9211111","IR","30","`Abbasabad"
"1919422","33.8666667","48.3","IR","23","`Abbasabad"
"1919420","33.4658333","49.6219444","IR","23","`Abbasabad"
"1919438","33.5333333","49.9833333","IR","34","`Abbasabad"
"1919423","33.7619444","49.0747222","IR","24","`Abbasabad"
"1919419","34.2833333","49.2333333","IR","19","`Abbasabad"
"1919439","35.8833333","52.15","IR","35","`Abbasabad"
"1919417","35.9333333","52.95","IR","17","`Abbasabad"
"1919427","35.7341667","51.4377778","IR","26","`Abbasabad"
"1919425","35.1386111","51.6283333","IR","26","`Abbasabad"
"1919713","30.3705556","56.07","IR","29","`Abdolabad"
"1919711","27.9833333","57.7244444","IR","29","`Abdolabad"
"1919716","35.6025","59.2322222","IR","30","`Abdolabad"
"1919714","34.2197222","56.5447222","IR","30","`Abdolabad"
Additional details:
- PostgreSQL 8.4 Database
- Linux
Problem
Some values are obvious duplicates ("Abay" because the regions match and "Ab Barik" because the two locations are within such close proximity), others are not so obvious (and might not even be actual duplicates):
"1919430","27.4263889","57.5725","IR","29","`Abbasabad"
"1919435","36.5641667","61.14","IR","30","`Abbasabad"
The goal is to eliminate all duplicates.
Questions
Given a table of values such as the above CSV data:
- How would you eliminate duplicates?
- What geo-centric PostgreSQL functions would you use?
- What other criteria would you use to whittle down the duplicates?
Update
Semi-working example code to select duplicate city names within the same country that are in close proximity (within 10 km):
select
    c1.country, c1.name, c1.region_id, c2.region_id,
    c1.latitude_decimal, c1.longitude_decimal,
    c2.latitude_decimal, c2.longitude_decimal
from
    climate.maxmind_city c1,
    climate.maxmind_city c2
where
    c1.country = 'BE' and
    c1.id < c2.id and  -- list each pair once instead of twice
    c1.country = c2.country and
    c1.name = c2.name and
    (c1.latitude_decimal <> c2.latitude_decimal or
     c1.longitude_decimal <> c2.longitude_decimal) and
    -- earth_distance (from the cube/earthdistance contrib modules)
    -- returns metres, so 10 km is 10000, not 10
    earth_distance(
        ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
        ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10000
order by
    c1.country, c1.name;
Ideas
Two-phase approach:
- Eliminate the obvious duplicates (same country, region, and city name) by keeping only the row with min(id) in each group.
- Eliminate rows with the same name and country that are in close proximity to each other. This could remove some legitimate cities, but hardly any of consequence.
Thank you!
Answers (3)
Finding duplicates is simple:
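A minimal sketch, assuming the climate.maxmind_city table and column names from the question:

select country, region_id, name, count(id)
from climate.maxmind_city
group by country, region_id, name
having count(id) > 1;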
Adding code to remove duplicates based on this is simple:
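Again a sketch against the same assumed schema; every row whose id is not the minimum of its (country, region, name) group is deleted:

delete from climate.maxmind_city
where id not in (
    select min(id)
    from climate.maxmind_city
    group by country, region_id, name
);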
Note the lack of HAVING in the delete query: a group with only one row simply keeps its min(id), so only the extra rows in duplicate groups are removed.
This deletes the second city within close proximity to a city of the same name in the same country:
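A sketch, assuming the earthdistance module and the columns from the question's query; of each close pair, the row with the larger id (the "second" city) is deleted:

delete from climate.maxmind_city c2
using climate.maxmind_city c1
where
    c1.id < c2.id and
    c1.country = c2.country and
    c1.name = c2.name and
    -- earth_distance returns metres: 10 km threshold
    earth_distance(
        ll_to_earth( c1.latitude_decimal, c1.longitude_decimal ),
        ll_to_earth( c2.latitude_decimal, c2.longitude_decimal ) ) <= 10000;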
If the data was imported from CSV files using PHP code, you can prevent duplicate entries with a condition in the PHP import loop: if the city being inserted already exists, skip the current record and continue with the next one. The same check can also be enforced at the SQL level, as sketched below.
Try this if you imported the data into the database that way.
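A sketch of that SQL-level guard, assuming a unique constraint on (country, region_id, name) is acceptable and using the question's column names; PostgreSQL 8.4 has no ON CONFLICT, so each insert is guarded with NOT EXISTS:

-- reject exact duplicates at the database level
create unique index maxmind_city_uniq
    on climate.maxmind_city (country, region_id, name);

-- insert a row only if no matching city exists yet
insert into climate.maxmind_city
    (id, latitude_decimal, longitude_decimal, country, region_id, name)
select 2143220, 41.3513889, 68.9444444, 'KZ', '10', 'Abay'
where not exists (
    select 1
    from climate.maxmind_city
    where country = 'KZ' and region_id = '10' and name = 'Abay'
);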
Thanks.