Help with a complex query for partial/full matches on address, zip code and phone
Example Data:
$address_1 = '123 Main St.';
$address_2 = 'Suite 200';
$phone = '1235551212';
$zip = '12345';
Example Database:
record_id, address_1, address_2, zip, phone
123, '123 main street', '', '12345', '1234567890'
124, '500 E. Ninja Road', 'Suite #200', '12345-1111', '(321)654-0987'
125, '222 where 4 east circle', 'P.O. Box 3', '11111', '1-123-555-1212'
This can be separate queries but I would need to search for partial matches in the address and zip and full matches in the phone.
So:
$address_1 would partially match record_id 123, as '123 Main St.' is a variation of '123 main street'
$address_2 would match record_id 124, as 'Suite 200' is a variation of 'Suite #200'
$phone would match record_id 125, as '1235551212' is a variation of '1-123-555-1212'
$zip would match record_id 123 & 124, as '12345' is a variation of '12345-1111' and an exact match to '12345'
NOTE:
The values could also be switched, meaning that $address_1 could be formatted like '123 main street' while record_id 123 holds '123 Main St.' (this applies to all fields).
It has been suggested that I try ILIKE, LIKE, SIMILAR TO, CITEXT and FTS (full text search), all of which are great, but I'm not sure how to implement them to get the results I want.
I don't mind running multiple queries, such as one query for the $address_1 match, another for the $address_2 match, and so on. I also know that there will be false positives as well as false negatives, but I'm hoping to be around 75% (or better) in accuracy.
One IMPORTANT note is that the Postgres Server is running version 7.4 and there are no plans to upgrade.
Also, to add more complexity to the query, there are multiple address_1, address_2, zip and phone fields (think separate address/phone for home and office).
Here are my first attempts at solving this problem:
I had an idea to produce the most common formats and then pass those as the parameters in the query.
Something like:
$address_1 = array(
'123 Main St.', // original
'123 main st.', // lower case
'123 Main St.', // First Letter Upper Case
'123 MAIN ST.', // ALL Upper Case
'123 Main St', // remove punctuation original
'123 main st', // remove punctuation lower case
'123 Main St', // remove punctuation First Letter Upper Case
'123 MAIN ST', // remove punctuation ALL Upper Case
'123 Main', // remove last word original
'123 main', // remove last word lower case
'123 Main', // remove last word First Letter Upper Case
'123 MAIN', // remove last word ALL Upper Case
'123 Main%', // remove last word original with wildcard
'123 main%', // remove last word lower case with wildcard
'123 Main%', // remove last word First Letter Upper Case with wildcard
'123 MAIN%' // remove last word ALL Upper Case with wildcard
);
Then the query would be something like this:
SELECT *
FROM tbl_name
WHERE address_1 IN (
'123 Main St.', '123 main st.', '123 Main St.',
'123 MAIN ST.', '123 Main St', '123 main st',
'123 Main St', '123 MAIN ST', '123 Main',
'123 main', '123 Main', '123 MAIN',
'123 Main%', '123 main%', '123 Main%',
'123 MAIN%'
)
Just seems like a ton of variations I would have to make and I'm still not sure if this would be the most optimal way.
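The variant list above could probably be collapsed into a single pattern if case is left to ILIKE and punctuation is stripped before querying. A minimal sketch, assuming a hypothetical helper named address_prefix_pattern() (not part of the original code) and assuming the last word is usually the street type:

```php
<?php
// Hypothetical helper: reduce an address line to one lower-case prefix pattern.
function address_prefix_pattern(string $address): string
{
    // Lower-case, then drop everything that is not a letter, digit or space.
    // This also removes the LIKE wildcards % and _ from the input.
    $clean = preg_replace('/[^a-z0-9 ]+/', '', strtolower($address));
    // Split on whitespace, discarding empty pieces.
    $words = preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
    // Drop the last word (assumed to be the street type) if more than one word remains.
    if (count($words) > 1) {
        array_pop($words);
    }
    // The trailing % turns the value into a prefix match for LIKE/ILIKE.
    return implode(' ', $words) . '%';
}

echo address_prefix_pattern('123 Main St.'), "\n";    // 123 main%
echo address_prefix_pattern('123 main street'), "\n"; // 123 main%
```

Both spellings reduce to the same pattern, so it does not matter which side holds 'St.' and which holds 'street'. Note that a pattern like this only acts as a wildcard with LIKE or ILIKE; inside an IN (...) list the % is compared literally.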
UPDATE:
Well this kinda works (from Stack Question #2)
SELECT *
FROM tbl_name
WHERE LOWER(address_1) ILIKE LOWER('123 Main%')
This is combined with the UNION method (from Stack Question #1) for each additional address field search, such as Office and Home.
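A rough sketch of how the UNION query could be assembled for several address columns. The column names home_address_1 and office_address_1 and the helper build_address_union() are assumptions for illustration; the real table may be laid out differently.

```php
<?php
// Hypothetical helper: one ILIKE branch per address column, joined with UNION.
function build_address_union(array $columns, string $table = 'tbl_name'): string
{
    $parts = [];
    foreach ($columns as $column) {
        // Column names cannot be bound as parameters, so accept plain identifiers only.
        if (!preg_match('/^[a-z_][a-z0-9_]*$/i', $column)) {
            throw new InvalidArgumentException("Bad column name: $column");
        }
        // $1 is a single bound pattern that every branch reuses.
        $parts[] = "SELECT record_id FROM $table WHERE $column ILIKE \$1";
    }
    // UNION (rather than UNION ALL) drops records matched by more than one branch.
    return implode("\nUNION\n", $parts);
}

$sql = build_address_union(['home_address_1', 'office_address_1']);
echo $sql, "\n";
// With a live connection the query could then be run along these lines:
// $result = pg_query_params($conn, $sql, ['123 main%']);
```

The PHP manual states that pg_query_params() needs a PostgreSQL 7.4 or later server, so it should be usable here, but that is worth confirming against the actual setup.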
Comments (2)
If I understand you correctly, you need fragmentary matches from field to field (e.g. '123' matches '123-111' and '12345' but not '122234' or '122-345') and you need to ignore all symbols when matching. Is this correct?
The pseudocode below is what I would try first:
It sounds like the best way to achieve this would be through ILIKE queries where the percent signs are at the beginning and the end of the fragment, such as:
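The answer's own pseudocode did not survive on this page, so what follows is only a sketch of the idea under stated assumptions: fragment_pattern() is a hypothetical helper, and the query assumes that translate() on the 7.4 server deletes characters that have no counterpart in its third argument.

```php
<?php
// Hypothetical helper: turn user input into a "contains" pattern for ILIKE.
function fragment_pattern(string $value): string
{
    // Keep letters, digits and spaces only; this also removes
    // the LIKE wildcards % and _ from user input.
    $clean = preg_replace('/[^A-Za-z0-9 ]+/', '', $value);
    // Percent signs at both ends make the fragment match anywhere.
    return '%' . trim($clean) . '%';
}

// Assumed query shape: strip the same symbols from the stored value,
// so that '12345-1111' is compared as '123451111'.
$sql = "SELECT record_id FROM tbl_name "
     . "WHERE translate(zip, '-#.()', '') ILIKE \$1";

echo fragment_pattern('123'), "\n";        // %123%
echo fragment_pattern('Suite #200'), "\n"; // %Suite 200%
```

A leading % generally prevents an ordinary index from being used, which is part of the performance concern mentioned next.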
You would have to perform this query multiple times, which could be a performance headache, but I believe that may get you what you want.
I have a few suggestions for approaches.
I would strongly consider stripping all non-numerics from phone number.
You can do this in postgres with something like this
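The original snippet is missing from this page. As a stand-in, here is a minimal sketch that strips the input on the PHP side and the stored value with translate(). The helpers phone_digits() and phone_key() are hypothetical, and the last-10-digits rule is an assumption that suits North American numbers such as the ones in the question.

```php
<?php
// Hypothetical helper: keep only the digits of a phone number.
function phone_digits(string $phone): string
{
    return preg_replace('/\D+/', '', $phone);
}

// Hypothetical helper: compare on the last 10 digits so a leading country
// code, such as the "1" in '1-123-555-1212', does not block a match.
function phone_key(string $phone): string
{
    return substr(phone_digits($phone), -10);
}

// Assumed query shape: delete common phone symbols from the stored value,
// then match on how the remaining digits end.
$sql = "SELECT record_id FROM tbl_name "
     . "WHERE translate(phone, '()-. +', '') LIKE \$1";
$pattern = '%' . phone_key('1235551212');

echo phone_key('1-123-555-1212'), "\n"; // 1235551212
echo phone_key('(321)654-0987'), "\n";  // 3216540987
echo $pattern, "\n";                    // %1235551212
```

regexp_replace() would be the more obvious tool on newer servers, but as far as I know it only arrived in PostgreSQL 8.1, which is why translate() is used in this sketch.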
I am not sure if this will work for you, but many address verification systems ignore everything except the numbers for matching purposes.
For example,
would match 12345678910, which you could implement with something like this...
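The example and code that belonged here are missing from this page, so the following is only an illustration of the digits-only idea using the question's sample data. The helper address_digits() is hypothetical.

```php
<?php
// Hypothetical helper: concatenate the digits of every address part, in order.
function address_digits(string ...$parts): string
{
    return preg_replace('/\D+/', '', implode('', $parts));
}

// Two spellings of the same address reduce to the same key.
echo address_digits('123 Main St.', 'Suite 200', '12345'), "\n";    // 12320012345
echo address_digits('123 main street', 'Ste. #200', '12345'), "\n"; // 12320012345
```

A ZIP+4 value such as '12345-1111' would add four extra digits, so the zip part would probably need trimming to its first five digits before it is combined.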
Additionally, you could investigate the Postgres Soundex extensions:
http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html which may help simplify correcting for spelling variations, casing, spacing, punctuation etc.
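To get a feel for what Soundex does before installing anything on the server, PHP's built-in soundex() can be tried on the client side. This sketch uses PHP rather than the fuzzystrmatch module linked above (whether that module can be installed on a 7.4 server is something to check), and soundex_words() is a hypothetical helper.

```php
<?php
// Hypothetical helper: Soundex code for each alphabetic word in a string.
function soundex_words(string $text): array
{
    // Split on anything that is not a letter, which drops digits and punctuation.
    $words = preg_split('/[^A-Za-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('soundex', $words);
}

// Case, punctuation and a small misspelling no longer matter.
var_dump(soundex_words('Ninja Road') === soundex_words('ninga rd.'));

// Abbreviations that drop consonants still produce different codes.
var_dump(soundex('Street') === soundex('St'));
```

So Soundex helps with misspellings and casing, but abbreviations such as 'St' for 'Street' would still need their own handling.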