Help with a complex query for partial/full matches on address, zip code and phone

Posted on 2024-10-13 10:46:05

Example Data:

$address_1  = '123 Main St.';
$address_2  = 'Suite 200';
$phone      = '1235551212';
$zip        = '12345';

Example Database:

record_id, address_1, address_2, zip, phone
123, '123 main street', '', '12345', '1234567890'
124, '500 E. Ninja Road', 'Suite #200', '12345-1111', '(321)654-0987'
125, '222 where 4 east circle', 'P.O. Box 3', '11111', '1-123-555-1212'

This can be separate queries but I would need to search for partial matches in the address and zip and full matches in the phone.

So
$address_1 would partial match record_id 123 as '123 Main St.' is a variation of '123 main street'
$address_2 would match record_id 124 as 'Suite 200' is a variation of 'Suite #200'
$phone would match record_id 125 as '1235551212' is a variation of '1-123-555-1212'
$zip would match record_id 123 & 124 as '12345' is a variation of '12345-1111' and a match to '12345'

NOTE:
Also, the values could be switched, meaning that $address_1 could be formatted like this: '123 main street' and record_id 123 could be like this: '123 Main St.' (This applies to all fields.)
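
Since either side can hold either format, one way to keep the comparison symmetric is to reduce both values to the same key before comparing. Here is a minimal sketch for the zip field, written in Python purely for illustration. It assumes US-style zip codes where the first five digits decide the match; `zip_key` and `zips_match` are hypothetical helper names, not something from the question.

```python
import re

def zip_key(value):
    """Return the first five digits of a zip code, ignoring any ZIP+4 suffix."""
    digits = re.sub(r"[^0-9]", "", value)
    return digits[:5]

def zips_match(a, b):
    """Symmetric comparison: the order of the arguments does not matter."""
    key_a, key_b = zip_key(a), zip_key(b)
    # Require a full five-digit key so empty or truncated values never match.
    return len(key_a) == 5 and key_a == key_b
```

With the example data, `zips_match('12345', '12345-1111')` and `zips_match('12345-1111', '12345')` are both true, while `'11111'` is rejected.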

It has been suggested that I try ILIKE, LIKE, SIMILAR, CITEXT and FTS (Full Text Search), all of which are great, but I'm not sure how to implement them to get the results I want.
I don't mind running multiple queries, such as one query for the $address_1 match, another for the $address_2 match, and so on. I also know that there will be false positives as well as false negatives, but I'm hoping to be around 75% (or better) in accuracy.

One IMPORTANT note is that the Postgres Server is running version 7.4 and there are no plans to upgrade.

Also, to add more complexity to the query, there are multiple address_1, address_2, zip and phone fields (think separate address/phone for home and office).

Here are my first attempts at solving this problem:

I had an idea to produce the most common formats and then pass those as the parameters in the query.

Something like:

$address_1  = array(
   '123 Main St.', // original
   '123 main st.', // lower case
   '123 Main St.', // First Letter Upper Case
   '123 MAIN ST.', // ALL Upper Case
   '123 Main St',  // remove punctuation original
   '123 main st',  // remove punctuation lower case
   '123 Main St',  // remove punctuation First Letter Upper Case
   '123 MAIN ST',  // remove punctuation ALL Upper Case  
   '123 Main',     // remove last word original
   '123 main',     // remove last word lower case
   '123 Main',     // remove last word First Letter Upper Case
   '123 MAIN',     // remove last word ALL Upper Case 
   '123 Main%',    // remove last word original with wildcard
   '123 main%',    // remove last word lower case with wildcard
   '123 Main%',    // remove last word First Letter Upper Case with wildcard
   '123 MAIN%'     // remove last word ALL Upper Case with wildcard
);

Then the query would be something like this:

SELECT * 
FROM tbl_name
WHERE address_1 IN (
   '123 Main St.', '123 main st.', '123 Main St.',
   '123 MAIN ST.', '123 Main St', '123 main st',
   '123 Main St', '123 MAIN ST', '123 Main',
   '123 main', '123 Main', '123 MAIN',
   '123 Main%', '123 main%', '123 Main%', 
   '123 MAIN%'
)

Just seems like a ton of variations I would have to make and I'm still not sure if this would be the most optimal way.
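
The variation list does not have to be typed by hand. Here is a rough sketch in Python (used only for illustration) that generates it, assuming the same transformations the array shows: case changes, punctuation removal, dropping the last word, and a trailing wildcard. `address_variations` is a hypothetical name. Duplicates are dropped, so the result is shorter than the 16 entries in the array.

```python
import string

def address_variations(address):
    """Build the case/punctuation/last-word variants listed in the question."""
    # Remove every ASCII punctuation character, e.g. the trailing "." in "St."
    no_punct = address.translate(str.maketrans("", "", string.punctuation))
    # Drop the last word; this is empty for single-word input.
    no_last = " ".join(no_punct.split()[:-1])

    bases = [address, no_punct]
    if no_last:
        # Skip these for single-word input, otherwise a bare "%" would match everything.
        bases += [no_last, no_last + "%"]

    seen, result = set(), []
    for base in bases:
        for variant in (base, base.lower(), base.title(), base.upper()):
            if variant not in seen:
                seen.add(variant)
                result.append(variant)
    return result
```

One caveat: the `%` entries only act as wildcards when compared with LIKE or ILIKE. Inside an IN list they are matched literally, so those entries need a separate condition.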

UPDATE:

Well this kinda works (from Stack Question #2)

SELECT * 
FROM tbl_name
WHERE LOWER(address_1) ILIKE LOWER('123 Main%')

I combine this with the UNION method (from Stack Question #1) for each additional address field to search, such as Office and Home.
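
Here is a sketch of how the UNION query could be assembled for several address fields, again in Python for illustration only. The column names `home_address_1` and `office_address_1` are hypothetical, since the question does not give the real ones, and `%s` stands for whatever parameter placeholder the database driver uses. ILIKE is already case-insensitive, so the LOWER() calls are left out.

```python
def build_union_query(table, columns, pattern):
    """Return (sql, params): one SELECT per address column, joined by UNION."""
    selects = [
        "SELECT * FROM {0} WHERE {1} ILIKE %s".format(table, column)
        for column in columns
    ]
    # The same search pattern is bound once per SELECT.
    return "\nUNION\n".join(selects), [pattern] * len(columns)

sql, params = build_union_query(
    "tbl_name",
    ["home_address_1", "office_address_1"],  # hypothetical column names
    "123 Main%",
)
```

The table and column names are interpolated directly into the SQL, so they should come from a fixed list in the code and never from user input. Only the search pattern is passed as a parameter.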

意中人 2024-10-20 10:46:05

If I understand you correctly, you need fragmentary matches from field to field (e.g. '123' matches '123-111' and '12345' but not '122234' or '122-345'), and you need to ignore all symbols when matching. Is this correct?

The pseudocode below is what I would try first:

Process all input fields to remove symbols
Tokenize the fields to determine fragments to match
Perform matches on fragments to data values

It sounds like the best way to achieve this would be through ILIKE queries where the percent signs are at the beginning and the end of the fragment, such as:

SELECT * FROM [TABLE_NAME] WHERE address_1 ILIKE '%fragment%'

You would have to perform this query multiple times, which could be a performance headache, but I believe that may get you what you want.
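
The three steps above (strip symbols, tokenize, match fragments) could be sketched like this in Python, used here only for illustration. `ilike_patterns` and `where_clause` are hypothetical helper names, and `%s` stands for the driver's parameter placeholder.

```python
import re

def ilike_patterns(value):
    """Strip symbols, split into tokens, and wrap each token in % wildcards."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", value)
    tokens = cleaned.lower().split()
    return ["%" + token + "%" for token in tokens]

def where_clause(column, patterns):
    """Require every fragment to match the column (one ILIKE per fragment)."""
    return " AND ".join("{0} ILIKE %s".format(column) for _ in patterns)

patterns = ilike_patterns("123 Main St.")   # ['%123%', '%main%', '%st%']
clause = where_clause("address_1", patterns)
```

Joining the fragments with AND lets this run as a single query instead of one per fragment. Short fragments such as `%st%` match a lot of rows on their own (for example 'east'), so it may be worth skipping tokens below some minimum length. That threshold is a judgment call, not something the approach dictates.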

-黛色若梦 2024-10-20 10:46:05

I have a few suggestions for approaches.

I would strongly consider stripping all non-numeric characters from the phone number.
You can do this in Postgres with something like this:

SELECT id FROM [table] WHERE regexp_replace(phone, '[^0-9]', '', 'g') = [$phone];
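
One thing to watch with the example data: stripping non-digits turns the stored '1-123-555-1212' into '11235551212', which still differs from '1235551212' because of the leading country code. Here is a minimal sketch in Python (for illustration only) that compares just the last ten digits. It assumes North American 10-digit numbers, which the question does not state; `phone_key` and `phones_match` are hypothetical names.

```python
import re

def phone_key(value):
    """Keep digits only, then keep the last ten (drops a leading country code)."""
    digits = re.sub(r"[^0-9]", "", value)
    return digits[-10:]

def phones_match(a, b):
    """Full match on the normalized ten-digit key, in either argument order."""
    key_a, key_b = phone_key(a), phone_key(b)
    return len(key_a) == 10 and key_a == key_b
```

As far as I know, `regexp_replace` only arrived in Postgres releases after 7.4, so on that server the normalization may have to happen outside the query. That means either storing the normalized key in its own column or fetching candidate rows and filtering them in application code. Which of the two fits is not something the question settles.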

I am not sure if this will work for you, but many address verification systems ignore everything except the numbers for matching purposes.

For example,

12-34 E. 5th street
APT 6B
City, ST 78910

would match 12345678910, which you could implement with something like this:

select * FROM [table] 
WHERE regexp_replace(address_1 || address_2 || zip, '[^0-9]', '', 'g') = $addressNumbers;
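
The same digits-only key can be computed in application code. Here is a small Python sketch, for illustration only; `address_number_key` is a hypothetical name.

```python
import re

def address_number_key(*parts):
    """Concatenate the address parts and keep only the digits."""
    return re.sub(r"[^0-9]", "", "".join(parts))

# The example address from above:
key = address_number_key("12-34 E. 5th street", "APT 6B", "78910")
```

Here `key` is '12345678910'. An exact comparison on this key is strict, though. With the question's data, '123 Main St.' / 'Suite 200' / '12345' yields '12320012345', while record 123 yields '12312345'. A suite number or a ZIP+4 suffix present on only one side breaks the match. Building the key per field, or from address_1 and the first five zip digits only, may be more forgiving.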

Additionally, you could investigate the Postgres Soundex extensions (http://www.postgresql.org/docs/8.3/static/fuzzystrmatch.html), which may help simplify correcting for spelling variations, casing, spacing, punctuation, etc.
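
To give a feel for what a phonetic code does, here is a sketch of the classic American Soundex algorithm in Python. It is for illustration only and is not claimed to behave identically to the `soundex` function in fuzzystrmatch.

```python
# Letter groups of classic Soundex; vowels, H, W and Y carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the four-character Soundex code, or '' if there are no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]
```

Spelling variants such as 'Main' and 'Mane' share the code 'M500'. Abbreviations do not benefit: 'St' gives 'S300' while 'Street' gives 'S363', so those would still need a lookup table or prefix matching.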
