Normalization/validation of international datasets in a database?
Let's say you are dealing with your normal contact database (you know... name, phone number, address, email, etc.). If you're only concerned about local data, it's generally not a big issue to deal with, but when we look at international sets it is.
Looking at the phone number system, you would think it's simple, but it's really not. In North America, we generally use the 1-222-333-4444 format for calling people.
This is of course divided into your international dialing code, area code, exchange prefix, and line number. Problem: real phone numbers are limited. Only around 220 of the potential 1,000 area codes are assigned in the US, each area code has only a limited number of exchanges, and line numbers are restricted to specific uses within that country (for example, patterns containing 911 are restricted, and only about three-quarters of the 10,000 possible line numbers are in use). Take this over to the UK: they have their own set of rules for line numbers, such as reserving most of the 0300-0399 block for specific uses, among other restrictions. International codes are also limited. Normalizing area codes and exchanges, and putting data validation checks onto phone numbers, just got complicated. I'm not going into detail about places that are not part of the NPA scheme, but let's just accept that we can't really trust the North American template, kick back, and call it a day.
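To make the shape of the problem concrete, here is a minimal sketch of a purely structural NANP check. The restrictions above (leading digits, N11 service codes) are hard-coded as assumptions; a check like this cannot know which area codes or exchanges are actually assigned — that would need an external registry.

```python
import re

# Structural NANP check: shape only, not assignment status.
NANP_RE = re.compile(r"^\+?1?[-. ]?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})$")

def looks_like_nanp(number: str) -> bool:
    m = NANP_RE.match(number.strip())
    if not m:
        return False
    area, exchange, _line = m.groups()
    # Area codes and exchanges may not begin with 0 or 1, and N11
    # codes (411, 911, ...) are reserved for services.
    for part in (area, exchange):
        if part[0] in "01" or part.endswith("11"):
            return False
    return True
```

Even this small sketch already encodes country-specific knowledge, which is exactly why a single template does not travel well.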
How do we normalize for things like this? How do we validate data? How do we deal with these seemingly ad-hoc extension codes or instructions for internal dialing?
International addresses are not much better: not only does the data retained differ, but the output formats aren't the same across the board either. How do we deal with international postal codes, when in Canada the format is A1A 1A1, and the USA has a system such as 55555[-4444]?
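One hedged way to express the per-country difference is a table of format patterns keyed by country. The entries below are illustrative and deliberately incomplete; like the phone check, they validate shape only, not whether a code is actually assigned:

```python
import re

# Format-only postal code patterns, keyed by ISO 3166 country code.
POSTAL_PATTERNS = {
    "US": re.compile(r"^\d{5}(-\d{4})?$"),           # 55555 or 55555-4444
    "CA": re.compile(r"^[A-Z]\d[A-Z] ?\d[A-Z]\d$"),  # A1A 1A1, space optional
    "GB": re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$"),  # rough UK shape
}

def postal_code_ok(country: str, code: str) -> bool:
    pattern = POSTAL_PATTERNS.get(country)
    if pattern is None:
        return True  # no rule recorded: accept, flag for review elsewhere
    return bool(pattern.match(code.strip().upper()))
```

The "no rule recorded" branch is a design choice: rejecting everything you have no pattern for would make the system unusable for new countries.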
I'm tempted to just write classes for each of these situations as I encounter them, and store them in the database as XML/JSON/similar, but then how do I relate fields and easily search my content? I don't want to end up creating a monster of thousands of tables, one per country. I want an easily scalable solution where I can normalize my addresses and validate the content. Is this too much to ask?
A way of approaching this problem may be:
adopt three views of addresses/phone numbers/post codes etc.
The first view is that of addresses (say) as multiple lines of text.
The second view is that of address tags (more on this below).
The third view is that of validation rules for addresses
The other components that are needed for this approach are a generic procedure (class/trigger) for validation; a formatting routine for printing purposes; and a rule base, together with an administrator's mechanism for updating the validation rules. You also need a "catchall" rule that says this is a valid address - it has been validated manually - even if it meets none of the rules in your rule base.
The components:
1 An address is made up of multiple lines, each of which has an associated sequence number and a number of tags (usually one). It is also possible to associate with an address line the set of rules, and versions of rules, that it was validated against, but this is a refinement that depends on your update/insert/calculation rates.
2 The address tags are things like city; town; house number; they identify the different lines of an address. It is possible to have an address line that does not have any tags, but in this case only generic searches (eg for New York) are possible over the full set of lines; searches for "City = New York" are not possible.
3 Rules written in a domain specific language. This could be regular expressions; a language specific to you; or your normal development language (though this is likely to be the least useful approach, as programming languages can find it difficult to concisely and accurately represent rules of the sort I am talking about). An example of representative rules might be (dealing with your description of US Zip Codes) -
the first five characters must be digits.
the first five characters represent an "area code".
the zip code must be the last line of an address.
The rules will be divided into groups and sets (eg US addresses). Rules must be able to refer to other rules as well as to the address data.
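Encoded as data, such grouped rules might look like the following sketch: each rule is a name plus a predicate over the address lines, and rules are gathered into named sets. The shape (name/predicate pairs, sets keyed by label) is my own illustration, not a prescribed format:

```python
import re

# Hypothetical rule base: named predicates over address lines,
# grouped into named sets. A real rule base would also let rules
# reference other rules; that is omitted here for brevity.
RULE_SETS = {
    "US address": [
        ("zip is the last line: five digits, optional -4444",
         lambda lines: bool(re.fullmatch(r"\d{5}(-\d{4})?",
                                         lines[-1].strip()))),
    ],
    "CA address": [
        ("postal code A1A 1A1 on the last line",
         lambda lines: bool(re.fullmatch(r"[A-Z]\d[A-Z] ?\d[A-Z]\d",
                                         lines[-1].strip().upper()))),
    ],
}
```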
Your validation routine needs to take the lines of an address and apply the rules to them (usually by set). It will return a boolean - valid or invalid address - and optionally the set of rules it validated against.
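A sketch of that routine, including the manual-verification catchall described earlier. The rule-set shape (set name mapped to a list of predicates) and the "MANUALLY VERIFIED" marker line are assumptions for illustration:

```python
# Validation routine sketch: try each rule set in turn; report the
# first set whose rules all pass, or fall back to the manual catchall.

def validate_address(lines, rule_sets):
    """Return (is_valid, name_of_matching_set_or_None)."""
    for set_name, predicates in rule_sets.items():
        if all(pred(lines) for pred in predicates):
            return True, set_name
    # Catchall rule: an address marked as manually verified is always
    # valid, even if it matches no rule set.
    if any(line.strip() == "MANUALLY VERIFIED" for line in lines):
        return True, "manual"
    return False, None
```

Ordering the sets so the most common cases are tested first (as suggested later in this answer) falls out naturally from the iteration order.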
The printing routine again will apply the appropriate rule set (probably different from your validation set) to the address data to provide a formatted address.
I hope the other components are obvious from the overall approach.
This approach is intended to deal with these issues identified in your question:
Partitioning of phone codes.
Limited number of possible area codes in use.
Blocks of phone numbers reserved for specific purposes.
Normalisation of data - the data is normalised. However, this type of normalisation (reverse indexing) is not generally used, except in data warehouse software and databases containing massive amounts of real-time sensor information. It is possible that in implementing this solution you may end up choosing to (controllably) duplicate data. This is not an inherent part of the solution, but may be convenient.
I would strongly suggest NOT adding classes for each variant - this is not scalable, nor maintainable.
Searching is covered in detail below.
Monster tables are avoided - the rule base is likely to be of the order of hundreds to low thousands of rules in addition to the actual data.
The solution is scalable - one simply adds or amends rules.
and also deal with some related problems.
Even if you can apply validation rules to national formats of addresses, there will always be exceptions to the standards for a particular country. My own address is an example - I live on a boat, and need additional information included in my address, over and above the Post Office standard address. Anomalies of this sort are always likely to need manual intervention - hence the rule for acceptance by manual intervention.
Different countries have different orderings for addresses - addresses in China for instance are written: Country; Post Code; City; City Zone; Street Name; House Number; Person Name.
Acceptance of the first address from an area where you have no rules, and the rules of the country are different from any you have recorded.
People wanting to use (eg an office) address different from "their" address.
Specific individual addressing concerns - somebody wishing to conceal their correspondence from their nearest and dearest.
The solution can be extended to similar problems - eg referring to a person. This can involve titles - Dr, Rev, etc; multiply-hyphenated names (ffoulkes-symthe); qualifications (CPA, BSc, etc); familial qualifications (the Third, etc); and multiple modes of naming depending on the person's culture (people from the Indian subcontinent often do not have a family name, and each family member may apparently have a different surname).
Changes to addressing rules (eg addition of a new Zip Code for a new development) can be made easily and quickly.
Problems still arising are:
Searches will be somewhat more complicated - needing to search all lines and associated tags of addresses rather than specific address fields
It takes time to build up a rule base of this sort - whilst initial rule loading can be done fairly quickly (examples are present in your question), anything like a complete set will only exist after you have dealt with multiple exceptions and anomalies.
EDIT:
I wasn't aware of BigTable when I wrote my response. Having had a very quick look at it, it does seem to be a very similar concept. As for implementing this in an ACID environment, I think it is possible for the contact dataset and for the rule database. From experience I do know it scales easily to 5*10^7 sets of contact details in such an environment (though with background, non-time-critical validation of contact details).
In considering the ACID/SQL case my use of the word "view" may have started a hare going off in a direction I did not intend. A more appropriate word may have been overview or outlook (something with no relational model or DBMS freight attached). In fact I would consider each of the things I referred to as a view, as being a candidate table.
I set out a sketch of a schema for my approach to assist discussion. This schema uses some M:N connections, which would obviously be normalised out in an implementation as associative tables. This is left as an exercise for the reader :-). I have put a few sample attributes and data into some of the tables.
In the sketch, editing of Rule Set, Rule, and rules relating to other rules (the administrative application) can obviously be done with ACID properties by normal SQL CRUD operations. Operations on Address User, Address, and Address Line can equally be done in an ACID/SQL manner, taking the address lines as input by the user. You may wish to do some preprocessing of the transaction to separate out (in my example) House Number from Road Name, and consequently lose the "Road Name" rule. Other preprocessing that is likely to be useful is the standardisation of capitalisation - though this could also be part of the validation step. It is also possible to simply accept the complete line "9 Richmond Road" as input, and tag it as "Needs Validation". In either case there is no problem (that I am aware of) with making this transaction ACID.
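The schema sketch itself is not reproduced in this copy, so the following is my own guess at the candidate tables as SQLite DDL, with the M:N links (line-to-tag, rule-to-rule) already normalised into associative tables. All table and column names are illustrative:

```python
import sqlite3

# Guessed DDL for the sketch schema. Address lines carry a sequence
# number; tags attach to lines via line_tag; rules live in sets and may
# reference other rules via rule_rule.
DDL = """
CREATE TABLE address      (id INTEGER PRIMARY KEY);
CREATE TABLE address_line (id INTEGER PRIMARY KEY,
                           address_id INTEGER REFERENCES address(id),
                           seq INTEGER NOT NULL,
                           line TEXT NOT NULL);
CREATE TABLE tag          (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE line_tag     (line_id INTEGER REFERENCES address_line(id),
                           tag_id  INTEGER REFERENCES tag(id));
CREATE TABLE rule_set     (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE rule         (id INTEGER PRIMARY KEY,
                           rule_set_id INTEGER REFERENCES rule_set(id),
                           pattern TEXT NOT NULL);
CREATE TABLE rule_rule    (rule_id INTEGER, refers_to_rule_id INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# A tagged search like 'City = New York' becomes a three-way join:
CITY_QUERY = """SELECT al.line FROM address_line al
                JOIN line_tag lt ON lt.line_id = al.id
                JOIN tag t ON t.id = lt.tag_id
                WHERE t.name = 'city' AND al.line = 'New York'"""
conn.execute(CITY_QUERY)
```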
Where I am less clear is how the validation and subsequent tagging can be incorporated into the ACID/SQL transaction. It does seem that to make the validation step ACID, it may be necessary to order the application of the testing against the rule sets (with the most common cases tested first, and stopping when a validation is successful). Also it may be necessary, in order to make the transaction ACID, to validate only against your most common case, leaving the others tagged as "Need Validation", which then would be done as a background task.
The actual task of validation involves checking the entire rule base, set by set, until a set of rules is found which validates all the input lines. There are obviously potential guidelines from the data itself - if you have recorded the country, then this both enables you to tag that line with country straight away, and provides a filter for the rule sets you have to test against. I am sorry, but this is as far as I can take this aspect so far.
Incidentally, this sketch schema only goes part way towards total normalisation. As drawn, each address has its own sequence of address lines, and there is no normalisation of the data beyond the individual lines as input (plus or minus any pre-processing you choose to do). It is possible to take this approach even further - by making the link between Address and Address Line M:N and ensuring that the Line field of the Address Line table is its primary key.
When it comes to more detailed resources for this concept I have two problems.
The first (and trivial) one is that this concept is my original work, based on my experience over more than twenty years as a method consultant and IT Strategy consultant with a special interest in development environment technology and development methods. All my working life has been spent in environments where contact details have been a major concern (for financial and regulatory/legislative reasons). In fact my original response to your question was complete in my mind before I had finished reading your question, even though it did then take me about three-quarters of an hour to get it typed up.
The more important reason is that some of the sources of this idea are confidential or secret. In my working career, part of my job involved keeping up to date with technology developments, and predicting the impact of technology on the business in ten years time. This involved visiting research labs and having discussions with leading researchers about a variety of topics. While I am not, myself, a first class researcher, I do seem to be very good at synthesising the research efforts of others. However, while doing this I was always operating under conditions of commercial confidentiality and/or military secrecy. None of my answer has breached these conditions. As a result of this I can only give vague guidelines as to how the information was derived.
My sources are:
a seminar conducted by C J Date, at IBM, exploring the further reaches of normalisation and the Relational Model (NOT the Relational Model as implemented in SQL). This involved exploration of fifth (?) and sixth (?) normal forms.
a series of discussions with Oracle technical staff over a period of time, discussing meta data; meta meta data; and other such generalisations.
discussions with a UK based military research establishment. Though this was some years ago, I am not sure whether anything has ever been published on the topics we were discussing.
working in a large financial institution whose contact details system was shaped much like my proposal, but arose from non-relational roots; and for which the original technical impetus was space saving, in an era when memory, persistent storage, and backup capacity were major concerns.
EDIT:
I completed the above edit, shut down my computer, did some household chores, went to bed, lay down, closed my eyes, and had the solution to the part I could not complete in that previous edit.
While there are updates to be done when tagging/validating, much of the work is actually reading and comparison. As such, it is a prime candidate for optimistic locking.
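A sketch of such an optimistic-locking validate-and-tag loop, with a tiny in-memory store standing in for the database. The Store class and its integer version stamp are my own stand-ins; a real system would read and conditionally update a row carrying a version (or timestamp) column:

```python
# Optimistic locking: snapshot the address and its version, do the slow
# rule matching without holding any locks, then commit the tags only if
# nothing else touched the address in the meantime; otherwise retry.

class Store:
    def __init__(self):
        self.rows = {}       # addr_id -> (lines, tags)
        self.versions = {}   # addr_id -> int version stamp

    def put(self, addr_id, lines):
        self.rows[addr_id] = (lines, None)
        self.versions[addr_id] = self.versions.get(addr_id, 0) + 1

    def read(self, addr_id):
        lines, _ = self.rows[addr_id]
        return lines, self.versions[addr_id]

    def write_tags_if_unchanged(self, addr_id, version, tags):
        if self.versions[addr_id] != version:
            return False     # concurrent update: caller must retry
        lines, _ = self.rows[addr_id]
        self.rows[addr_id] = (lines, tags)
        return True

def validate_with_optimistic_lock(store, addr_id, matcher):
    while True:
        lines, version = store.read(addr_id)   # snapshot + version stamp
        tags = matcher(lines)                  # long-running, lock-free
        if store.write_tags_if_unchanged(addr_id, version, tags):
            return tags
```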
All three types of transaction (Address and Address Line updates; Rule Set and Rule updates; and Tag and Tag Usage updates) fulfil the ACID conditions. I strongly believe (but have not proved) that any interleaving, combination, or intersecting set of these three types of transaction will also fulfil the ACID conditions.
Possibly already answered for phone numbers at least. You could do something similar for postal codes.
If I were to implement this I would save phone numbers, zip codes etc. as regular strings. In particular, the data should be stored in the format the end user needs (assuming every end user has the same needs). E.g. a German address is written "Roadname 123", while a U.S. address is "123 Roadname". Do the same for zip codes, combining them with the city name. You could save the addresses as address_line_1 (street name and house number, in the country-specific order the user enters) and address_line_2 (postal code, city name, ...).
If you still need to search your database for specific postal codes, you might write a regular expression or even a function for that. Considering city names, you could wipe them out of address_line_2, and with high probability you end up with the postal code.
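A sketch of that city-wiping idea, assuming the city name is known from another field (the function name and the trailing-punctuation cleanup are my own choices):

```python
import re

# Remove a known city name from address_line_2; with luck, what
# remains is the postal code. Case-insensitive, and stray commas or
# spaces around the result are stripped.
def guess_postal_code(address_line_2: str, city: str) -> str:
    remainder = re.sub(re.escape(city), "", address_line_2,
                       flags=re.IGNORECASE)
    return remainder.strip(" ,")
```

This is heuristic, as the answer says: it works "with high probability", not always (e.g. if the city name also appears inside the street name).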
I think writing validations for each country would be tremendous work - that's 200 countries... How can you be sure that you didn't miss some local convention? You could instead write a function eq that, for instance, evaluates eq("ABCDE-34", "ABCDE.34") == true.
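A minimal version of such an eq, comparing codes after normalising case and stripping punctuation and whitespace (the exact normalisation is an assumption; real postal conventions may need more care):

```python
import re

# Loose equality for postal-style codes: uppercase both sides and
# drop any non-alphanumeric characters before comparing.
def eq(a: str, b: str) -> bool:
    canon = lambda s: re.sub(r"[\W_]+", "", s).upper()
    return canon(a) == canon(b)
```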
Though I don't really see the point in writing both client-side and server-side validations. Even if the client is a web browser, you can use the server's validations via AJAX.
In the end it depends on the DBMS you are using (support for Java stored procedures?), your client-side language... and also on how the data is entered (is it entered very inaccurately? in a web browser?) and what you want to do with it. (Are you planning to feed Skype with phone numbers from your database, or are these read by humans who type them into their phone?) Do you need to do some specific join operations? And of course it depends on how many man-hours you are able to spend solving the problem...