The short answer: Storing addresses or any kind of contact information in a database is complex. The Extensible Address Language (xAL) link above has some interesting information that is the closest to a standard/best practice that I've come across...
"xAl is the closest thing to a global standard that popped up. It seems to be quite an overkill though, and I am not sure many people would want to implement it in their database..."
This is not a relevant argument. Implementing addresses is not a trivial task if the system needs to be "comprehensive and consistent" (i.e. worldwide). Implementing such a standard is indeed time-consuming, but it is nevertheless mandatory if you want to meet the stated requirement.
I basically see 2 choices if you want consistency:
1. Data cleansing
2. Basic data table lookups
Ad 1. I work with the SAS System, and SAS Institute offers a tool for data cleansing. It performs some checks and validations on your data and suggests, for example, that "Abram Lincoln Road" and "Abraham Lincoln Road" be merged into the same street. I also think it draws on national databases containing city/postal code matches and so on.
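To make the idea of a merge suggestion concrete, here is a minimal sketch using Python's standard difflib module. It is not how the SAS tool works internally (that is not described here); the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def suggest_merges(street_names, threshold=0.9):
    """Return pairs of distinct names similar enough to review for merging.

    The threshold is an assumed cut-off; a real tool would tune it and
    combine it with reference data such as postal code lookups.
    """
    unique = list(dict.fromkeys(street_names))  # drop exact duplicates, keep order
    return [
        (a, b)
        for a, b in combinations(unique, 2)
        if similarity(a, b) >= threshold
    ]


streets = ["Abram Lincoln Road", "Abraham Lincoln Road", "Main Street"]
for a, b in suggest_merges(streets):
    print(f"Possible duplicate: {a!r} / {b!r}")
```

The output is only a suggestion list; a person (or a stricter rule) still decides whether two entries really are the same street.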
Ad 2. You build up a multiple-choice list (i.e. basic data), and people adding new entries pick from the existing entries in your basic data. In your fact table, you store keys to street names instead of the street names themselves. If you detect a spelling error, you just correct it in your basic data, and all instances are corrected with it through the key relation.
Note that these options don't rule each other out; you can use both approaches at the same time.
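The key relation from option 2 can be sketched with an in-memory SQLite database. The table and column names (street, address, street_id, house_number) are made up for this example and are not taken from any particular system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Basic data: one row per street name.
    CREATE TABLE street (
        street_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL UNIQUE
    );
    -- Fact table: stores the key, not the street name itself.
    CREATE TABLE address (
        address_id   INTEGER PRIMARY KEY,
        house_number TEXT NOT NULL,
        street_id    INTEGER NOT NULL REFERENCES street(street_id)
    );
""")

# Users pick an existing street instead of typing the name again.
conn.execute("INSERT INTO street (street_id, name) VALUES (1, 'Abram Lincoln Road')")
conn.executemany(
    "INSERT INTO address (house_number, street_id) VALUES (?, ?)",
    [("12", 1), ("48", 1)],
)

# The spelling error is corrected once, in the basic data.
conn.execute("UPDATE street SET name = 'Abraham Lincoln Road' WHERE street_id = 1")

rows = conn.execute("""
    SELECT a.house_number, s.name
    FROM address AS a
    JOIN street AS s ON s.street_id = a.street_id
    ORDER BY a.address_id
""").fetchall()
print(rows)
```

Both address rows pick up the corrected name because neither of them ever stored the text.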
The authorities on how addresses are constructed are generally the postal services, so for a start I would examine the data elements used by the postal services for the major markets you operate in.
I've been thinking about this myself as well. Here are my loose thoughts so far, and I'm wondering what other people think.
xAL (and its sister that includes personal names, xNAL) is used by both Google and Yahoo's geocoding services, giving it some weight. But since the same address can be described in xAL in many different ways, some more specific than others, I don't see how xAL itself is an acceptable format for data storage. Some of its field names could be used, but in reality the only basic format that works across the 16 countries my company ships to is the following:
enum address-fields
{
name,
company-name,
street-lines[], // up to 4 free-type street lines
county/sublocality,
city/town/district,
state/province/region/territory,
postal-code,
country
}
That's easy enough to map into a single database table, just allowing for NULLs on most of the columns. And it seems that this is how Amazon and a lot of organizations actually store address data. So the question that remains is how I should model this in an object model that is easily used by programmers and by any GUI code. Do we have a base Address type with subclasses for each type of address, such as AmericanAddress, CanadianAddress, GermanAddress, and so forth? Each of these address types would know how to format itself and, optionally, would know a little bit about validating its fields.
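As a rough illustration of the single-table mapping, here is a sketch using SQLite from Python. The column names, the choice of which columns are NOT NULL, and the use of two-letter country codes are all assumptions for the example, and the sample addresses are fictitious.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE address (
        address_id    INTEGER PRIMARY KEY,
        name          TEXT,
        company_name  TEXT,
        street_line_1 TEXT NOT NULL,
        street_line_2 TEXT,
        street_line_3 TEXT,
        street_line_4 TEXT,
        sublocality   TEXT,           -- county / sublocality
        city          TEXT,           -- city / town / district
        region        TEXT,           -- state / province / region / territory
        postal_code   TEXT,
        country_code  TEXT NOT NULL   -- assumed: two-letter country code
    )
""")

sample_rows = [
    {"name": "Jane Doe", "street_line_1": "123 Example St", "city": "Springfield",
     "region": "IL", "postal_code": "62701", "country_code": "US"},
    # No state/province here, so region is simply left NULL.
    {"name": "Erika Mustermann", "street_line_1": "Musterstrasse 1", "city": "Berlin",
     "region": None, "postal_code": "10115", "country_code": "DE"},
]
conn.executemany(
    """INSERT INTO address (name, street_line_1, city, region, postal_code, country_code)
       VALUES (:name, :street_line_1, :city, :region, :postal_code, :country_code)""",
    sample_rows,
)

no_region = conn.execute("SELECT name FROM address WHERE region IS NULL").fetchall()
print(no_region)
```

The table itself knows nothing about which fields a country uses; that knowledge has to live in the object model or in per-country metadata.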
They could also return some type of metadata about each of the fields, such as the following pseudocode data structure:
structure address-field-metadata
{
field-number, // corresponds to the enumeration above
field-index, // the order in which the field is usually displayed
field-name, // a "localized" name; US == "State", CA == "Province", etc
is-applicable, // whether or not the field is even looked at / valid
is-required, // whether or not the field is required
validation-regex, // an optional regex to apply against the field
allowed-values[] // an optional array of specific values the field can be set to
}
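One possible translation of this pseudocode into a concrete language, here Python dataclasses. The member names follow the pseudocode with underscores instead of hyphens; the enum values, the truncated state/province lists, and the ZIP pattern are illustrative assumptions, not a complete rule set.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class AddressField(IntEnum):
    NAME = 0
    COMPANY_NAME = 1
    STREET_LINES = 2
    SUBLOCALITY = 3
    CITY = 4
    REGION = 5
    POSTAL_CODE = 6
    COUNTRY = 7


@dataclass(frozen=True)
class AddressFieldMetadata:
    field_number: AddressField              # which field this describes
    field_index: int                        # usual display order
    field_name: str                         # localized label
    is_applicable: bool = True              # False: field is hidden and ignored
    is_required: bool = False
    validation_regex: Optional[str] = None  # matched against the whole value
    allowed_values: Tuple[str, ...] = ()    # empty tuple: any value allowed


# Same field, different label and value list per country (lists truncated).
US_REGION = AddressFieldMetadata(
    AddressField.REGION, 5, "State", is_required=True,
    allowed_values=("CA", "IL", "NY", "TX"),
)
CA_REGION = AddressFieldMetadata(
    AddressField.REGION, 5, "Province", is_required=True,
    allowed_values=("BC", "ON", "QC"),
)
US_POSTAL = AddressFieldMetadata(
    AddressField.POSTAL_CODE, 6, "ZIP code", is_required=True,
    validation_regex=r"\d{5}(-\d{4})?",  # 5 digits, optional +4 suffix
)

print(US_REGION.field_name, CA_REGION.field_name)
```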
In fact, instead of having individual address objects for each country, we could take the slightly less object-oriented approach of having an Address object that eschews .NET properties and uses an AddressStrategy to determine formatting and validation rules.
When setting a field, that Address object would invoke the appropriate method on its internal AddressStrategy object.
The reason for using a SetField() method approach rather than properties with getters and setters is so that it is easier for code to actually set these fields in a generic way without resorting to reflection or switch statements.
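A minimal sketch of that arrangement, written in Python rather than .NET. Only the names Address, AddressStrategy and SetField come from the text (set_field here); USAddressStrategy, the validate() hook and the specific rules are assumptions made for the example.

```python
import re
from enum import Enum


class AddressField(Enum):
    CITY = "city"
    REGION = "region"
    POSTAL_CODE = "postal_code"


class AddressStrategy:
    """Country-specific rules; subclasses override what they need."""

    def validate(self, field, value):
        return value.strip()


class USAddressStrategy(AddressStrategy):
    def validate(self, field, value):
        value = super().validate(field, value)
        if field is AddressField.REGION:
            value = value.upper()
        if field is AddressField.POSTAL_CODE and not re.fullmatch(r"\d{5}(-\d{4})?", value):
            raise ValueError(f"invalid ZIP code: {value!r}")
        return value


class Address:
    def __init__(self, strategy):
        self._strategy = strategy
        self._fields = {}

    def set_field(self, field, value):
        # Every field goes through the strategy before it is stored.
        self._fields[field] = self._strategy.validate(field, value)

    def get_field(self, field):
        return self._fields.get(field)


# Generic code can set any field in a loop: no reflection, no switch.
address = Address(USAddressStrategy())
form_input = {
    AddressField.CITY: " Springfield ",
    AddressField.REGION: "il",
    AddressField.POSTAL_CODE: "62701",
}
for field, value in form_input.items():
    address.set_field(field, value)

print(address.get_field(AddressField.REGION))
```

Because the field is passed as a value, the same loop works for every country; only the strategy object changes.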
You can imagine the process going something like this:
1. GUI code calls a factory method or some such to create an address based on a country. (The country dropdown, then, is the first thing that the customer selects, or has a good guess pre-selected for them based on culture info or IP address.)
2. GUI calls address.GetMetadata() or a similar method and receives a list of the AddressFieldMetadata structures as described above. It can use this metadata to determine what fields to display (ignoring those with is-applicable set to false), what to label those fields (using the field-name member), display those fields in a particular order, and perform cursory, presentation-level validation on that data (using the is-required, validation-regex, and allowed-values members).
3. GUI calls the address.SetField() method using the field-number (which corresponds to the enumeration above) and its given values. The Address object or its strategy can then perform some advanced address validation on those fields, invoke address cleaners, etc.
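The presentation-level part of this flow (deciding which fields to show, in which order, and running the cursory checks) could look roughly like the sketch below. The metadata is represented as plain dictionaries whose keys mirror the pseudocode members; the helper names and the sample US rules are assumptions for illustration only.

```python
import re


def meta(number, index, name, applicable=True, required=False, regex=None, allowed=()):
    """Build one metadata record (keys mirror the pseudocode members)."""
    return {
        "field_number": number, "field_index": index, "field_name": name,
        "is_applicable": applicable, "is_required": required,
        "validation_regex": regex, "allowed_values": allowed,
    }


# Hypothetical result of address.GetMetadata() for a US address.
US_METADATA = [
    meta(6, 3, "ZIP code", required=True, regex=r"\d{5}(-\d{4})?"),
    meta(3, 0, "County", applicable=False),
    meta(4, 1, "City", required=True),
    meta(5, 2, "State", required=True, allowed=("CA", "IL", "NY", "TX")),
]


def visible_fields(metadata):
    """Applicable fields only, sorted into display order."""
    return sorted((m for m in metadata if m["is_applicable"]),
                  key=lambda m: m["field_index"])


def presentation_errors(metadata, values):
    """Cursory checks: required, allowed values, then regex."""
    errors = {}
    for m in visible_fields(metadata):
        value = (values.get(m["field_number"]) or "").strip()
        if not value:
            if m["is_required"]:
                errors[m["field_name"]] = "required"
        elif m["allowed_values"] and value not in m["allowed_values"]:
            errors[m["field_name"]] = "not an allowed value"
        elif m["validation_regex"] and not re.fullmatch(m["validation_regex"], value):
            errors[m["field_name"]] = "bad format"
    return errors


labels = [m["field_name"] for m in visible_fields(US_METADATA)]
errors = presentation_errors(US_METADATA, {4: "Springfield", 5: "ZZ", 6: "6270"})
print(labels, errors)
```

Anything that passes these checks would then go to SetField(), where the deeper validation and cleansing can happen.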
There could be slight variations on the above if we want to make the Address object itself behave like an immutable object once it is created. (Which I will probably try to do, since the Address object is really more like a data structure, and probably will never have any true behavior associated with itself.)
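For the immutable variation, one way to sketch it in Python is a frozen dataclass, where a "change" produces a new object instead of modifying the existing one. The field set is a trimmed-down assumption and the sample values are fictitious.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Address:
    country: str
    street_lines: Tuple[str, ...]
    city: Optional[str] = None
    region: Optional[str] = None
    postal_code: Optional[str] = None


original = Address(
    country="US",
    street_lines=("123 Example St",),
    city="Springfield",
    region="IL",
    postal_code="62701",
)

# replace() builds a new Address; the original is left untouched.
moved = replace(original, city="Chicago", postal_code="60601")

try:
    original.city = "Chicago"  # direct assignment is rejected
except FrozenInstanceError:
    print("Address is immutable")
```

With this shape, validation would run once while the object is being built, and everything downstream can treat an Address as a plain value.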
Does any of this make sense? Am I straying too far off the OOP path? To me, this represents a pretty sensible compromise between being so abstract that implementation is nigh-impossible (xAL) and being strictly US-biased.
Update 2 years later: I eventually ended up with a system similar to this and wrote about it at my defunct blog.
I feel like this solution is the right balance between legacy data and relational data storage, at least for the e-commerce world.
Normalize your database schema and you'll have the perfect structure for correct consistency. Here is why:
http://weblogs.sqlteam.com/mladenp/archive/2008/09/17/Normalization-for-databases-is-like-Dependency-Injection-for-code.aspx
I asked something quite similar earlier: "Dynamic contact information data/design pattern: Is this in any way feasible?"
In the UK there is a product called PAF (Postcode Address File) from Royal Mail.
This gives you a unique key per address, though there are hoops to jump through.
In the US, I'd suggest choosing a National Change of Address (NCOA) vendor and modeling the DB after what they return.
See the website of the Universal Postal Union for very specific and detailed information on international postal address formats: http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml
I'd use an Address table, as you've suggested, and I'd base it on the data tracked by xAL.