Best practices for accessing schema-less data?
I am toying with RDF, and in particular with how to access information stored in an RDF store. The huge difference from a traditional relational database is the lack of a predefined schema: in a relational database, you know that a table has certain columns, and you can technically map each row to an instance of a class. The class has well-defined methods and well-defined attributes.
In a schema-less system, you don't know what data is associated with a given resource. It's like having a database table with an arbitrary, not predefined number of columns, where every row can have data in any number of these columns.
Similar to object-relational mappers, there are object-RDF mappers. RDFAlchemy and SuRF are the two I am playing with right now. Basically, they give you a Resource object whose methods and attributes are provided dynamically. It kind of makes sense... however, it's not that easy. In many cases, you prefer to have a well-defined interface and more control over what happens when you set and get data on your model object. Having such generic access makes things difficult, in some sense.
Another thing I noted (and the most important) is that, even if schema-less data are in general expected to provide arbitrary information about a resource, in practice you more or less know the "classes of information" that tend to go together. Of course, you cannot exclude the presence of additional info, but in some cases this is the exception rather than the norm, although the exceptions are significant enough to break a strict schema. In an RDF representation of an article (e.g. as in RSS/ATOM feeds) you know the terms of your described resources, and you can map them to a well-defined object. If additional information is provided, you can define an extended object (inheriting from the base one) that provides accessors to the enhanced information. So, in a sense, you deal with schema-less data by means of "schema-oriented objects" that you can extend when you want to see specific additional information you are interested in.
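The "schema-oriented objects" idea above can be sketched in plain Python without any RDF library. Everything here is hypothetical: the `Resource`, `Article`, and `ReviewedArticle` classes, the `dc:`/`ex:` predicate names, and the dict standing in for the store. It is not how RDFAlchemy or SuRF work internally, only an illustration of a typed view layered on top of a generic property bag.

```python
# Hypothetical sketch: the "store" is a dict mapping a subject to its
# predicate/value lists, standing in for a real RDF store.
STORE = {
    "urn:article:1": {
        "dc:title": ["Schema-less data"],
        "dc:creator": ["Alice"],
        "ex:reviewer": ["Bob"],  # unexpected extra information
    }
}


class Resource:
    """Generic access: every predicate is reachable, none is special."""

    def __init__(self, subject, store=STORE):
        self.subject = subject
        self._props = store.setdefault(subject, {})

    def values(self, predicate):
        return list(self._props.get(predicate, []))

    def predicates(self):
        return sorted(self._props)


class Article(Resource):
    """Well-defined view over the terms we expect an article to have."""

    @property
    def title(self):
        found = self.values("dc:title")
        return found[0] if found else None

    @property
    def creators(self):
        return self.values("dc:creator")


class ReviewedArticle(Article):
    """Extension that adds accessors for one kind of additional data."""

    @property
    def reviewers(self):
        return self.values("ex:reviewer")


article = ReviewedArticle("urn:article:1")
print(article.title, article.creators, article.reviewers)
print(article.predicates())  # the generic view is still available
```

The base class keeps the generic access path open, so data that no subclass knows about is still reachable through `values()` and `predicates()`.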
My question is about your experience with real-world usage of schema-less data storage. How does it map to the object-oriented world so that you can use it proficiently without going too near the "bare metal" of the schema-less storage? (In relational DB terms: without using too much SQL and directly messing with the table structure.)
Is access doomed to be very generic (e.g. SuRF's "plugged-in attributes" being the highest, most specialized level at which you can access your data), or is having specialized classes for specific, agreed-upon schemas also a good approach, despite the risk of a proliferation of classes to access new and unexpected associated data?
Comments (4)
I guess my short answer would be "don't". I'm a bit of a greybeard, and have done a lot of mapping XML data into relational databases. If you do decide to use such a database, you're going to have to validate your data constantly. You'll also need very strict discipline in order to avoid having databases with little commonality. Using a schema helps here, as most XML schemas are object-oriented and thus extensible, easing the need for analysis to keep from creating similar data with dissimilar names, which will cause anyone who has to access your database to think evil thoughts about you.
In my personal experience, if you're doing the sorts of things where a networked database makes sense, go for it. If not, you lose all the other things relational databases can do, like integrity checking, transactions and set selecting. However, since most people use a relational database as an object store anyway, I guess the point is moot.
As for how to access that data, just put it in a Hashtable. Seriously. If there is no schema anywhere, then you'll never know what is in there. If you have a schema, you can use that to generate accessor objects, but you gain little, as you lose all the flexibility of the underlying store while simultaneously gaining the inflexibility of a DAO (Data Access Object).
For instance, if you have a Hashtable, getting the values out of an XML parser is often fairly easy. You define the storage types you're going to use, then you walk the XML tree and put the values in the storage types, storing them in either a Hashtable or a List as appropriate. If, however, you use a DAO, you end up not being able to trivially extend the data object, which is one of the strengths of XML, and you have to create getters and setters on the object for each value.
And, of course, you have to do that for every single value in that schema layer, including loaders and definitions for sublayers. You also end up with a much bigger mess if you use the faster parsers that employ callbacks, as you now have to track which object you're in as you produce the resultant tree.
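The Hashtable approach described above can be sketched in Python, with a dict and list playing the roles of Hashtable and List. This is only an illustration using the standard library's `xml.etree.ElementTree`; the `element_to_data` helper and the sample document are made up for the example, and the sketch ignores text mixed in between child elements.

```python
import xml.etree.ElementTree as ET


def element_to_data(element):
    """Turn an element into a dict; leaf elements become their text."""
    children = list(element)
    if not children and not element.attrib:
        return (element.text or "").strip()
    data = dict(element.attrib)
    for child in children:
        value = element_to_data(child)
        if child.tag in data:
            # A repeated tag becomes a list instead of overwriting.
            existing = data[child.tag]
            if not isinstance(existing, list):
                data[child.tag] = existing = [existing]
            existing.append(value)
        else:
            data[child.tag] = value
    return data


SAMPLE = """
<article id="42">
  <title>Schema-less data</title>
  <author>Alice</author>
  <author>Bob</author>
  <rating>5</rating>
</article>
"""

record = element_to_data(ET.fromstring(SAMPLE))
print(record)
```

A new element in the input simply shows up as a new key, with no getter or setter to write, which is the flexibility a generated DAO gives up.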
I've done all this, although I normally construct a validator, then an adapter that provides the match between the XML and the data class, then a reconcile process to reconcile it to the database. Almost all the code ends up being generated, though. If you have the DTD, you can generate most of the Java code to access it, and do so with reasonable performance.
In the end, I'd simply keep freeform, networked or hierarchical data as freeform, networked or hierarchical data.
I would say the best practice for a schema-less XML file is to create a schema for it!
Having no schema is not particularly nice. It means you cannot validate the file in any way, other than to detect if it is well-formed XML or not.
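To illustrate the point, here is a minimal sketch of the only check available without a schema, using Python's standard `xml.etree.ElementTree`. The `is_well_formed` helper and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; says nothing about meaning."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Parses fine, even though "forty" is probably not a sensible age value.
print(is_well_formed("<person><age>forty</age></person>"))
# Rejected only because the tags do not match.
print(is_well_formed("<person><age>40</person>"))
```

Catching the first case would need a schema (or hand-written checks) that states what an `age` element may contain.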
Having no semantics to the file whatsoever seems fishy. Because that would mean that you do not know what you should, did, or will put into it. If that is the case, it sounds suspiciously like a solution in search of a problem.
If you have no schema because you do not yet know a schema language, take a look at DTD. It is very simple. You can learn and master it in about an hour or two, if you have a validation utility or validating parser in your application.
If the issue that is preventing you from having a schema is that your schema rules do not seem to fit schema definition file types you have looked at so far, fear not.
While DTD and even XSD (XML Schema) files are somewhat inflexible, there are other more flexible schema file types. They are much simpler than XSD too, trust me.
Take a look at the RNC (RELAX NG, compact) schema file spec. The RNC files are very easy for humans to read and write. There are some XML editors out there that understand them. There are utilities that will convert back and forth between RELAX NG format (RNG or RNC) and other formats like DTD and XSD.
Last time I checked, the XHTML TR included a non-normative RNC file to help in validating it, not to mention documenting it unambiguously. RELAX NG has the flexibility to do that, and you can actually read it without being part of the Borg collective. In this case Borg is not a euphemism for Microsoft.
If you need something even more flexible than RELAX NG, take a glance at Schematron. It is a very nice rule-based schema validation language. It is not very complex. Like these other schema languages, it too has been around a long time, is mature, and is a recognized standard.
Even some senior engineers at Microsoft had grave misgivings about XSD. Its complexity is high, it turns out to be unable to express certain not-so-odd data arrangements, it is very verbose, it mixes concerns such as validation and default values, and so on. Whatever you are doing, XSD does not sound well suited to directly supporting it.
RDF mappers, like XSD binding tools, are well suited to persisting objects, given their classes in some supported programming language like Java (e.g. with JAXB). It is not clear that you have classes you want to persist in the first place, though.
There are some semantic web technologies out there like OWL and RDF which are flexible, and very dynamic.
One tool you might want to look at is Stanford's Protege. It is quite powerful and very flexible. It is basically a semantic web IDE and framework. The latter is written in Java, as is the tool. However, the semantic web schema and data files Protege creates and edits could be used by programs written in any language. There is no bias towards Java in such files.
Also, you can find lots of semantic web schemas by using Swoogle. There might be a schema already that fits whatever your application is.
Basically, coming up with a schema file in one of these many schema validation languages is not very hard once you know what you want to put in your XML data file. If you have no idea then it is unlikely a program or a person is going to know what to do with it when they read it. If that is the case, XML might not be the best storage representation. I am not sure anything would be.
Instead, you might simply want to do whatever you are doing in a general-purpose, dynamically typed scripting language like Python or Ruby. LISP could also be used, if you want your programs not only to have unlimited data formats but to be able to modify themselves as well.
Another option for schema-less data storage is a logic programming language. These usually do not have any schema. They have an ontology instead.
Two programming languages I have worked with a lot that use ontologies are CLIPS and Prolog. Free, open source, cross-platform implementations of both are available.
Take a look at SWI-Prolog: fast, simple, and powerful. You can define facts in it, and rules which basically synthesize the relevant facts when necessary. You pull the data out with queries. Prolog was actually an inspiration for RDF when it was created, back in the 1990s, as I recall. The original RDF documentation used to make frequent references to Prolog. If you want to "discover" or "analyze" or "find" things about facts in your ontology, Prolog is a very good language for writing such applications. It is also handy for natural language parsing.
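The facts, rules, and queries workflow can be imitated in a few lines of Python to give a feel for it. This is not Prolog and not any library's API: the `FACTS` data, the `query` function, and the grandparent rule are all made up for illustration, and the rule is applied eagerly here, whereas Prolog derives answers on demand while resolving a query.

```python
# Facts: (predicate, subject, object) tuples, similar in spirit to
# Prolog facts such as parent(alice, bob).
FACTS = {
    ("parent", "alice", "bob"),
    ("parent", "bob", "carol"),
    ("parent", "bob", "dave"),
}


def query(facts, predicate, subject=None, obj=None):
    """Return matching facts; None acts like an unbound variable."""
    return sorted(
        fact for fact in facts
        if fact[0] == predicate
        and (subject is None or fact[1] == subject)
        and (obj is None or fact[2] == obj)
    )


def derive_grandparents(facts):
    """Rule: grandparent(X, Z) holds if parent(X, Y) and parent(Y, Z)."""
    derived = set()
    for _, x, y in query(facts, "parent"):
        for _, _, z in query(facts, "parent", subject=y):
            derived.add(("grandparent", x, z))
    return derived


all_facts = FACTS | derive_grandparents(FACTS)
print(query(all_facts, "grandparent", subject="alice"))
```

Note that nothing declares in advance which predicates exist; new kinds of facts can be added at any time, which is what makes this style a fit for schema-less data.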
CLIPS is nice too, if you are looking to do problem-solving upon the facts in your ontology. It is well-suited towards organizing, troubleshooting, and configuration related applications.
If schemas are not your thing, perhaps ontologies are. If not, maybe you should just use a dynamically typed scripting language, keep your data in complex objects built from maps and lists, and persist them to files using the language's standard persistence mechanisms.
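As a minimal sketch of that last suggestion in Python, the standard `json` module can persist nested dicts and lists. The record contents and the file name are made up for the example, and JSON is only one of several standard options (`pickle` and `shelve` are others).

```python
import json
import os
import tempfile

# Free-form data held in plain maps and lists.
record = {
    "title": "Schema-less data",
    "authors": ["Alice", "Bob"],
    "extra": {"rating": 5},
}

# Write the structure to a file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "record.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(record, handle, indent=2)

# Read it back; the shape of the data is whatever was stored.
with open(path, encoding="utf-8") as handle:
    restored = json.load(handle)

print(restored == record)
```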
I have no experience with a schema-less DB combined with OOP, but I have a year of experience with a schema-less DB and scripting.
In my experience, it can be quite useful. The DB I used was also untyped (all values were arbitrary strings). This brought several advantages, for example that you do not need much documentation of the data.
So in my case, the schema-less DB together with scripting was very useful and a huge success.
When you think of using objects with the schema-less DB, I would try to keep the freedom by storing the objects in a hashtable. This gives you the freedom to access all the key-value pairs, no matter which "object" you selected. It also gives you the freedom to add new key-value pairs as needed.
If your objects (like those in an RSS feed) have a well-defined base, it makes sense to come up with a base object which encapsulates that well-defined base but also has some kind of hash map to preserve your freedom.
As soon as you discover that more and more key-value pairs turn out to be "standard", just update your object model to encapsulate them; your software will evolve into the right data structure. It may even make sense to move some of the data to a traditional RDBMS at a later time.
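A minimal Python sketch of such a base object follows. The `FeedItem` class, its `KNOWN` fields, and the sample keys are hypothetical names chosen for the example, not part of any library.

```python
class FeedItem:
    """Well-defined base (title, link) plus a free-form dict for the rest."""

    KNOWN = ("title", "link")

    def __init__(self, data):
        self.title = data.get("title")
        self.link = data.get("link")
        # Everything not yet part of the model stays freely accessible.
        self.extra = {k: v for k, v in data.items() if k not in self.KNOWN}

    def get(self, key, default=None):
        """Uniform lookup across known attributes and extra keys."""
        if key in self.KNOWN:
            return getattr(self, key)
        return self.extra.get(key, default)


item = FeedItem({"title": "Hello", "link": "http://example.org/1", "geo": "45,9"})
item.extra["mood"] = "curious"  # new key-values can be added freely
print(item.title, item.get("geo"), sorted(item.extra))
```

When a key such as `geo` turns out to be "standard", it can be promoted by adding it to `KNOWN` and giving it an attribute, while callers that go through `get()` keep working unchanged.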
Don't over-engineer: implement features when they are needed.
Use MongoDB or another NoSQL database. Also see the blog post "Why I think Mongo is to Databases what Rails was to Frameworks".