数据反规范化和 C# 对象 DB 序列化
我正在使用具有各种不同实体的数据库表。这意味着我不能在其中包含任意数量的字段来保存各种不同的实体。相反,我想只保存最重要的字段(日期、参考 ID - 各种其他表的外键、最重要的文本字段等)以及一个附加文本字段,我想在其中存储更完整的对象数据。
最明显的解决方案是使用 XML 字符串并存储它们。第二个最明显的选择是 JSON,它通常更短,并且序列化/反序列化可能也更快......并且可能也更快。但真的是这样吗?我的对象也不需要严格可序列化,因为 JsonSerializer 通常能够序列化任何内容。即使是匿名对象,也可以在这里使用。
解决此问题的最佳解决方案是什么?
其他信息
我的数据库是高度标准化的,并且我正在使用实体框架,但为了拥有外部超快速全文搜索功能,我牺牲了一点数据库非规范化。仅供参考,我在 MySql 之上使用 SphinxSE。 Sphinx 将返回行 ID,我将使用这些行 ID 来快速查询索引优化的综合表,以从中获取最重要的数据,这比查询整个数据库中的多个表要快得多。
我的表将包含如下列:
RowID
(自动增量)EntityID
(实际实体的 - 但不直接相关,因为这必须指向不同的表)EntityType
(这样我就可以在需要时获取实际实体)DateAdded
(将其添加到此表中时记录时间戳)Title
Metadata
(与特定实体类型相关的序列化数据)
该表将使用 SPHINX 索引器进行索引。当我使用此索引器搜索数据时,我将提供一系列 EntityID 和限制日期。索引器必须返回按 DateAdded
(降序)排序的非常有限的分页数量的 RowID
。然后,我只需将这些 RowID
连接到我的表中即可获得相关结果。所以这实际上不是全文搜索,而是过滤搜索。通过这种方式获取 RowID 会非常快,并且从表中获取结果比比较 EntityID 和 DateAdded 快得多,尽管它们会被正确索引。
I'm using a DB table with various different entities. This means that I can't have an arbitrary number of fields in it to save all kinds of different entities. I want instead save just the most important fields (dates, reference IDs - kind of foreign key to various other tables, most important text fields etc.) and an additional text field where I'd like to store more complete object data.
the most obvious solution would be to use XML
strings and store those. The second most obvious choice would be JSON
, that usually shorter and probably also faster to serialize/deserialize... And is probably also faster. But is it really? My objects also wouldn't need to be strictly serializable, because JsonSerializer is usually able to serialize anything. Even anonymous objects, that may as well be used here.
What would be the most optimal solution to solve this problem?
Additional info
My DB is highly normalised and I'm using Entity Framework, but for the purpose of having external super-fast fulltext search functionality I'm sacrificing a bit DB denormalisation. Just for the info I'm using SphinxSE on top of MySql. Sphinx would return row IDs that I would use to fast query my index optimised conglomerate table to get most important data from it much much faster than querying multiple tables all over my DB.
My table would have columns like:
RowID
(auto increment)EntityID
(of the actual entity - but not directly related because this would have to point to different tables)EntityType
(so I would be able to get the actual entity if needed)DateAdded
(record timestamp when it's been added into this table)Title
Metadata
(serialized data related to particular entity type)
This table would be indexed with SPHINX indexer. When I would search for data using this indexer I would provide a series of EntityIDs
and a limit date. Indexer would have to return a very limited paged amount of RowIDs
ordered by DateAdded
(descending). I would then just join these RowIDs
to my table and get relevant results. So this won't actually be full text search but a filtering search. Getting RowIDs
would be very fast this way and getting results back from the table would be much faster than comparing EntityIDs
and DateAdded
comparisons even though they would be properly indexed.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
将数据保存在 SQL 数据库中且不会导致长期痛苦的唯一方法是实际创建适当的、规范化的索引架构,并在向域对象添加新属性时根据需要扩展该架构。
请不要尝试将对象“序列化”到 SQL 数据库。如果这确实是您想要做的,那么您最好使用对象数据库,例如 db4o 。
更新:
这是我目前根据评论和问题更新理解问题空间的方式:
我对此的选择(按优先顺序排列)是:
使用 FTS 引擎的功能。
几乎每个 FTS 引擎(包括 Sphinx)都允许将自定义属性存储为每个“文档”的一部分。目前,您说您只存储行 ID,以便您可以加入到该表。如果您根本不加入,而是将此信息保留在全文索引本身中,您的结果会返回得更快。您可以在此处输入的内容有一些相当严格的限制,但如果您可以解决这些限制,那么这是您的最佳选择。
面向文档的数据库。
您说您甚至没有真正使用 Sphinx 的“全文”部分,您只是使用它来优化某些查询。那为什么不去掉中间人呢?您建议 JSON 作为序列化格式; MongoDB(仅引用一个选项)支持BSON 本地。您仍然可以在公共列上创建索引,但与 mysql 不同的是,它实际上理解 BSON 格式,并且能够比 JSON 或关系数据库中的 XML 字符串。如果您无论如何都要进行非规范化,您可以自由选择您想要的任何存储库;选择最适合您的特定要求的数据库。
单表继承。
这是一种常见的设计,为了映射的简单性而牺牲了标准化。就您而言,整个目标是非规范化,因此这是一笔不错的交易。如果有数百列,这不是一个好的选择,但对于 10 或 20 列,这会很好,它将您的数据保留为“数据”,并且不会对性能产生任何重大影响方式。
XML 列。
这种方法的优点是数据不是不透明的。它实际上在数据库上下文中有意义。如果您必须将此信息存储在 mysql 数据库中 - 大概您希望运行一些临时查询 - 那么您不妨以 mysql 可以的格式存储它其实明白。另一方面,如果您 100% 确信您永远不需要“反序列化”此数据直到它到达您的应用程序,那么我可能会选择...
< strong>自定义二进制序列化格式。
如果您必须将数据存储在您的mysql数据库中并且您确信您永远不需要为其建立索引甚至从查询中读取其内容,然后就不会在臃肿的文本编码上浪费宝贵的 I/O。与二进制相比,即使是 JSON 也显得臃肿,因为 JSON 必须存储所有属性名称;如果您自己进行序列化,则可以使用一两个字节来确定类型,然后以已知的顺序反序列化剩余的字段/属性。只有数据,没有元数据。
我什至不会在这里使用 .NET
BinaryFormatter
,我会创建自己的高度优化版本。毕竟,这需要快快快!并且进入表中的每个额外字节都会使查询变慢。您甚至可以使用 GZip 压缩某些数据,具体取决于其中的内容。除非我还没有完全理解你的要求,否则我不会考虑任何其他选择。
The only means of saving data in a SQL database that will not lead to long-term pain is to actually create a proper, normalized, indexed schema, and extend that schema as necessary when you add new properties to your domain objects.
Please do not attempt to "serialize" objects to a SQL database. If that is really what you want to do, you are better off using an object database such as db4o instead.
Update:
This is how I currently understand the problem space, based on comments and question updates:
My choices for this, in order of preference, would be:
Use the FTS engine's features.
Almost every FTS engine, including Sphinx, allows custom attributes to be stored as part of each "document." Presently you say you are only storing the Row ID so you can join to this table. Your results will come back a lot faster if you don't join at all, and instead keep this information inside the fulltext index itself. There are some pretty hard limits to what you can put in here, but if you can work around the limitation, it's your best option.
Document-Oriented Database.
You say you're not even really using the "Full-Text" part of Sphinx, you're just using it to optimize certain queries. Why not cut out the middle man then? You suggested JSON as a serialization format; MongoDB (to cite just one option) supports BSON natively. You can still create indexes on the common columns, but unlike mysql, it actually understands the BSON format, and is able to store that data a lot more efficiently than a JSON or XML string in a relational database. If you're denormalizing anyway, you have the freedom to choose any repository you want; choose the one that's best-optimized for your particular requirement.
Single-table inheritance.
This is a common design that trades off normalization for mapping simplicity. In your case, the entire objective is denormalization, so it's a good trade. This isn't a good option if there will be hundreds of columns, but for 10 or 20, this will be fine, it keeps your data as "data" and shouldn't impact performance in any significant way.
XML columns.
The advantage to this approach is that the data is not opaque. It's actually meaningful in the context of the database. If you must store this information in the mysql database - presumably you expect some ad-hoc queries to be run - then you might as well store it in a format that mysql can actually understand. On the other hand, if you're 100% positive that you'll never need to "deserialize" this data until it hits your application, then I would probably go with...
Custom binary serialization format.
If you must store the data in your mysql database and you know for sure that you'll never need to index it or even read its contents from within a query, then don't waste precious I/O on bloated text encoding. Even JSON is bloated compared to binary, because JSON has to store all the property names; if you do your own serialization, you can use one or two bytes to determine the type and then deserialize the remaining fields/properties in a known order. Just the data, no metadata.
I wouldn't even use the .NET
BinaryFormatter
here, I would create my own heavily-optimized version. After all, this needs to be fast fast fast! and every extra byte that goes into your table makes the query slower. You might even be able to GZip compress some of the data, depending on exactly what's in there.Unless I still haven't fully understand your requirements, I wouldn't even consider any other options.
不要这样做。这是一个坏主意。
如果您确实必须这样做,我会使用 XML。例如,只要字段是 XML 类型,SQL Server 就允许您针对 XMl 进行查询。
从某种意义上说,您正在做面向对象数据库正在做的事情。它们已经失宠,因为在大多数情况下,ORM 工具(例如 Hibernate 和 Microsoft 的实体框架)允许您充分利用 OO 和关系世界。
Don't do it. It is a bad idea.
If you really must do it, I would use XML. SQL Server, for example, lets you query against the XMl as long as the field is an XML type.
In a sense, you are doing what object-oriented databases were doing. They have fallen out of favor because, for the most part, ORM tools, like Hibernate and Microsoft's Entity Framework, allow you to have the best of both the OO and relational worlds.
您看过 NoSql 数据库吗?
http://nosql-database.org/
否则,我不得不说你的数据模型可能会导致让你在未来遇到困难...
Have you looked at NoSql databases?
http://nosql-database.org/
Otherwise, I'd have to say that your datamodel could lead you into difficulties in the future...
实际上,我一直在 RDBMS 中使用文本 blob。当用于正确的目的时,它可以对性能产生积极的影响,并节省许多表的存在和维护以及开发时间。当您需要存储有关行的频繁更改的非关系元数据时,它是理想的选择。
尽管我只会考虑对 KVO 对象使用文本 blob(即非实体 - 仅对其持久化的行有意义的对象)。如果您需要用它进行任何服务器端操作(即查询等),也不必费心。
对于那些感兴趣的人,我开发了一个快速、有弹性的类型序列化器,非常适合以紧凑、人类可读的文本格式存储文本 blob。
I actually make use of text blobs in my RDBMS all the time. When used for the right purpose it can have a positive impact on performance and saves the existence and maintenance of many tables and development time. Its ideal when you need to store frequently changing, non-relational metadata about a row.
Though I would only consider using text blobs for KVO objects (i.e. non-entity - objects that is only meaningful with the row that its persisted with). Also don't bother if you need to do any server-side manipulation with it, i.e. querying etc.
For those that are interested I've developed a fast, resilient Type Serializer that is ideal for storing text-blobs in a compact, human-readable text-format.