数据反规范化和 C# 对象 DB 序列化

发布于 2024-08-29 01:30:55 字数 1085 浏览 18 评论 0原文

我正在使用具有各种不同实体的数据库表。这意味着我不能在其中包含任意数量的字段来保存各种不同的实体。相反，我想只保存最重要的字段（日期、参考 ID - 各种其他表的外键、最重要的文本字段等）以及一个附加文本字段，我想在其中存储更完整的对象数据。

最明显的解决方案是使用 XML 字符串并存储它们。第二个最明显的选择是 JSON，它通常更短，并且序列化/反序列化可能也更快......并且可能也更快。但真的是这样吗？我的对象也不需要严格可序列化，因为 JsonSerializer 通常能够序列化任何内容。即使是匿名对象，也可以在这里使用。

解决此问题的最佳解决方案是什么？

其他信息

我的数据库是高度标准化的，并且我正在使用实体框架，但为了拥有外部超快速全文搜索功能，我牺牲了一点数据库非规范化。仅供参考，我在 MySql 之上使用 SphinxSE。 Sphinx 将返回行 ID，我将使用这些行 ID 来快速查询索引优化的综合表，以从中获取最重要的数据，这比查询整个数据库中的多个表要快得多。

我的表将包含如下列：

RowID （自动增量）
EntityID （实际实体的 - 但不直接相关，因为这必须指向不同的表）
EntityType（这样我就可以在需要时获取实际实体）
DateAdded（将其添加到此表中时记录时间戳）
Title
Metadata （与特定实体类型相关的序列化数据）

该表将使用 SPHINX 索引器进行索引。当我使用此索引器搜索数据时，我将提供一系列 EntityID 和限制日期。索引器必须返回按 DateAdded（降序）排序的非常有限的分页数量的 RowID。然后，我只需将这些 RowID 连接到我的表中即可获得相关结果。所以这实际上不是全文搜索，而是过滤搜索。通过这种方式获取 RowID 会非常快，并且从表中获取结果比比较 EntityID 和 DateAdded 快得多，尽管它们会被正确索引。

原文

I'm using a DB table with various different entities. This means that I can't have an arbitrary number of fields in it to save all kinds of different entities. I want instead save just the most important fields (dates, reference IDs - kind of foreign key to various other tables, most important text fields etc.) and an additional text field where I'd like to store more complete object data.

the most obvious solution would be to use XML strings and store those. The second most obvious choice would be JSON, that usually shorter and probably also faster to serialize/deserialize... And is probably also faster. But is it really? My objects also wouldn't need to be strictly serializable, because JsonSerializer is usually able to serialize anything. Even anonymous objects, that may as well be used here.

What would be the most optimal solution to solve this problem?

Additional info

My DB is highly normalised and I'm using Entity Framework, but for the purpose of having external super-fast fulltext search functionality I'm sacrificing a bit DB denormalisation. Just for the info I'm using SphinxSE on top of MySql. Sphinx would return row IDs that I would use to fast query my index optimised conglomerate table to get most important data from it much much faster than querying multiple tables all over my DB.

My table would have columns like:

RowID (auto increment)
EntityID (of the actual entity - but not directly related because this would have to point to different tables)
EntityType (so I would be able to get the actual entity if needed)
DateAdded (record timestamp when it's been added into this table)
Title
Metadata (serialized data related to particular entity type)

This table would be indexed with SPHINX indexer. When I would search for data using this indexer I would provide a series of EntityIDs and a limit date. Indexer would have to return a very limited paged amount of RowIDs ordered by DateAdded (descending). I would then just join these RowIDs to my table and get relevant results. So this won't actually be full text search but a filtering search. Getting RowIDs would be very fast this way and getting results back from the table would be much faster than comparing EntityIDs and DateAdded comparisons even though they would be properly indexed.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

若相惜即相离 2024-09-05 01:30:55

将数据保存在 SQL 数据库中且不会导致长期痛苦的唯一方法是实际创建适当的、规范化的索引架构，并在向域对象添加新属性时根据需要扩展该架构。

请不要尝试将对象“序列化”到 SQL 数据库。如果这确实是您想要做的，那么您最好使用对象数据库，例如 db4o 。

更新：

这是我目前根据评论和问题更新理解问题空间的方式：

表结构已经规范化；
全文引擎（Sphinx）用于优化某些搜索；
这里讨论的具体“序列化”数据将用作各个搜索结果的摘要或预览，并不代表完整的对象数据。

我对此的选择（按优先顺序排列）是：

使用 FTS 引擎的功能。
几乎每个 FTS 引擎（包括 Sphinx）都允许将自定义属性存储为每个“文档”的一部分。目前，您说您只存储行 ID，以便您可以加入到该表。如果您根本不加入，而是将此信息保留在全文索引本身中，您的结果会返回得更快。您可以在此处输入的内容有一些相当严格的限制，但如果您可以解决这些限制，那么这是您的最佳选择。
面向文档的数据库。
您说您甚至没有真正使用 Sphinx 的“全文”部分，您只是使用它来优化某些查询。那为什么不去掉中间人呢？您建议 JSON 作为序列化格式； MongoDB（仅引用一个选项）支持BSON 本地。您仍然可以在公共列上创建索引，但与 mysql 不同的是，它实际上理解 BSON 格式，并且能够比 JSON 或关系数据库中的 XML 字符串。如果您无论如何都要进行非规范化，您可以自由选择您想要的任何存储库；选择最适合您的特定要求的数据库。
单表继承。
这是一种常见的设计，为了映射的简单性而牺牲了标准化。就您而言，整个目标是非规范化，因此这是一笔不错的交易。如果有数百列，这不是一个好的选择，但对于 10 或 20 列，这会很好，它将您的数据保留为“数据”，并且不会对性能产生任何重大影响方式。
XML 列。
这种方法的优点是数据不是不透明的。它实际上在数据库上下文中有意义。如果您必须将此信息存储在 mysql 数据库中 - 大概您希望运行一些临时查询 - 那么您不妨以 mysql 可以的格式存储它其实明白。另一方面，如果您 100% 确信您永远不需要“反序列化”此数据直到它到达您的应用程序，那么我可能会选择...
< strong>自定义二进制序列化格式。
如果您必须将数据存储在您的mysql数据库中并且您确信您永远不需要为其建立索引甚至从查询中读取其内容，然后就不会在臃肿的文本编码上浪费宝贵的 I/O。与二进制相比，即使是 JSON 也显得臃肿，因为 JSON 必须存储所有属性名称；如果您自己进行序列化，则可以使用一两个字节来确定类型，然后以已知的顺序反序列化剩余的字段/属性。只有数据，没有元数据。
我什至不会在这里使用 .NET BinaryFormatter，我会创建自己的高度优化版本。毕竟，这需要快快快！并且进入表中的每个额外字节都会使查询变慢。您甚至可以使用 GZip 压缩某些数据，具体取决于其中的内容。

除非我还没有完全理解你的要求，否则我不会考虑任何其他选择。

The only means of saving data in a SQL database that will not lead to long-term pain is to actually create a proper, normalized, indexed schema, and extend that schema as necessary when you add new properties to your domain objects.

Please do not attempt to "serialize" objects to a SQL database. If that is really what you want to do, you are better off using an object database such as db4o instead.

Update:

This is how I currently understand the problem space, based on comments and question updates:

The table structure is already normalized;
A fulltext engine (Sphinx) is being used to optimize certain searches;
The specific "serialized" data being discussed here will be used as summaries or previews for individual search results, and does not represent the complete object data.

My choices for this, in order of preference, would be:

Use the FTS engine's features.
Almost every FTS engine, including Sphinx, allows custom attributes to be stored as part of each "document." Presently you say you are only storing the Row ID so you can join to this table. Your results will come back a lot faster if you don't join at all, and instead keep this information inside the fulltext index itself. There are some pretty hard limits to what you can put in here, but if you can work around the limitation, it's your best option.
Document-Oriented Database.
You say you're not even really using the "Full-Text" part of Sphinx, you're just using it to optimize certain queries. Why not cut out the middle man then? You suggested JSON as a serialization format; MongoDB (to cite just one option) supports BSON natively. You can still create indexes on the common columns, but unlike mysql, it actually understands the BSON format, and is able to store that data a lot more efficiently than a JSON or XML string in a relational database. If you're denormalizing anyway, you have the freedom to choose any repository you want; choose the one that's best-optimized for your particular requirement.
Single-table inheritance.
This is a common design that trades off normalization for mapping simplicity. In your case, the entire objective is denormalization, so it's a good trade. This isn't a good option if there will be hundreds of columns, but for 10 or 20, this will be fine, it keeps your data as "data" and shouldn't impact performance in any significant way.
XML columns.
The advantage to this approach is that the data is not opaque. It's actually meaningful in the context of the database. If you must store this information in the mysql database - presumably you expect some ad-hoc queries to be run - then you might as well store it in a format that mysql can actually understand. On the other hand, if you're 100% positive that you'll never need to "deserialize" this data until it hits your application, then I would probably go with...
Custom binary serialization format.
If you must store the data in your mysql database and you know for sure that you'll never need to index it or even read its contents from within a query, then don't waste precious I/O on bloated text encoding. Even JSON is bloated compared to binary, because JSON has to store all the property names; if you do your own serialization, you can use one or two bytes to determine the type and then deserialize the remaining fields/properties in a known order. Just the data, no metadata.
I wouldn't even use the .NET BinaryFormatter here, I would create my own heavily-optimized version. After all, this needs to be fast fast fast! and every extra byte that goes into your table makes the query slower. You might even be able to GZip compress some of the data, depending on exactly what's in there.