C# database-design database full-text-search

Database search on a unique column: one long string, or many ints or longs?

Posted 2024-10-01 09:43:41

I'm not sure which database I'll be using (most likely SQL Server Express), so I don't know whether that choice makes much of a difference.

Basically I am looking to store my object in a database so I can search for a unique object.

public class FooBar
{
    // 5x4 grid of cells
    public GridItem[,] Items { get; set; }
}

public enum GridItem
{
    a = 0,
    b,
    c
}

At first I represented each GridItem as a 2-character binary code (A = 00, B = 01, C = 10), which gave me a 40-character string for the 20 cells; building that string from the array didn't seem to slow my application down much. I can search the database for a matching string, but it got me thinking: would it be more efficient to leave each GridItem as an Int32 (or Int64) and search the database to see whether ALL columns (GItem00, GItem01, ... GItem54) match their corresponding row/column GridItem? I think Int32 vs. Int64 probably comes down to the processor, so that's not as big a deal. Basically, if speed is my #1 concern (not storage), which is better: building a 40-character string, or storing 20 separate Int32s in the database and searching on those columns?
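For illustration, here is a minimal sketch of the string encoding described above. The `GridKey` class and `ToBinaryString` method are hypothetical names, and the row-by-row cell order is an assumption; any fixed order works as long as it is used consistently for storing and searching.

```csharp
using System;
using System.Text;

public enum GridItem { a = 0, b, c }

// Hypothetical helper, not part of the original post.
public static class GridKey
{
    // Encodes each cell as two binary characters (a = "00", b = "01", c = "10").
    // foreach over a 2-D array visits cells row by row, so a 5x4 grid
    // produces a 40-character string.
    public static string ToBinaryString(GridItem[,] items)
    {
        var sb = new StringBuilder(items.Length * 2);
        foreach (GridItem item in items)
        {
            // Convert.ToString(value, 2) gives "0", "1" or "10"; pad to two characters.
            sb.Append(Convert.ToString((int)item, 2).PadLeft(2, '0'));
        }
        return sb.ToString();
    }
}
```

The resulting string could then be stored in a single indexed column and compared with an equality predicate.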

Or is there something even better, such as serializing the object to binary and somehow searching for a matching blob? I'm not really a database guy, so I have no idea.

诗酒趁年少 2024-10-08 09:43:41

I haven't faced this issue before, but I have some theories about which approach would be faster.

When the system persists the data as 40-byte character strings with an index on them, each index entry only needs to be long enough to distinguish the exact position of its record. For example:

0101101.... => 010(3-byte index)
0111111.... => 011(3-byte index)

By contrast, when the system persists the data as an 8-byte (Int64) integer with an index on it, each index entry should be exactly 8 bytes per record.

In general database theory, the less storage a query has to touch, the better it performs.

If you have so much data that the database needs all of the characters (the full 40 bytes) to index a record, the index entry for some records would be 40 bytes.
The 8-byte integer index, as explained, stays at 8 bytes per record however much the data grows.

This theory has one precondition: the matching rows should make up only a small part of all the rows.

Index maintenance effort is another significant factor: you need (logically) 20 indexes to speed up the 20-Int32 strategy, whereas the 40-character strategy and the single-Int64 strategy each need only one index.
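The single-Int64 strategy relies on the fact that 20 cells at 2 bits each need only 40 bits. A minimal sketch of one way to pack the grid follows; the `GridPacker` class and `Pack` method are hypothetical names, and the row-by-row bit order is an assumption.

```csharp
using System;

public enum GridItem { a = 0, b, c }

// Hypothetical helper, not part of the original answer.
public static class GridPacker
{
    // Packs the grid into one Int64, two bits per cell, row by row.
    // 20 cells * 2 bits = 40 bits, which fits in the 64 available.
    public static long Pack(GridItem[,] items)
    {
        if (items.Length > 32)
            throw new ArgumentException("At most 32 two-bit cells fit in an Int64.", nameof(items));

        long key = 0;
        foreach (GridItem item in items)
        {
            // Shift the cells packed so far and append this cell's two bits.
            key = (key << 2) | ((long)item & 3L);
        }
        return key;
    }
}
```

Two grids with the same contents always produce the same key, so an exact-match search becomes an equality comparison on one 8-byte column.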

Now consider the case where the index isn't used, which means the database executes the query with a full table scan (FTS).

Assume the 40-byte (character) data is persisted as 40 bytes per record. Each 8 KB page in SQL Server can then hold 8 * 1024 / 40 = 204 records (rounded down, ignoring page and row overhead).

For 8-byte (Int64) data at 8 bytes per record, each page in SQL Server can hold 8 * 1024 / 8 = 1024 records.

If you have 20,000 records, the database needs 20000 / 204 ≈ 99 I/Os (rounded up) to perform an FTS on the string data, and 20000 / 1024 ≈ 20 I/Os on the Int64 data.

The fewer I/Os needed, the better the performance.
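The page arithmetic above can be reproduced with a short sketch. The `ScanEstimate` class and its members are hypothetical names, and the model shares the simplification used in the estimate: it treats the whole 8 KB page as available for records and ignores page headers and per-row overhead.

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class ScanEstimate
{
    // 8 KB page size; header and per-row overhead are deliberately ignored.
    public const int PageBytes = 8 * 1024;

    // Whole records that fit on one page (integer division rounds down).
    public static int RecordsPerPage(int bytesPerRecord) => PageBytes / bytesPerRecord;

    // Pages that must be read to scan every record (rounded up).
    public static int PagesForScan(int recordCount, int bytesPerRecord)
    {
        int perPage = RecordsPerPage(bytesPerRecord);
        return (recordCount + perPage - 1) / perPage;
    }
}
```

Under these assumptions, `ScanEstimate.PagesForScan(20000, 40)` returns 99 and `ScanEstimate.PagesForScan(20000, 8)` returns 20, matching the figures above.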
