Handling large numbers of text strings
My project, when running, will collect a large number of string text blocks (about 20K of them; the largest I have seen is about 200K) in a short span of time and store them in a relational database. Each string is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is in C# (VS2008) on .NET 3.5, and the backend DBMS is MS SQL Server 2005.
Performance and storage are both important concerns for the project, but the priority is performance first, then storage. I am looking for answers to these questions:
- Should I compress the text before storing it in the database, or let SQL Server worry about compacting the storage?
- Do you know what the best compression algorithm/library would be for this context, one that gives the best performance? Currently I just use the standard GZip in the .NET Framework.
- Do you know any best practices for dealing with this? I welcome out-of-the-box suggestions as long as they are implementable in the .NET Framework. (It is a big project and this requirement is only a small part of it.)
EDITED: I will keep adding to this to clarify points raised.
- I don't need text indexing or searching on these texts. I just need to be able to retrieve them at a later stage, for display as a text block, using the primary key.
- I have a working solution implemented as above, and SQL Server has no issue at all handling it. This program will run quite often and needs to work with a large data context, so you can imagine the size will grow very rapidly; hence every optimization I can do will help.
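For reference, a minimal sketch of the kind of GZip round-trip currently in use (the helper names and UTF-8 encoding are my assumptions, not a definitive implementation):

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class GZipDemo
{
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray is safe once the GZipStream has been closed and flushed.
            return output.ToArray();
        }
    }

    public static string Decompress(byte[] data)
    {
        using (var input = new GZipStream(new MemoryStream(data), CompressionMode.Decompress))
        using (var reader = new StreamReader(input, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // A typical block: roughly 300 characters of text.
        string block = new string('x', 300);
        byte[] packed = Compress(block);
        // For strings this small, the gzip header/footer overhead (~18 bytes)
        // can eat most of the savings; compare sizes before committing to it.
        Console.WriteLine("original: {0} bytes, compressed: {1} bytes",
            Encoding.UTF8.GetByteCount(block), packed.Length);
        Console.WriteLine(Decompress(packed) == block); // round-trip check
    }
}
```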
7 Answers
The strings are, on average, 300 characters each. That's either 300 or 600 bytes, depending on Unicode settings. Let's say you use a varchar(4000) column and each value uses (on average) 300 bytes. Then you have up to 200,000 of these to store in a database.
That's less than 60 MB of storage. In the land of databases, that is, quite frankly, peanuts. 60 GB of storage is what I'd call a "medium" database.
At this point in time, even thinking about compression is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Barring any system constraints that you haven't mentioned, I would not concern myself with any of this until and unless you actually start to see performance problems - and even then it will likely be the result of something else, like a poor indexing strategy.
And compressing certain kinds of data, especially very small amounts of data (and 300 bytes is definitely small), can actually sometimes yield worse results. You could end up with "compressed" data that is actually larger than the original data. I'm guessing that most of the time, the compressed size will probably be very close to the original size.
SQL Server 2008 can perform page-level compression, which would be a somewhat more useful optimization, but you're on SQL Server 2005. So no, definitely don't bother trying to compress individual values or rows, it's not going to be worth the effort and may actually make things worse.
If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as detailed here: http://msdn.microsoft.com/en-us/library/cc280449.aspx
As an example, you can create a compressed table like this:
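A sketch of such a table (the table and column names are illustrative; PAGE compression requires SQL Server 2008 or later):

```sql
-- PAGE compression also implies ROW compression on the same object.
CREATE TABLE dbo.TextBlocks
(
    Id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Body nvarchar(4000) NOT NULL
)
WITH (DATA_COMPRESSION = PAGE);
```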
If you can't use compression in the database, unfortunately your strings (no more than 300 chars) are not going to be worthwhile to compress using something like System.IO.Compression. I suppose you could try it, though.
Compression will consume resources and will typically hurt performance when significant time is already being spent on local communication and processing.
Not entirely clear on what you are asking.
In regard to performance - if you are compressing the strings in memory before storing them in the database, your program is going to be slower than if you just stuff the data straight into the table and let SQL worry about it later. The trade-off is that the SQL database will be larger, but 1 TB hard drives are cheap, so is storage really that big a deal?
Based on your numbers (200K x 300 bytes) you are only talking about roughly 60 MB. That is not a very large dataset. Have you considered using the Bulk Copy feature in ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all of your data goes into one table, this should be straightforward.
This would be an alternative to having something like EF generate essentially 200K insert statements.
UPDATE
Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
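A minimal SqlBulkCopy sketch of the idea above. The connection string, table name, and column names are hypothetical; this assumes a target table like dbo.TextBlocks with a Body column, and requires a reachable SQL Server to actually run:

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    // Hypothetical connection string -- adjust for your environment.
    const string ConnectionString =
        "Data Source=.;Initial Catalog=MyDb;Integrated Security=True";

    public static void Load(IEnumerable<string> blocks)
    {
        // Stage the strings in a DataTable whose schema mirrors the target table.
        var table = new DataTable();
        table.Columns.Add("Body", typeof(string));
        foreach (string block in blocks)
            table.Rows.Add(block);

        using (var bulk = new SqlBulkCopy(ConnectionString))
        {
            bulk.DestinationTableName = "dbo.TextBlocks";
            bulk.ColumnMappings.Add("Body", "Body");
            bulk.BatchSize = 10000; // send rows in batches, not one giant transaction
            bulk.WriteToServer(table);
        }
    }
}
```

A single WriteToServer call like this avoids the per-row round trip of 200K individual INSERT statements.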
I wouldn't worry about compressing them. For strings this size (300 characters or so), it's going to be more of a headache than it's worth. Compressing strings takes time (no matter how small they are), and SQL Server 2005 has no native way of doing this, which means you would have to write something to do it. If doing it in the application hurts your performance, you could write a CLR routine to do it in the database, but it is still going to be an extra step for your application (or any other that uses the data) to actually use the compressed string.
Space in a database is cheap, so you aren't really saving much by compressing all the strings. Your biggest problem is going to be keeping a large number of strings in your application's memory. If you are routinely going back to the database to load some of them and not trying to cache all of them at the same time, I wouldn't worry about it unless you are actually seeing problems.
Sounds like you would benefit from using Large-Value Data Types. These data types will store up to 2^31-1 bytes of data.
If all of your strings are smallish, there is a diminishing return to be gained by compressing them. Without native SQL compression, they will not be searchable anyway if you compress them.
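As an illustration of the large-value types mentioned above (table and column names are mine, not from the question):

```sql
-- varchar(max) holds up to 2^31-1 bytes per value; small values stay
-- in-row, while large ones overflow to LOB pages automatically.
CREATE TABLE dbo.LargeTextBlocks
(
    Id   int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Body varchar(max) NOT NULL
);
```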
It sounds like you are trying to solve a decidedly non-relational problem with a relational database. Why exactly are you using a database? It can be done, of course, but some problems just don't fit well. TFS shows that you can brute-force a problem into an RDBMS once you throw enough hardware at it, but that doesn't make it a good idea.