Is there any way (in C#) to convert a string into a smaller string, and when I say smaller I mean "with reduced length"?
Let me explain: in my use case a system gives me many strings that can vary in size (number of characters; length), and sometimes they can be really huge! The problem is that I have to save each string in a column of a table in a "SQL Server" database. The bad news is that I am not allowed to do any migration in this database; the good news is that the column already has type nvarchar(max).
I did some research beforehand and followed the post below to write a data compressor using "Gzip" and "Brotli".
https://khalidabuhakmeh.com/compress-strings-with-dotnet-and-csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

var value = "hello world";
var level = CompressionLevel.SmallestSize;
var bytes = Encoding.Unicode.GetBytes(value);
await using var input = new MemoryStream(bytes);
await using var output = new MemoryStream();
// GZipStream and BrotliStream share the same API, so either can be plugged in here
await using (var stream = new GZipStream(output, level))
{
    await input.CopyToAsync(stream);
} // dispose the compressor before reading, so the final block and gzip footer are flushed
var result = output.ToArray();
var resultString = Convert.ToBase64String(result);
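For the reverse direction, a minimal sketch that undoes the pipeline above (Base64 -> bytes -> GZip decompress -> UTF-16 string), assuming the resultString produced by the previous snippet:

var compressedBytes = Convert.FromBase64String(resultString);
await using var compressedInput = new MemoryStream(compressedBytes);
await using var decompressedOutput = new MemoryStream();
await using (var gunzip = new GZipStream(compressedInput, CompressionMode.Decompress))
{
    await gunzip.CopyToAsync(decompressedOutput);
}
// should round-trip back to the original "hello world"
var roundTripped = Encoding.Unicode.GetString(decompressedOutput.ToArray());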
After implementing the conversion methods, I created tests that generate random strings of varying sizes (lengths) to validate the compressors, and at that point I noticed the following: both the "Gzip" and "Brotli" pipelines first convert the string to a byte[] (byte array), then apply compression, which yields a result vector (byte array) of reduced size as expected; but the result (byte[]) is then converted to a base-64 string that, in 100% of my tests, had more characters (greater length) than the initial string.
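In hindsight the arithmetic makes this unavoidable for near-random input: a string of L characters becomes 2L bytes under Encoding.Unicode (UTF-16), random data barely compresses at all, and Base64 then emits 4 characters for every 3 bytes, so the stored string ends up around (4/3) × 2L ≈ 2.67L characters. The compressor would have to shrink the byte stream to under 37.5% of its size just to break even, and even starting from Encoding.UTF8 (1 byte per character for this ASCII range) the break-even point is 75%, which random text still won't reach.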
My random string generator:
// a single Random instance: two instances created back-to-back can get the same
// time-based seed on .NET Framework and produce identical sequences
var rng = new Random();
var wordLength = rng.Next(randomWordParameters.WordMinLength, randomWordParameters.WordMaxLength);
var sb = new StringBuilder(wordLength);
for (int i = 0; i < wordLength; i++)
{
    // pick a code point within the configured bounds (33..127 in my tests)
    int sourceNumber = rng.Next(randomWordParameters.CharLowerBound, randomWordParameters.CharUpperBound);
    sb.Append(Convert.ToChar(sourceNumber));
}
var word = sb.ToString();
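For reference, a hypothetical shape for the randomWordParameters object the snippet reads (the field values are examples only; note that Random.Next treats the upper bound as exclusive):

var randomWordParameters = new
{
    WordMinLength = 10,
    WordMaxLength = 500,
    CharLowerBound = 33,  // '!'
    CharUpperBound = 127  // exclusive, so generated chars fall in 33..126
};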
My sample strings aren't a perfect representation of the real cases, but I believe they are good enough. The generator above produces completely random strings within a given size range; in the tests I used the character codes 33 ~ 127 passed to the Convert.ToChar() method. The strings provided by the system are in JSON format; in practice they are lists of URLs (tens of thousands of them), and since URLs usually contain random-looking character sequences, I tried to generate strings that are as random as possible.
The fact is that, in the case where I try to save a string that is already (before compression) longer than the maximum size (length) allowed in the column, the "data" that actually goes into the column is the "base 64" string generated after compression, not the reduced-size vector (byte array). I believe the database will refuse that base-64 string, since its length (in number of characters) is greater than the length of the original string.
So here is my question: is there any (invertible) way to convert a string into a smaller one, where by smaller I mean "with reduced length"? It seems that neither "Gzip" nor "Brotli" solves the problem.
PS: I made sure to repeat the term "length" several times to make it clear that I am talking about the number of characters, not the size in memory, because in several forums I read before, this confusion made it difficult to reach conclusions.
Comments (2)
The compression algorithms are exploiting repetitive patterns in the input stream. There is not much repetition in a typical URL, so compressing a single URL is unlikely to yield a representation that is much shorter than the original. In case the URL has no repetitive patterns at all (if it's close to a random string), the compression algorithm is going to yield a larger output than the input.
Below is a demonstration of this behavior, using Encoding.UTF8 to convert the URLs to bytes, and Encoding.Latin1 to convert the compressed bytes back to a string. I used three fairly long and non-repetitive URLs for the test:
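A minimal sketch of that pipeline (GZip here, though BrotliStream has the same shape; the Compress/Decompress helper names and the sample URL are illustrative, not the exact fiddle code):

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static string Compress(string value)
{
    byte[] bytes = Encoding.UTF8.GetBytes(value); // URL text -> UTF-8 bytes
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.SmallestSize))
        gzip.Write(bytes, 0, bytes.Length);
    // Latin1 maps each byte value 0-255 to exactly one char, so the resulting
    // string has one character per compressed byte (no Base64 expansion)
    return Encoding.Latin1.GetString(output.ToArray());
}

static string Decompress(string stored)
{
    byte[] bytes = Encoding.Latin1.GetBytes(stored);
    using var input = new MemoryStream(bytes);
    using var gzip = new GZipStream(input, CompressionMode.Decompress);
    using var result = new MemoryStream();
    gzip.CopyTo(result);
    return Encoding.UTF8.GetString(result.ToArray());
}

var url = "https://example.com/catalog/items?id=a1b2c3d4&session=9f8e7d6c5b4a&page=42";
var compressed = Compress(url);
Console.WriteLine($"original: {url.Length} chars, compressed: {compressed.Length} chars");
Console.WriteLine(Decompress(compressed) == url ? "round trip OK" : "round trip FAILED");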
Try it on Fiddle.
Two of the three URLs became slightly shorter after the compression, but the third URL was bloated instead.
You could store in the database either the compressed or the original value, depending on which one is shorter. You could prefix the stored value with some marker, for example 'C' or 'U', so that you know whether it's compressed or uncompressed.
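A sketch of that marker idea, assuming a Compress/Decompress pair along the lines shown above (the 'C'/'U' prefixes and helper names are just examples):

static string Encode(string value)
{
    var compressed = Compress(value);
    // store whichever form is shorter, tagged so it can be decoded later
    return compressed.Length < value.Length ? "C" + compressed : "U" + value;
}

static string Decode(string stored) =>
    stored[0] == 'C' ? Decompress(stored.Substring(1)) : stored.Substring(1);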
The max size for a column of type NVARCHAR(MAX) is 2 GB of storage.
Since NVARCHAR uses 2 bytes per character, that's approximately 1 billion characters.
So I don't think you actually need compression; if the problem is performance when retrieving the data, you can use a server-side caching system.
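For reference, the cap is 2^31 - 1 bytes, so at 2 bytes per character that works out to (2^31 - 1) / 2 ≈ 1,073,741,823 characters.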