Best way to shorten a UTF8 string based on byte length

Posted on 2024-07-29 23:46:36

A recent project called for importing data into an Oracle database. The program that will do this is a C# .NET 3.5 app, and I'm using the Oracle.DataAccess connection library to handle the actual inserting.

I ran into a problem where I'd receive this error message when inserting a particular field:

ORA-12899 Value too large for column X

I used Field.Substring(0, MaxLength); but still got the error (though not for every record).

Finally I saw what should have been obvious: my string was in ANSI and the field was UTF8. The field's length is defined in bytes, not characters.

This gets me to my question. What is the best way to trim my string to fit the MaxLength?

My substring code works by character length. Is there a simple C# function that can trim a UTF8 string intelligently by byte length (i.e. not hack off half a character)?
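
To make the mismatch concrete, here is a minimal sketch of why a character-based Substring does not bound the byte length. The class and method names (ByteLengthDemo, Utf8Length, FitsInBytes) are made up for illustration; only Encoding.UTF8.GetByteCount is a real .NET call.

```csharp
using System;
using System.Text;

public static class ByteLengthDemo
{
    // Number of bytes the string occupies once encoded as UTF-8.
    public static int Utf8Length(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Hypothetical check: true when the value would fit a column
    // whose limit is expressed in bytes rather than characters.
    public static bool FitsInBytes(string s, int maxBytes)
    {
        return Utf8Length(s) <= maxBytes;
    }
}
```

For example, "naïve café" is 10 characters long, so Substring(0, 10) leaves it untouched, yet it needs 12 bytes in UTF-8 because ï and é take two bytes each.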

Comments (9)

夜雨飘雪 2024-08-05 23:46:36

I think we can do better than naively counting the total length of a string with each addition. LINQ is cool, but it can accidentally encourage inefficient code. What if I wanted the first 80,000 bytes of a giant UTF string? That's a lot of unnecessary counting. "I've got 1 byte. Now I've got 2. Now I've got 13... Now I have 52,384..."

That's silly. Most of the time, at least in l'anglais, we can cut exactly on that nth byte. Even in another language, we're never more than 3 bytes away from a good cutting point, since a UTF-8 character is at most 4 bytes long.

So I'm going to start from @Oren's suggestion, which is to key off of the leading bit of a UTF8 char value. Let's start by cutting right at the n+1th byte, and use Oren's trick to figure out if we need to cut a few bytes earlier.

Three possibilities

If the first byte after the cut has a 0 in the leading bit, I know I'm cutting precisely before a single byte (conventional ASCII) character, and can cut cleanly.

If I have a 11 following the cut, the next byte after the cut is the start of a multi-byte character, so that's a good place to cut too!

If I have a 10, however, I know I'm in the middle of a multi-byte character, and need to go back to check to see where it really starts.

That is, though I want to cut the string after the nth byte, if that n+1th byte comes in the middle of a multi-byte character, cutting would create an invalid UTF8 value. I need to back up until I get to one that starts with 11 and cut just before it.
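
The three cases above can be sketched as a small classifier. This is only an illustration under my own naming: Utf8ByteKind, Classify and IsSafeCutBefore are hypothetical and not part of .NET or of the code in this answer.

```csharp
public enum Utf8ByteKind { SingleByte, LeadByte, ContinuationByte }

public static class Utf8Bits
{
    // Classifies one UTF-8 byte by its leading bits.
    public static Utf8ByteKind Classify(byte b)
    {
        if ((b & 0x80) == 0x00) return Utf8ByteKind.SingleByte;   // 0xxxxxxx
        if ((b & 0xC0) == 0xC0) return Utf8ByteKind.LeadByte;     // 11xxxxxx
        return Utf8ByteKind.ContinuationByte;                     // 10xxxxxx
    }

    // True when cutting immediately before bytes[index] keeps every character whole.
    public static bool IsSafeCutBefore(byte[] bytes, int index)
    {
        // Cutting at either end of the array cannot split a character.
        if (index <= 0 || index >= bytes.Length) return true;
        return Classify(bytes[index]) != Utf8ByteKind.ContinuationByte;
    }
}
```

Only the ContinuationByte case forces a walk backwards; the other two mean the cut already sits on a character boundary.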

Code

Notes: I'm using stuff like Convert.ToByte("11000000", 2) so that it's easy to tell what bits I'm masking. In a nutshell, I'm &ing to return what's in the byte's first two bits and bringing back 0s for the rest. Then I check the XX from XX000000 to see if it's 10 or 11, where appropriate.

I found out today that newer C# (7.0 and later) supports binary literals such as 0b11000000, which is cool, but we'll keep using this kludge for now to illustrate what's going on.

The PadLeft is just because I'm overly OCD about output to the Console.

So here's a function that'll cut you down to a string that's n bytes long, or the greatest length less than n that ends with a "complete" UTF8 character.

public static string CutToUTF8Length(string str, int byteLength)
{
    byte[] byteArray = Encoding.UTF8.GetBytes(str);
    string returnValue = string.Empty;

    if (byteArray.Length > byteLength)
    {
        int bytePointer = byteLength;

        // Check high bit to see if we're [potentially] in the middle of a multi-byte char
        if (bytePointer >= 0 
            && (byteArray[bytePointer] & Convert.ToByte("10000000", 2)) > 0)
        {
            // If so, keep walking back until we have a byte starting with `11`,
            // which means the first byte of a multi-byte UTF8 character.
            while (bytePointer >= 0 
                && Convert.ToByte("11000000", 2) != (byteArray[bytePointer] & Convert.ToByte("11000000", 2)))
            {
                bytePointer--;
            }
        }

        // See if we had 1s in the high bit all the way back. If so, we're toast. Return empty string.
        if (0 != bytePointer)
        {
            returnValue = Encoding.UTF8.GetString(byteArray, 0, bytePointer); // hat tip to @NealEhardt! Well played. ;^)
        }
    }
    else
    {
        returnValue = str;
    }

    return returnValue;
}

I initially wrote this as a string extension. Just add back the this before string str to put it back into extension format, of course. I removed the this so that we could just slap the method into Program.cs in a simple console app to demonstrate.

Test and expected output

Here's a good test case, with the output it creates below, written to be the Main method in a simple console app's Program.cs.

static void Main(string[] args)
{
    string testValue = "12345“”67890”";

    for (int i = 0; i < 15; i++)
    {
        string cutValue = Program.CutToUTF8Length(testValue, i);
        Console.WriteLine(i.ToString().PadLeft(2) +
            ": " + Encoding.UTF8.GetByteCount(cutValue).ToString().PadLeft(2) +
            ":: " + cutValue);
    }

    Console.WriteLine();
    Console.WriteLine();

    foreach (byte b in Encoding.UTF8.GetBytes(testValue))
    {
        Console.WriteLine(b.ToString().PadLeft(3) + " " + (char)b);
    }

    Console.WriteLine("Return to end.");
    Console.ReadLine();
}

Output follows. Notice that the "smart quotes" in testValue are three bytes long in UTF8 (though when we write the chars to the console in ASCII, it outputs dumb quotes). Also note the ?s output for the second and third bytes of each smart quote in the output.

The first five characters of our testValue are single bytes in UTF8, so 0-5 byte values should be 0-5 characters. Then we have a three-byte smart quote, which can't be included in its entirety until 5 + 3 bytes. Sure enough, we see that pop out at the call for 8. Our next smart quote pops out at 8 + 3 = 11, and then we're back to single byte characters through 14.

 0:  0::
 1:  1:: 1
 2:  2:: 12
 3:  3:: 123
 4:  4:: 1234
 5:  5:: 12345
 6:  5:: 12345
 7:  5:: 12345
 8:  8:: 12345"
 9:  8:: 12345"
10:  8:: 12345"
11: 11:: 12345""
12: 12:: 12345""6
13: 13:: 12345""67
14: 14:: 12345""678


 49 1
 50 2
 51 3
 52 4
 53 5
226 â
128 ?
156 ?
226 â
128 ?
157 ?
 54 6
 55 7
 56 8
 57 9
 48 0
226 â
128 ?
157 ?
Return to end.
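
The jumps in the first table fall exactly on character boundaries. As a rough sketch, those boundaries can be listed directly from the bytes. The CutPoints class and its ValidCutPoints helper are hypothetical names for this illustration, not part of the code above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CutPoints
{
    // Returns every byte offset at which the UTF-8 form of s can be cut
    // without splitting a character: the start of each character, plus the end.
    public static List<int> ValidCutPoints(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        var points = new List<int>();
        for (int i = 0; i < bytes.Length; i++)
        {
            // Any byte that is not a continuation byte (10xxxxxx) starts a character.
            if ((bytes[i] & 0xC0) != 0x80)
            {
                points.Add(i);
            }
        }
        points.Add(bytes.Length); // the end of the string is always a valid cut
        return points;
    }
}
```

For the testValue used here, this yields 0, 1, 2, 3, 4, 5, 8, 11, 12, 13, 14, 15, 16, 19, which matches the byte counts in the table: requests for 6 or 7 bytes fall back to 5, and requests for 9 or 10 fall back to 8.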

So that's kind of fun, and I'm in just before the question's five year anniversary. Though Oren's description of the bits had a small error, that's exactly the trick you want to use. Thanks for the question; neat.

絕版丫頭 2024-08-05 23:46:36

Here are two possible solutions: a LINQ one-liner processing the input left to right, and a traditional for-loop processing the input from right to left. Which processing direction is faster depends on the string length, the allowed byte length, and the number and distribution of multibyte characters, so it is hard to give a general suggestion. The decision between LINQ and traditional code is probably a matter of taste (or maybe speed).

If speed matters, one could think about just accumulating the byte length of each character until reaching the maximum length instead of calculating the byte length of the whole string in each iteration. But I am not sure if this will work because I don't know UTF-8 encoding well enough. I could theoretically imagine that the byte length of a string does not equal the sum of the byte lengths of all characters.

public static String LimitByteLength(String input, Int32 maxLength)
{
    return new String(input
        .TakeWhile((c, i) =>
            Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        .ToArray());
}

public static String LimitByteLength2(String input, Int32 maxLength)
{
    for (Int32 i = input.Length - 1; i >= 0; i--)
    {
        if (Encoding.UTF8.GetByteCount(input.Substring(0, i + 1)) <= maxLength)
        {
            return input.Substring(0, i + 1);
        }
    }

    return String.Empty;
}
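
On the doubt raised above: UTF-8 encodes each code point independently, so the byte length of a string is the sum of the byte lengths of its code points. The catch in .NET is that a code point outside the Basic Multilingual Plane occupies two UTF-16 chars (a surrogate pair), so the pair has to be measured as one unit. A minimal sketch of the accumulating approach under that assumption follows; ByteLimiter and LimitByteLengthAccumulating are names invented for this example.

```csharp
using System.Text;

public static class ByteLimiter
{
    // Scans left to right, adding up UTF-8 bytes per code point,
    // and stops before the code point that would exceed maxLength.
    public static string LimitByteLengthAccumulating(string input, int maxLength)
    {
        if (string.IsNullOrEmpty(input)) return input;
        if (maxLength <= 0) return string.Empty;

        int byteCount = 0;
        int i = 0;
        while (i < input.Length)
        {
            // A surrogate pair is two UTF-16 chars but a single code point; keep it together.
            bool isPair = char.IsHighSurrogate(input[i])
                          && i + 1 < input.Length
                          && char.IsLowSurrogate(input[i + 1]);
            int charCount = isPair ? 2 : 1;

            int size = Encoding.UTF8.GetByteCount(input.Substring(i, charCount));
            if (byteCount + size > maxLength) break;

            byteCount += size;
            i += charCount;
        }
        return input.Substring(0, i);
    }
}
```

With this sketch, LimitByteLengthAccumulating("12345“”67890”", 7) returns "12345", and each character is measured once instead of re-encoding the whole prefix on every step.
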
太傻旳人生 2024-08-05 23:46:36

All of the other answers appear to miss the fact that this functionality is already built into .NET, in the Encoder class. For bonus points, this approach will also work for other encodings.

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    char[] messageChars = message.ToCharArray();
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);

    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}

If you're using .NET Standard 2.1+, you can simplify it a bit:

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }

    var encoder = Encoding.UTF8.GetEncoder();
    byte[] buffer = new byte[maxLength];
    encoder.Convert(message.AsSpan(), buffer.AsSpan(), false, out _, out int bytesUsed, out _);
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed);
}

None of the other answers account for extended grapheme clusters, such as 👩🏽‍🚒. This is composed of 4 Unicode scalars (👩, 🏽, a zero-width joiner, and 🚒), so you need knowledge of the Unicode standard to avoid splitting it in the middle and producing 👩 or 👩🏽.

In .NET 5 onwards, you can write this as:

public static string LimitByteLength(string message, int maxLength)
{
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= maxLength)
    {
        return message;
    }
    
    var enumerator = StringInfo.GetTextElementEnumerator(message);
    var result = new StringBuilder();
    int lengthBytes = 0;
    while (enumerator.MoveNext())
    {
        lengthBytes += Encoding.UTF8.GetByteCount(enumerator.GetTextElement());
        if (lengthBytes <= maxLength)
        {
            result.Append(enumerator.GetTextElement()); 
        }
    }
    
    return result.ToString();
}

(This same code runs on earlier versions of .NET, but due to a bug it won't produce the correct result before .NET 5).

愁以何悠 2024-08-05 23:46:36

Shorter version of ruffin's answer. Takes advantage of the design of UTF8:

    public static string LimitUtf8ByteCount(this string s, int n)
    {
        // quick test (we probably won't be trimming most of the time)
        if (Encoding.UTF8.GetByteCount(s) <= n)
            return s;
        // get the bytes
        var a = Encoding.UTF8.GetBytes(s);
        // if we are in the middle of a character (highest two bits are 10)
        if (n > 0 && ( a[n]&0xC0 ) == 0x80)
        {
            // remove all bytes whose two highest bits are 10
            // and one more (start of multi-byte sequence - highest bits should be 11)
            while (--n > 0 && ( a[n]&0xC0 ) == 0x80)
                ;
        }
        // convert back to string (with the limit adjusted)
        return Encoding.UTF8.GetString(a, 0, n);
    }
有木有妳兜一样 2024-08-05 23:46:36

If a UTF-8 byte has a zero-valued high order bit, it's the beginning of a character. If its high order bit is 1, it's in the 'middle' of a character. The ability to detect the beginning of a character was an explicit design goal of UTF-8.

Check out the Description section of the Wikipedia article on UTF-8 for more detail.
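
One refinement to the rule above: a high bit of 1 only says that the byte belongs to a multi-byte character. The top two bits tell the two cases apart: 11 marks the first byte of the character, and 10 marks a continuation byte. A small sketch of using that to find where a character starts follows; Utf8Boundary and FindCharStart are hypothetical names, not an existing .NET API.

```csharp
public static class Utf8Boundary
{
    // Returns the index of the first byte of the character that contains bytes[index].
    // Assumes bytes holds valid UTF-8 and index is within the array.
    public static int FindCharStart(byte[] bytes, int index)
    {
        // Continuation bytes look like 10xxxxxx; step back until we leave them.
        while (index > 0 && (bytes[index] & 0xC0) == 0x80)
        {
            index--;
        }
        return index;
    }
}
```

Because a UTF-8 character is at most 4 bytes long, the loop steps back at most 3 times on valid input.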

酸甜透明夹心 2024-08-05 23:46:36

Is there a reason that you need the database column to be declared in terms of bytes? That's the default, but it's not a particularly useful default if the database character set is variable width. I'd strongly prefer declaring the column in terms of characters.

CREATE TABLE length_example (
  col1 VARCHAR2( 10 BYTE ),
  col2 VARCHAR2( 10 CHAR )
);

This will create a table where col1 will store 10 bytes of data and col2 will store 10 characters' worth of data. Character length semantics make far more sense in a UTF8 database.

Assuming you want all the tables you create to use character length semantics by default, you can set the initialization parameter NLS_LENGTH_SEMANTICS to CHAR. At that point, any tables you create will default to using character length semantics rather than byte length semantics if you don't specify CHAR or BYTE in the field length.

遥远的她 2024-08-05 23:46:36

Following Oren Trutner's comment, here are two more solutions to the problem.
Here we count the number of bytes to remove from the end of the string, one character at a time starting from the last one, so we don't evaluate the entire string in every iteration.

string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;
var bytesArr = Encoding.UTF8.GetBytes(str);
int bytesToRemove = 0;
int lastIndexInString = str.Length - 1;
// Walk backwards one char at a time, adding up the bytes each char contributes,
// until what remains fits in maxBytesLength.
// Note: chars are measured individually, so this assumes str contains no surrogate pairs.
while (bytesArr.Length - bytesToRemove > maxBytesLength)
{
   bytesToRemove += Encoding.UTF8.GetByteCount(new char[] { str[lastIndexInString] });
   --lastIndexInString;
}
string trimmedString = Encoding.UTF8.GetString(bytesArr, 0, bytesArr.Length - bytesToRemove);
// Encoding.UTF8.GetByteCount(trimmedString) gives the actual length, which will be <= maxBytesLength

And an even more efficient (and maintainable) solution:
get the string from the byte array according to the desired length, then cut the last character because it might be corrupted.

string str = "朣楢琴执执 瑩浻牡楧硰执执獧浻牡楧敬瑦 瀰 絸朣杢执獧扻捡杫潲湵 潣";
int maxBytesLength = 30;    
string trimmedWithDirtyLastChar = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(str),0,maxBytesLength);
string trimmedString = trimmedWithDirtyLastChar.Substring(0,trimmedWithDirtyLastChar.Length - 1);

The only downside of the second solution is that we might cut a perfectly fine last character, but we are already cutting the string, so it might fit the requirements.
Thanks to Shhade, who thought of the second solution.
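
A possible variation on the second solution, offered only as a sketch: Encoding.UTF8 decodes a character that was split by the cut to the replacement character U+FFFD, so one can drop just that debris instead of always dropping the last character. DirtyTailTrim and LimitByDecodingPrefix are invented names, and the sketch assumes the text does not legitimately contain U+FFFD right at the cut.

```csharp
using System;
using System.Text;

public static class DirtyTailTrim
{
    public static string LimitByDecodingPrefix(string str, int maxBytes)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(str);
        if (bytes.Length <= maxBytes) return str;   // already short enough
        if (maxBytes <= 0) return string.Empty;

        // Decode only the first maxBytes bytes; a partial character at the end
        // comes back as U+FFFD rather than throwing.
        string decoded = Encoding.UTF8.GetString(bytes, 0, maxBytes);

        // Remove the replacement character(s) left by the partial character.
        return decoded.TrimEnd('\uFFFD');
    }
}
```

Unlike the version above, this keeps the last character when the cut happens to land on a boundary, and it returns the input unchanged when it is already within the limit.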

↘紸啶 2024-08-05 23:46:36

This is another solution based on binary search:

public string LimitToUTF8ByteLength(string text, int size)
{
    if (size <= 0)
    {
        return string.Empty;
    }

    int maxLength = text.Length;
    int minLength = 0;
    int length = maxLength;

    while (maxLength >= minLength)
    {
        length = (maxLength + minLength) / 2;
        int byteLength = Encoding.UTF8.GetByteCount(text.Substring(0, length));

        if (byteLength > size)
        {
            maxLength = length - 1;
        }
        else if (byteLength < size)
        {
            minLength = length + 1;
        }
        else
        {
            return text.Substring(0, length); 
        }
    }

    // Round down the result
    string result = text.Substring(0, length);
    if (size >= Encoding.UTF8.GetByteCount(result))
    {
        return result;
    }
    else
    {
        return text.Substring(0, length - 1);
    }
}
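
One caveat worth hedging: text.Substring(0, length) counts UTF-16 chars, so a chosen length can fall between the two halves of a surrogate pair. A small guard, sketched here with the hypothetical names SurrogateGuard and AvoidSplitPair, could be applied to the final length before taking the substring.

```csharp
public static class SurrogateGuard
{
    // Returns a prefix length no greater than `length` that does not end
    // between the two halves of a surrogate pair.
    public static int AvoidSplitPair(string text, int length)
    {
        if (length > 0 && length < text.Length
            && char.IsHighSurrogate(text[length - 1])
            && char.IsLowSurrogate(text[length]))
        {
            return length - 1; // step back so the whole pair is excluded
        }
        return length;
    }
}
```

Stepping back only ever shortens the prefix, so the byte limit still holds afterwards.
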
末蓝 2024-08-05 23:46:36
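public static string LimitByteLength3(string input, int maxLength)
{
    string result = input;

    int byteCount = Encoding.UTF8.GetByteCount(input);
    if (byteCount > maxLength)
    {
        var byteArray = Encoding.UTF8.GetBytes(input);
        int cut = maxLength;
        // Back up past continuation bytes (10xxxxxx) so the cut never lands
        // mid-character; otherwise GetString would emit U+FFFD for the fragment.
        while (cut > 0 && (byteArray[cut] & 0xC0) == 0x80)
        {
            cut--;
        }
        result = Encoding.UTF8.GetString(byteArray, 0, cut);
    }

    return result;
}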