How do I find the starting index of a string in a UTF-8 byte array? (C#)

Posted 2024-09-28 19:10:37

I have a UTF-8 byte array of data. I would like to search for a specific string in the array of bytes in C#.

byte[] dataArray = (some UTF-8 byte array of data);

string searchString = "Hello";

How do I find the first occurrence of the word "Hello" in the array dataArray and return an index location where the string begins (where the 'H' from 'Hello' would be located in dataArray)?

Before, I was erroneously using something like:

int helloIndex = Encoding.UTF8.GetString(dataArray).IndexOf("Hello");

Obviously, that code would not be guaranteed to work since I am returning the index of a String, not the index of the UTF-8 byte array. Are there any built-in C# methods or proven, efficient code I can reuse?

Thanks,

Matt

Comments (2)

执妄 2024-10-05 19:10:37

One of the nice features about UTF-8 is that if a sequence of bytes represents a character and that sequence of bytes appears anywhere in valid UTF-8 encoded data then it always represents that character.

Knowing this, you can convert the string you are searching for to a byte array and then use the Boyer-Moore string searching algorithm (or any other string searching algorithm you like) adapted slightly to work on byte arrays instead of strings.
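
As a rough illustration of that idea, here is a minimal sketch that encodes the search string to UTF-8 and scans the byte array directly. It uses a plain naive scan rather than Boyer-Moore, and the class and method names (Utf8Search, IndexOf) are hypothetical, not from any library. It assumes dataArray holds valid UTF-8.

```csharp
using System;
using System.Text;

public static class Utf8Search
{
    // Hypothetical helper (not part of the original answer): returns the byte
    // offset of the first occurrence of `value` in `data`, or -1 if absent.
    public static int IndexOf(byte[] data, string value)
    {
        // Encode the search string the same way the data is encoded.
        byte[] pattern = Encoding.UTF8.GetBytes(value);
        if (pattern.Length == 0) return 0;

        // Naive scan: try every start position where the pattern still fits.
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

For example, Utf8Search.IndexOf(Encoding.UTF8.GetBytes("ʤHello"), "Hello") returns 2, because "ʤ" occupies two bytes in UTF-8. On runtimes that provide spans (.NET Core 2.1 and later), MemoryExtensions.IndexOf over a ReadOnlySpan<byte> should give the same result without a hand-written loop.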

There are a number of answers here that can help you:

趁年轻赶紧闹 2024-10-05 19:10:37

Try the following snippet:

using System;
using System.Text;

// Set up a small test.
string sourceText = "ʤhello";
byte[] searchBytes = Encoding.UTF8.GetBytes(sourceText);

// Convert the bytes into a string we can search in.
string searchText = Encoding.UTF8.GetString(searchBytes);

// Ordinal comparison avoids culture-sensitive matching. IndexOf returns -1
// when there is no match, so check for that before Substring in real code.
int position = searchText.IndexOf("hello", StringComparison.Ordinal);

// Get all text that is before the position we found.
string before = searchText.Substring(0, position);

// The length of the encoded prefix is the number of UTF-8 bytes before the
// match, as opposed to the character position.
int bytesBefore = Encoding.UTF8.GetBytes(before).Length;

// This outputs "Position is 1 and before is 2".
Console.WriteLine("Position is {0} and before is {1}", position, bytesBefore);