Searching a multi-byte string using regular expressions

Posted on 2024-08-10 20:52:51

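I am working with HTML documents in a WebBrowser control and need to build a utility that searches for a word and highlights it in the browser. It works well when the string is in English, but for strings in other languages, for example Korean, it does not seem to work.

The scenario the code below is meant to handle:

Suppose the user has selected the word "Example" in the web page. I need to highlight this word and all of its occurrences, and I also need to calculate their byte offsets (the code snippet below does only the latter).

For English the code below works fine, but for languages like Korean it does not work at all.

Execution never enters the foreach loop:

foreach (Match m in reg.Matches(this._documentContent))

Here _documentContent contains the web page source as a string, and OccurenceNo is the ordinal number of the occurrence of the selected word in the document.

Here is the code; strTemp contains the Korean string: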

string strTemp = myRange.text;
string strExp = @">(([^<])*?)" + strTemp + "(([^<])*?)<";

int intCount = 0;
Regex reg = new Regex(strExp);
Regex reg1 = new Regex(strTemp);
foreach (Match m in reg.Matches(this._documentContent))
{
    string strMatch = m.Value;
    foreach (Match m2 in reg.Matches(strMatch))
    {
        intCount += 1;
        if (intCount == OccurenceNo)
        {
            int intCharOffset = m.Index + m2.Index;
            System.Text.UTF8Encoding d = new System.Text.UTF8Encoding();
            int intByteOffset = d.GetBytes(_documentContent.Substring(1, intCharOffset)).Length;
        }
    }
}
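
For reference, here is a minimal sketch of how the lookup in the snippet above might be tightened. It is not the asker's code: OffsetFinder, FindByteOffset, html, word and occurrenceNo are hypothetical names, and the document is assumed to be UTF-8. Compared with the snippet it escapes the selected text so it is matched literally, uses the word-only regex in the inner loop, and takes the substring from index 0 so the byte count covers everything before the match. It does not by itself explain the Korean failure: if the page source stores the text differently from myRange.text (for example as HTML entities), a literal pattern would still not match.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class OffsetFinder
{
    // Returns the UTF-8 byte offset of the n-th (1-based) occurrence of `word`
    // found inside a text run (between '>' and '<') of `html`, or -1 if absent.
    public static int FindByteOffset(string html, string word, int occurrenceNo)
    {
        // Escape the selection so characters such as '.', '(' or '+' stay literal.
        string literal = Regex.Escape(word);
        var textRun = new Regex(">[^<]*?" + literal + "[^<]*?<", RegexOptions.CultureInvariant);
        var wordRegex = new Regex(literal, RegexOptions.CultureInvariant);

        int count = 0;
        foreach (Match run in textRun.Matches(html))
        {
            // Count every occurrence of the word inside this text run.
            foreach (Match hit in wordRegex.Matches(run.Value))
            {
                count++;
                if (count == occurrenceNo)
                {
                    int charOffset = run.Index + hit.Index;
                    // Bytes needed to encode everything before the match.
                    return Encoding.UTF8.GetByteCount(html.Substring(0, charOffset));
                }
            }
        }
        return -1;
    }
}
```

Calling OffsetFinder.FindByteOffset(pageSource, selectedText, 2) would then give the byte position of the second occurrence, with -1 signalling that fewer occurrences exist.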



浊酒尽余欢 2024-08-17 20:52:51

If the code works for English words but returns no results for Korean, it may be a culture issue, so you could try setting RegexOptions to CultureInvariant:

Regex reg = new Regex(strExp, RegexOptions.CultureInvariant);
Regex reg1 = new Regex(strTemp, RegexOptions.CultureInvariant);
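
Beyond the regex options, it may be worth checking whether the two strings contain the same code points at all. The following is a minimal diagnostic sketch under an assumption the question does not confirm: that the page source and the selected text might differ in Unicode normalization form (precomposed Hangul syllables versus decomposed jamo). MatchDiagnostics and MatchesAfterNormalizing are hypothetical names.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class MatchDiagnostics
{
    // Normalizes both strings to NFC, then searches for `word` as a literal.
    // If this returns true while the un-normalized search fails, the mismatch
    // lies in the text representation rather than in the regex options.
    public static bool MatchesAfterNormalizing(string content, string word)
    {
        string normalizedContent = content.Normalize(NormalizationForm.FormC);
        string normalizedWord = word.Normalize(NormalizationForm.FormC);
        return Regex.IsMatch(normalizedContent,
                             Regex.Escape(normalizedWord),
                             RegexOptions.CultureInvariant);
    }
}
```

If this check also fails, the difference is probably elsewhere, for example in how the page source was decoded into _documentContent.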


这个俗人 2024-08-17 20:52:51

I am using the following regex code for Korean:

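using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper class: extension methods must be declared in a static class.
public static class KoreanCharExtensions
{
    // Matches one precomposed Hangul syllable (U+AC00 '가' through U+D7A3 '힣').
    private static readonly Regex regexKorean = new Regex(@"[가-힣]");

    public static bool IsKorean(this char s)
    {
        return regexKorean.IsMatch(s.ToString());
    }
}

// Usage (inside any method): act when the text contains a Korean syllable.
if (someText.Any(z => z.IsKorean()))
{
    DoSomething();
}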
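
The same check can also be written without a regex. A minimal sketch, assuming only precomposed Hangul syllables (U+AC00 to U+D7A3) need to be detected, which is the same range as the character class above; HangulCheck, IsHangulSyllable and ContainsHangul are hypothetical names. Standalone jamo (U+1100 and up) fall outside this range, just as they do with the regex.

```csharp
public static class HangulCheck
{
    // True when c is a precomposed Hangul syllable, U+AC00 ('가') to U+D7A3 ('힣').
    public static bool IsHangulSyllable(char c)
    {
        return c >= '\uAC00' && c <= '\uD7A3';
    }

    // True when at least one character of text is a Hangul syllable.
    public static bool ContainsHangul(string text)
    {
        foreach (char c in text)
        {
            if (IsHangulSyllable(c))
            {
                return true;
            }
        }
        return false;
    }
}
```

A plain range comparison avoids allocating a one-character string per call, which may matter when scanning a whole page.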
