Searching multibyte strings with a regular expression
I am working with HTML documents in a WebBrowser control, and I need to build a utility that searches for a word and highlights it in the browser. It works well when the string is in English, but for strings in other languages, for example Korean, it doesn't seem to work.
The scenario in which the code below works is this:
Suppose the user has selected the word "Example" in the web page. I need to highlight this word and all of its occurrences, and I also need to calculate their byte offsets (the code snippet only does that part).
For English the code below works fine, but for languages like Korean it does not work at all.
It never gets inside the outer foreach loop:
foreach (Match m in reg.Matches(this._documentContent))
Here _documentContent contains the web page source as a string, and OccurenceNo is the number of the occurrence of the selected word in the document.
Here's the code; strTemp contains the Korean string:
string strTemp = myRange.text;
string strExp = @">(([^<])*?)" + strTemp + "(([^<])*?)<";
int intCount = 0;
Regex reg = new Regex(strExp);
Regex reg1 = new Regex(strTemp);
foreach (Match m in reg.Matches(this._documentContent))
{
    string strMatch = m.Value;
    foreach (Match m2 in reg.Matches(strMatch))
    {
        intCount += 1;
        if (intCount == OccurenceNo)
        {
            int intCharOffset = m.Index + m2.Index;
            System.Text.UTF8Encoding d = new System.Text.UTF8Encoding();
            int intByteOffset = d.GetBytes(_documentContent.Substring(1, intCharOffset)).Length;
        }
    }
}
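For comparison, here is a minimal sketch of the same search with three changes that may be worth trying. The selected text is passed through Regex.Escape before being concatenated into the pattern. The inner loop searches with the plain-word regex rather than the outer one. The byte offset is counted from index 0 with Encoding.UTF8.GetByteCount. The class and method names (OccurrenceFinder, FindByteOffset) are hypothetical, and this is not claimed to be the cause of the Korean failure. It may also help to check whether _documentContent holds the Korean text literally or as numeric character references such as &#54620;, since in the latter case no literal match can be found.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class OccurrenceFinder
{
    // Returns the UTF-8 byte offset of the n-th (1-based) occurrence of `term`
    // found between '>' and '<' in `html`, or -1 if there are fewer than n.
    public static int FindByteOffset(string html, string term, int occurrenceNo)
    {
        // Escape the term so that characters such as '.', '(' or '+' match literally.
        string escaped = Regex.Escape(term);

        // A run of text between '>' and '<' that contains the term at least once.
        Regex textNode = new Regex(">[^<]*?" + escaped + "[^<]*?<");
        // The term on its own, used to find every occurrence inside that run.
        Regex word = new Regex(escaped);

        int count = 0;
        foreach (Match m in textNode.Matches(html))
        {
            foreach (Match m2 in word.Matches(m.Value))
            {
                count++;
                if (count == occurrenceNo)
                {
                    int charOffset = m.Index + m2.Index;
                    // Number of UTF-8 bytes in everything before the match.
                    return Encoding.UTF8.GetByteCount(html.Substring(0, charOffset));
                }
            }
        }
        return -1;
    }
}
```

With the input "<p>한글 test 한글</p>" and the term "한글", the second occurrence starts at character 11 and at byte 15, because each Hangul syllable takes three bytes in UTF-8.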
Comments (2)
If the code works for English words, but does not return any results for Korean, then I might suggest that it's a culture issue, so you might try setting the RegexOptions to CultureInvariant:
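A minimal sketch of how that option can be passed to the Regex constructor is shown here. The class and method names (CultureInvariantSearch, CountMatches) are hypothetical and not taken from this answer. As far as I know, CultureInvariant mainly influences case-insensitive comparisons, so it may not change the result for a plain literal search, but it is cheap to try.

```csharp
using System.Text.RegularExpressions;

public static class CultureInvariantSearch
{
    // Counts literal occurrences of `term` in `content`, asking the regex
    // engine to ignore culture-specific rules while matching.
    public static int CountMatches(string content, string term)
    {
        Regex reg = new Regex(Regex.Escape(term), RegexOptions.CultureInvariant);
        return reg.Matches(content).Count;
    }
}
```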
I am using the following RegEx code for Korean:
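One way a pattern for Korean text can be written in .NET, offered as an assumption and not as the code this answer used, is with the Unicode named blocks for Hangul. The class and method names (HangulMatcher, FirstHangulRun) are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class HangulMatcher
{
    // One or more consecutive Hangul characters: precomposed syllables,
    // conjoining jamo, or compatibility jamo.
    private static readonly Regex HangulRun = new Regex(
        @"[\p{IsHangulSyllables}\p{IsHangulJamo}\p{IsHangulCompatibilityJamo}]+");

    // Returns the first run of Hangul characters in `text`, or null if none.
    public static string FirstHangulRun(string text)
    {
        Match m = HangulRun.Match(text);
        return m.Success ? m.Value : null;
    }
}
```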