ANTLR exception - "Unable to translate Unicode character \uDCAF at index 111 to specified code page."
Calling all ANTLR experts!
I have a .NET assembly hosted in an IIS website that employs ANTLR to do natural language query processing, search-engine style. For instance, if a user types:
cheese and crackers and not chips
It builds the following statement:
AND(AND("cheese", "crackers"), NOT("chips"))
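For illustration, the shape of that statement can be sketched as a small tree rendered in prefix form. This is a minimal sketch in Python, chosen only for brevity; the nested-tuple representation and the `render` helper are hypothetical and are not taken from the actual service or from the ANTLR-generated code.

```python
# Hypothetical tree shape: a bare string is a search term,
# and a tuple is (operator, child, child, ...).
def render(node):
    """Render a nested-tuple query tree as a prefix statement string."""
    if isinstance(node, str):
        return '"{}"'.format(node)
    op, *children = node
    return "{}({})".format(op, ", ".join(render(child) for child in children))


# Tree for the query: cheese and crackers and not chips
tree = ("AND", ("AND", "cheese", "crackers"), ("NOT", "chips"))
print(render(tree))  # prints AND(AND("cheese", "crackers"), NOT("chips"))
```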
We then fire that statement off to our content store and serve up some content to the user. 99.9% of the time everything works great. However, every once in a while something goes haywire with ANTLR: the IIS-hosted site that does this processing gets stuck in some sort of error state and throws errors non-stop until we do an IISReset or an AppPool recycle. After the recycle, the errors cease immediately.
I'm capturing the stack trace of these errors, which I've included (sanitized, per company policy) below:
System.Text.EncoderFallbackException: Unable to translate Unicode character \uDCAF at index 111 to specified code page.
at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
at System.Text.UTF8Encoding.GetBytes(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, EncoderNLS baseEncoder)
at System.Text.EncoderNLS.GetBytes(Char* chars, Int32 charCount, Byte* bytes, Int32 byteCount, Boolean flush)
at System.Text.EncoderNLS.GetBytes(Char[] chars, Int32 charIndex, Int32 charCount, Byte[] bytes, Int32 byteIndex, Boolean flush)
at System.IO.StreamWriter.Flush(Boolean flushStream, Boolean flushEncoder)
at System.IO.StreamWriter.Write(Char[] buffer, Int32 index, Int32 count)
at System.IO.TextWriter.WriteLine(String value)
at System.IO.TextWriter.SyncTextWriter.WriteLine(String value)
at Antlr.Runtime.BaseRecognizer.EmitErrorMessage(String msg)
at Service123.Parser.atomicExpression() in Parser.cs:line 927
at Service123.Parser.notExpression() in Parser.cs:line 657
at Service123.Parser.orExpression() in Parser.cs:line 516
at Service123.Parser.andnotExpression() in Parser.cs:line 416
at Service123.Parser.andExpression() in Parser.cs:line 234
at Service123.Parser.startExpression() in Parser.cs:line 167
at Service123.Processor.ProcessQuery(String queryString) in Processor.cs:line 34
at Service123.Search.ProcessQueryString(String query) in Search.cs:line 1017
Below is a copy of my grammar file (again sanitized, per company policy):
grammar Parser;
options { language = CSharp2; output = AST; }
tokens { IMPLICIT_AND; }
@lexer::namespace { Service123.Parser }
@parser::namespace { Service123.Parser }
L_PARENTHESIS : '(';
R_PARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
fragment LETTER : ('a'..'z'|'A'..'Z');
fragment NUMBER : ('0'..'9');
fragment SYMBOL_1 : ('+'|'-'|'_'|'|'|'~'|'&'|'`'|'='|'['|']'|'{'|'}');
fragment SYMBOL_2 : ('!'|'@'|'#'|'$'|'%'|'^'|'*'|','|'.'|'/'|':'|';'|'<'|'>'|'?'|'\''|'\\');
fragment SYMBOL_QUOTE : ('"');
fragment SPACE : (' '|'\n'|'\r'|'\t'|'\u000C');
WS : (SPACE) { $channel=HIDDEN; };
PHRASE : (SYMBOL_QUOTE)(LETTER|NUMBER|SYMBOL_1|SYMBOL_2)+((SPACE)+(LETTER|NUMBER|SYMBOL_1|SYMBOL_2)+)+(SYMBOL_QUOTE);
WORD : (LETTER|NUMBER|SYMBOL_1)+;
startExpression : andExpression;
andExpression : ( andnotExpression -> andnotExpression )
(AND? e = andnotExpression -> ^(IMPLICIT_AND $andExpression $e))*;
andnotExpression : orExpression (ANDNOT^ orExpression)*;
orExpression : notExpression (OR^ notExpression)*;
notExpression : (NOT^)? atomicExpression;
atomicExpression : PHRASE | WORD | L_PARENTHESIS! andExpression R_PARENTHESIS!;
I'm also logging the query strings that go along with these errors and they appear to be common, run-of-the-mill English search terms.
As for the error, the code point in question is not always \uDCAF, but it is consistent throughout an error cycle: it stays the same until we bounce the service, and when the error crops up again after a week of things working fine, it is a different one.
All of the code points I've been able to record are surrogates, that is, one half of a surrogate pair, and do not represent valid characters on their own.
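To make the surrogate observation concrete, here is a minimal sketch, again in Python purely for brevity, of why a surrogate on its own cannot be written out as strict UTF-8. The code point U+1F4AF is an arbitrary example whose low surrogate happens to be 0xDCAF; it is not taken from the logs, and the helper names are hypothetical. The stack trace above suggests the .NET UTF-8 encoder with an exception fallback rejects the character in a comparable way, though this sketch does not exercise any .NET code.

```python
def utf16_surrogates(code_point):
    """Split a supplementary code point (above U+FFFF) into its UTF-16 (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def strict_utf8(text):
    """Encode text as UTF-8 with strict error handling; return None if it cannot be encoded."""
    try:
        return text.encode("utf-8")  # "strict" is the default error handler
    except UnicodeEncodeError:
        return None


# U+1F4AF is an arbitrary example whose low surrogate happens to be 0xDCAF.
high, low = utf16_surrogates(0x1F4AF)
print(hex(high), hex(low))  # prints 0xd83d 0xdcaf

# The complete character encodes; the low surrogate on its own does not.
print(strict_utf8(chr(0x1F4AF)))  # prints b'\xf0\x9f\x92\xaf'
print(strict_utf8(chr(low)))      # prints None
```

If that reading is right, any string that contains only one half of a pair (for example, text cut in the middle of a supplementary character) would fail in the same way whenever it is written through a strict UTF-8 writer.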
I'm an admitted ANTLR novice and don't know enough of its inner workings to diagnose much further than this. It almost seems to me that there's a singleton in the ANTLR runtime that gets screwed up somehow and renders all further processing useless until we reload the assemblies. I have no proof of this, however.
If you need more details or clarification, please don't hesitate to ask, because I'm at my wit's end.