Parsec: matching a single Unicode character
I'm trying to create a parser (with parsec) that parses tokens delimited by newlines, commas, semicolons and Unicode dashes (ndash and mdash):
    authorParser = do
        name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
        many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')
But the ndash-mdash (\8212, \8213) part never 'succeeds' and I'm getting invalid parse results.
How do I specify Unicode dashes with the char parser?
P.S. I've tried (chr 8212) and (chr 8213) too. It doesn't help.
ADDITION: It is better to use Data.Text. The switch from ByteString madness to Data.Text saved me a lot of time and 'source space' :)
Works for me:
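A minimal sketch of such a check, assuming parsec 3 and plain String input. The names separators, tokenP and tokensP are made up for illustration and do not come from the question:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Separator characters, copied from the question's parser.
separators :: [Char]
separators = [',', ':', '\r', '\n', '\8212', '\8213']

-- One token: a non-empty run of non-separator characters.
tokenP :: Parser String
tokenP = many1 (noneOf separators)

-- Tokens separated by one or more separator characters.
tokensP :: Parser [String]
tokensP = tokenP `sepBy` many1 (oneOf separators)

-- Example: parse tokensP "" "Knuth\8212Dijkstra,Hoare"
```

With a String stream every element handed to noneOf and oneOf is a complete Char, so '\8212' and '\8213' are compared as whole code points.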
How did you try?
The above was using plain String, which works without problems because a Char is a full Unicode code point. It's not as nice with other types of input stream. Text will probably also work well for this example; I think the dashes are encoded as a single code unit there. For ByteString, however, things are more complicated. If you're using plain Data.ByteString.Char8 (strict or lazy, doesn't matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that all Chars congruent to 20 or 21 modulo 256 will be mapped to the same byte as one of the dashes.
However, it is likely that the input stream is UTF-8 encoded. Then the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn't match what you get by truncating. Trying to parse UTF-8 encoded text with ByteString and parsec is a bit more involved: the units of which the parse result is composed are not single bytes, but sequences of bytes, 1-4 in length each.
To use noneOf, you need a Stream instance which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn't: it uses the Data.ByteString[.Lazy].Char8 interface. Depending on how the input was encoded, the dash '\8212' would therefore either become a single '\20', which does not match '\8212', or produce three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
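The byte-level mismatch can be reproduced directly, and one possible way around it is to decode the bytes to Text before parsing. The following is only a sketch under assumptions: it assumes the input really is UTF-8, it relies on the text and bytestring packages alongside parsec, and the helper names (char8Bytes, utf8Bytes, tokensP, parseUtf8) are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)
import Text.Parsec
import Text.Parsec.Text (Parser)

-- Bytes produced by the Char8 interface: every Char is cut down to 8 bits.
char8Bytes :: String -> [Word8]
char8Bytes = B.unpack . BC.pack

-- Bytes produced by a real UTF-8 encoding of the same characters.
utf8Bytes :: String -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.pack

-- Separator characters, copied from the question's parser.
separators :: [Char]
separators = [',', ':', '\r', '\n', '\8212', '\8213']

-- Tokens separated by runs of separators, parsed from Text so that
-- every element of the stream is a full Char.
tokensP :: Parser [String]
tokensP = many1 (noneOf separators) `sepBy` many1 (oneOf separators)

-- Decode the raw bytes as UTF-8 first, then run the parser on the Text.
-- (Assumption: the input is UTF-8; invalid input yields a Left.)
parseUtf8 :: B.ByteString -> Either String [String]
parseUtf8 bytes = case TE.decodeUtf8' bytes of
  Left err  -> Left (show err)
  Right txt -> either (Left . show) Right (parse tokensP "" txt)
```

Here char8Bytes "\8212\8213" evaluates to [20,21], while utf8Bytes "\8212" evaluates to [226,128,148], which are the numbers quoted above. Decoding up front keeps the parser itself unchanged from the String version; only the stream type differs.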