Parsec: match a single unicode character

Posted 2024-12-21 18:04:06

I'm trying to create a parser (with parsec) that parses tokens delimited by newlines, commas, semicolons and unicode dashes (ndash and mdash):

authorParser = do
    name <- many1 (noneOf [',', ':', '\r', '\n', '\8212', '\8213'])
    many (char ',' <|> char ':' <|> char '-' <|> char '\8212' <|> char '\8213')

But the ndash-mdash (\8212, \8213) part never 'succeeds' and I'm getting invalid parse results.

How do I specify unicode dashes with the char parser?

P.S. I've tried (chr 8212), (chr 8213) too. It doesn't help.

ADDITION: It is better to use Data.Text. The switch from ByteStrings madness to Data.Text saved me a lot of time and 'source space' :)

肤浅与狂妄 2024-12-28 18:04:06

Works for me:

Prelude Text.ParserCombinators.Parsec> let authorName = do { name <- many1 (noneOf ",:\r\n\8212\8213"); many (oneOf ",:-\8212\8213"); }
Prelude Text.ParserCombinators.Parsec> parse authorName "" "my Name,\8212::-:\8213,"
Right ",\8212::-:\8213,"

How did you try?

The above was using plain String, which works without problems because a Char is a full Unicode code point. It's not as nice with other types of input stream. Text will probably also work well for this example; I think the dashes are each encoded as a single code unit there. For ByteString, however, things are more complicated. If you're using plain Data.ByteString.Char8 (strict or lazy, doesn't matter), the Chars get truncated on packing: only the least significant 8 bits are retained, so '\8212' becomes 20 and '\8213' becomes 21. If the input stream is constructed the same way, that still kind of works, except that every Char congruent to 20 or 21 modulo 256 gets mapped to the same byte as one of the dashes.
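To see the truncation concretely, here is a small sketch using the strict Data.ByteString.Char8 interface. The names truncatedDashes and collides are made up for illustration; only pack and unpack come from the library.

```haskell
import qualified Data.ByteString.Char8 as BC

-- Pack the two dashes through the Char8 interface and unpack them again.
-- Only the low 8 bits of each Char survive: 8212 mod 256 = 20, 8213 mod 256 = 21.
truncatedDashes :: String
truncatedDashes = BC.unpack (BC.pack "\8212\8213")

-- After packing, '\8212' is indistinguishable from the control character '\20'.
collides :: Bool
collides = BC.pack "\8212" == BC.pack "\20"
```

Loading this in GHCi and evaluating truncatedDashes should show the two control characters rather than the dashes, which is why a parser built this way can only match input that was truncated the same way.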

However, it is likely that the input stream is UTF-8 encoded. In that case the dashes are encoded as three bytes each, "\226\128\148" and "\226\128\149" respectively, which doesn't match what you get by truncating. Trying to parse UTF-8 encoded text with ByteString and parsec is a bit more involved: the units of which the parse result is composed are not single bytes, but sequences of bytes, each 1-4 bytes long.
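The byte sequences quoted above can be checked with the text package. This is only a sketch, and utf8Bytes is a hypothetical helper name, not something parsec or text provides.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)
import Data.Word (Word8)

-- UTF-8 encode a single Char and list the resulting bytes.
utf8Bytes :: Char -> [Word8]
utf8Bytes = B.unpack . encodeUtf8 . T.singleton
```

Here utf8Bytes '\8212' gives the three bytes 226, 128, 148, and utf8Bytes '\8213' gives 226, 128, 149, none of which is the 20 or 21 produced by truncation.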

To use noneOf, you need an

instance Text.Parsec.Prim.Stream ByteString m Char

which does the right thing. The instance provided in Text.Parsec.ByteString[.Lazy] doesn't: it uses the Data.ByteString[.Lazy].Char8 interface. Depending on how the input was encoded, an em dash would therefore either become a single '\20', which doesn't match '\8212', or produce three Chars, '\226', '\128' and '\148', in three successive calls to uncons, none of which matches '\8212' either.
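One way to sidestep the missing instance, in line with the ADDITION in the question, is to decode the bytes to Text first and run the parser over Text. The following is a minimal sketch under some assumptions: the input is UTF-8, Text.Parsec.Text is available (it ships with recent parsec versions), and the names parseAuthorBytes and sampleBytes are invented for this example. Unlike the parser in the question, this variant returns the name rather than the delimiters.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', encodeUtf8)
import Text.Parsec (many, many1, noneOf, oneOf, parse)
import Text.Parsec.Text (Parser)

-- Same character sets as the parser above, but running over Text,
-- where every element of the stream is a full Char.
authorName :: Parser String
authorName = do
  name <- many1 (noneOf ",:\r\n\8212\8213")
  _ <- many (oneOf ",:-\8212\8213")  -- skip the trailing delimiters
  return name

-- Decode UTF-8 bytes, then parse. Both failure modes are reported as Left.
parseAuthorBytes :: B.ByteString -> Either String String
parseAuthorBytes bytes =
  case decodeUtf8' bytes of
    Left err  -> Left (show err)
    Right txt -> either (Left . show) Right (parse authorName "" txt)

-- Sample input: a name followed by '\8212', encoded as UTF-8 bytes.
sampleBytes :: B.ByteString
sampleBytes = encodeUtf8 (T.pack "my Name\8212other")
```

With this, parseAuthorBytes sampleBytes stops the name at the dash and yields Right "my Name". Decoding up front costs one pass over the input, but it keeps the parser itself free of any byte-level concerns.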
