Implementing the Read type class when the strings to parse contain "$"

Posted on 2024-12-05 03:48:31

I've been playing with Haskell for about a month. For my first "real" Haskell project I'm writing a parts-of-speech tagger. As part of this project I have a type called Tag that represents a parts-of-speech tag, implemented as follows:

data Tag = CC | CD | DT | EX | FW | IN | JJ | JJR | JJS ...

The above is a long list of standardized parts-of-speech tags which I've intentionally truncated. However, in this standard set of tags there are two that end in a dollar sign ($): PRP$ and NNP$. Because I can't have data constructors with $ in their name, I've elected to rename them PRPS and NNPS.

This is all well and good, but I'd like to read tags from strings in a lexicon and convert them to my Tag type. Trying this fails:

instance Read Tag where
    readsPrec _ input =
        (\inp -> [((NNPS), rest) | ("NNP$", rest) <- lex inp]) input

The Haskell lexer chokes on the $. Any ideas how to pull this off?

Implementing Show was fairly straightforward. It would be great if there were some similar strategy for Read.

instance Show Tag where
    showsPrec _ NNPS = showString "NNP$"
    showsPrec _ PRPS = showString "PRP$"
    showsPrec _ tag  = shows tag

暖伴 2024-12-12 03:48:31

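You're abusing Read here.

Show and Read are meant to print and parse valid Haskell values, to enable debugging and the like. This doesn't always work perfectly (e.g. if you import Data.Map qualified and then call show on a Map value, the call to fromList in the output isn't qualified), but it's a valid starting point.

If you want to print or parse your values to match some specific format, use a pretty-printing library for the former and an actual parsing library (e.g. uu-parsinglib, polyparse, parsec, etc.) for the latter. They typically have much nicer support for parsing than ReadS (though ReadP in GHC isn't too bad).

You may argue that this isn't necessary because it's just a quick and dirty hack, but quick and dirty hacks have a tendency to linger. Do yourself a favour and do it right the first time: it means there's less to rewrite when you want to do it "properly" later on.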
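As a rough illustration of that separation, here is a minimal sketch that leaves the derived Show and Read instances alone (so they keep speaking Haskell syntax) and handles the lexicon format with dedicated functions built on ReadP from base. The names tagTable, renderTag, tagP and parseTag are made up for this example, and the tag list is truncated just as in the question. A side benefit is that it avoids a hand-written Show fallback like `shows tag`, which would end up calling itself.

```haskell
import Data.Maybe (listToMaybe)
import Text.ParserCombinators.ReadP (ReadP, choice, eof, readP_to_S, string)

-- Truncated tag set; NNPS and PRPS stand in for NNP$ and PRP$.
data Tag = CC | CD | DT | NNPS | PRPS
  deriving (Show, Read, Eq)

-- Single table relating each tag to its spelling in the lexicon (assumed format).
tagTable :: [(String, Tag)]
tagTable =
  [ ("CC", CC), ("CD", CD), ("DT", DT), ("NNP$", NNPS), ("PRP$", PRPS) ]

-- Render a tag in lexicon format; falls back to the derived show
-- if a tag is missing from the table.
renderTag :: Tag -> String
renderTag t =
  case [s | (s, t') <- tagTable, t' == t] of
    (s : _) -> s
    []      -> show t

-- Accept exactly one tag spelling followed by end of input.
tagP :: ReadP Tag
tagP = choice [string s >> return t | (s, t) <- tagTable] <* eof

-- Nothing for unknown spellings or trailing input.
parseTag :: String -> Maybe Tag
parseTag = fmap fst . listToMaybe . readP_to_S tagP
```

With this in place, parseTag "NNP$" and renderTag NNPS deal with the lexicon spelling, while show and read still round-trip the constructor name NNPS.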

任性一次 2024-12-12 03:48:31

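Don't use the Haskell lexer then. The read functions are built on the ReadP and ReadPrec parser combinators that ship with base (not Parsec itself, although the style is similar); the Real World Haskell book has an excellent introduction to Parsec-style parsing.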
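To make the problem with the lexer concrete, here is a small sketch. lex reads one Haskell lexeme, and `$` is not an identifier character, so the token ends at "NNP" and the pattern "NNP$" in the question never matches. The readsTag function is a hypothetical alternative for illustration only: it matches the literal spellings with stripPrefix instead of going through lex.

```haskell
import Data.List (stripPrefix)

-- Truncated tag set; NNPS and PRPS stand in for NNP$ and PRP$.
data Tag = CC | CD | NNPS | PRPS
  deriving (Show, Eq)

-- What the question's pattern actually sees: lex stops at the end of the
-- identifier, so the token is "NNP" and the "$" stays in the remainder.
lexedToken :: [(String, String)]
lexedToken = lex "NNP$"   -- [("NNP","$")]

-- Hypothetical alternative: skip leading spaces, then match each literal
-- spelling against the raw input and return whatever follows it.
readsTag :: ReadS Tag
readsTag input =
  [ (tag, rest)
  | (spelling, tag) <- [("NNP$", NNPS), ("PRP$", PRPS), ("CC", CC), ("CD", CD)]
  , Just rest <- [stripPrefix spelling (dropWhile (== ' ') input)]
  ]
```

For example, readsTag "NNP$" gives [(NNPS,"")].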

Here's some code that seems to work:

import Text.ParserCombinators.ReadP (string)
import Text.Read (Read (..), ReadPrec, choice, lift)

data Tag = CC | CD | DT | EX | FW | IN | JJ | JJR | JJS deriving (Show)

-- Turn (literal text, value) pairs into parsers that match the text exactly.
strValMap :: [(String, a)] -> [ReadPrec a]
strValMap = map (\(x, y) -> lift $ string x >> return y)

instance Read Tag where
    -- choice here is the ReadPrec one: it tries every alternative.
    readPrec = choice $ strValMap [
        ("CC", CC),
        ("CD", CD),
        ("JJ$", JJS)
        ]

Just run it with:

(read "JJ$") :: Tag

The code is pretty self-explanatory. The parser string x matches the literal text x; if it succeeds (these parsers fail rather than throw exceptions), y is returned. We use choice to select among all of these. choice explores every alternative, so if you add a CCC constructor, the CC parser will match only the first two characters of "CCC", read will discard that result because input is left over, and the CCC alternative is the one that survives. Of course, if you don't need this, use the left-biased <++ combinator, which commits to the first alternative that produces a result.
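The difference between the two kinds of choice can be sketched with the overlapping tags JJ and JJR, using ReadP directly. The helper fullParses is made up for this example; it keeps only results that consumed the whole input, which mirrors the leftover-input check that read performs.

```haskell
import Text.ParserCombinators.ReadP (ReadP, readP_to_S, string, (+++), (<++))

data Tag = JJ | JJR
  deriving (Show, Eq)

jj, jjr :: ReadP Tag
jj  = string "JJ"  >> return JJ
jjr = string "JJR" >> return JJR

-- Keep only the results that consumed the whole input.
fullParses :: ReadP a -> String -> [a]
fullParses p s = [x | (x, "") <- readP_to_S p s]

-- Symmetric choice: both alternatives are explored, so "JJR" still parses
-- even though jj also matches its first two characters.
symmetric :: String -> [Tag]
symmetric = fullParses (jj +++ jjr)

-- Left-biased choice: jjr is consulted only if jj produces no result at all,
-- so on "JJR" the parser commits to JJ and the leftover "R" sinks the parse.
leftBiased :: String -> [Tag]
leftBiased = fullParses (jj <++ jjr)
```

Here symmetric "JJR" yields [JJR], while leftBiased "JJR" yields []. With <++ the longer spelling has to come first.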