Word count in Haskell

Posted on 2025-02-13 15:22:53

I'm working on this exercise:

Given a phrase, count the occurrences of each word in that phrase.

For the purposes of this exercise you can expect that a word will always be one of:

A number composed of one or more ASCII digits (ie "0" or "1234") OR
A simple word composed of one or more ASCII letters (ie "a" or "they") OR
A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")
When counting words you can assume the following rules:

The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
The count is unordered; the tests will ignore how words and counts are ordered
Other than the apostrophe in a contraction all forms of punctuation are ignored
The words can be separated by any form of whitespace (ie "\t", "\n", " ")
For example, for the phrase "That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled. the count would be:

that's: 1
the: 2
password: 2
123: 1
cried: 1
special: 1
agent: 1
so: 1
i: 1
fled: 1

My code:

module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)

But it fails on the input "one fish two fish red fish blue fish". It outputs one count for each word, even the repeated ones, as if the sort and group aren't doing anything. Why?

I've read this answer, which basically does the same thing in a more advanced way using Control.Arrow.

Comments (2)

神也荒唐 2025-02-20 15:22:53

You don't need to use words to split the line; the regex should achieve the desired splitting on its own:

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

仄言 2025-02-20 15:22:53

wordCount xs =
  do
    ys <- words xs
    let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
    return (head g, length g)

You're splitting the input xs into words by whitespace using words. You then iterate over these in the list monad with the binding statement ys <- …, so everything after that line runs once per word. The regular expression splits each of those words into subwords, of which there happens to be only one per word in your example. As a result, sort and group are applied to each subword in a list by itself, never to the whole input, which is why every word comes out with a count of 1.
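
The effect can be reproduced without the regex library. The following is a minimal sketch using only base. The names perWord and wholeInput are hypothetical, and plain words stands in for the regex match, so it only illustrates the list-monad behaviour and is not a full solution to the exercise.

```haskell
import Data.Char (toLower)
import Data.List (group, sort)

-- Mirrors the structure of the question's code: `w <- words xs` runs the
-- rest of the block once per word, so `sort` and `group` only ever see a
-- one-element list.
perWord :: String -> [(String, Int)]
perWord xs = do
  w <- words xs
  g <- group (sort [map toLower w])
  return (head g, length g)

-- Binding the whole word list with `let` instead lets `sort` and `group`
-- work on all words at once.
wholeInput :: String -> [(String, Int)]
wholeInput xs = do
  let ws = map (map toLower) (words xs)
  g <- group (sort ws)
  return (head g, length g)
```

With the input "one fish two fish", perWord yields four pairs that each have a count of 1, while wholeInput yields ("fish",2) together with single counts for "one" and "two".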

I believe you can essentially just delete the initial call to words:

wordCount xs =
  do
    let ys = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- ys]
    return (head g, length g)
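
For comparison, here is a regex-free sketch of the same idea using only base. It is an alternative approach and not the code from the answers above. The helper tokens is a hypothetical name, and treating leading or trailing apostrophes as quotation marks is an assumption about the exercise rules. One thing that may be worth verifying separately: as far as I know, regex-tdfa follows POSIX syntax and may not recognise Perl-style shorthands such as \d, in which case [0-9] would be the spelling to use.

```haskell
import Data.Char (isAlphaNum, isAscii, toLower)
import Data.List (dropWhileEnd, group, sort)

-- Hypothetical helper: lower-case the input, turn every character that is
-- neither an ASCII letter/digit nor an apostrophe into a space, split on
-- whitespace, then strip apostrophes that only served as quotes.
tokens :: String -> [String]
tokens = filter (not . null) . map trimQuotes . words . map normalise
  where
    normalise c
      | isAscii c && isAlphaNum c = toLower c
      | c == '\''                 = c
      | otherwise                 = ' '
    trimQuotes = dropWhileEnd (== '\'') . dropWhile (== '\'')

-- Sort and group the whole token list at once, then count each group.
-- `head` is safe here because `group` never produces empty lists.
wordCount :: String -> [(String, Int)]
wordCount xs = [(head g, length g) | g <- group (sort (tokens xs))]
```

On "one fish two fish red fish blue fish" this sketch counts "fish" four times and every other word once. The result comes back in sorted order, which is fine because the exercise ignores ordering.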