Word count in Haskell
I'm working on this exercise:
Given a phrase, count the occurrences of each word in that phrase.
For the purposes of this exercise you can expect that a word will always be one of:
A number composed of one or more ASCII digits (ie "0" or "1234") OR
A simple word composed of one or more ASCII letters (ie "a" or "they") OR
A contraction of two simple words joined by a single apostrophe (ie "it's" or "they're")
When counting words you can assume the following rules:
The count is case insensitive (ie "You", "you", and "YOU" are 3 uses of the same word)
The count is unordered; the tests will ignore how words and counts are ordered
Other than the apostrophe in a contraction all forms of punctuation are ignored
The words can be separated by any form of whitespace (ie "\t", "\n", " ")
For example, for the phrase "That's the password: 'PASSWORD 123'!", cried the Special Agent.\nSo I fled. the count would be:
that's: 1
the: 2
password: 2
123: 1
cried: 1
special: 1
agent: 1
so: 1
i: 1
fled: 1
My code:
    module WordCount (wordCount) where

    import qualified Data.Char as C
    import qualified Data.List as L
    import Text.Regex.TDFA as R

    wordCount :: String -> [(String, Int)]
    wordCount xs =
      do
        ys <- words xs
        let zs = R.getAllTextMatches (ys =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
        g <- L.group $ L.sort [map (C.toLower) w | w <- zs]
        return (head g, length g)
But it fails on the input "one fish two fish red fish blue fish". It outputs one count for each word, even the repeated ones, as if the sort and group aren't doing anything. Why?
I've read this answer, which basically does the same thing in a more advanced way using Control.Arrow.
Comments (2)
You don't need to use words to split the line; the regex should achieve the desired splitting:
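A minimal sketch of that suggestion, not the answerer's original code. It assumes the regex-tdfa package is installed, and the pattern in wordPattern is my own assumption, written with plain POSIX bracket expressions: a run of digits, or a run of letters optionally followed by one apostrophe and more letters.

```haskell
import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Assumed pattern (not from the original post): digits, or letters with an
-- optional single-apostrophe contraction such as "that's".
wordPattern :: String
wordPattern = "[0-9]+|[a-zA-Z]+('[a-zA-Z]+)?"

wordCount :: String -> [(String, Int)]
wordCount xs =
  -- L.group never produces an empty list, so `head g` is safe here.
  [ (head g, length g)
  | g <- L.group (L.sort (map (map C.toLower) allMatches))
  ]
  where
    -- Match against the whole input at once; no call to `words` is needed,
    -- because whitespace and punctuation simply never match the pattern.
    allMatches = getAllTextMatches (xs =~ wordPattern) :: [String]
```

Because every match is collected before sorting and grouping, repeated words end up next to each other and are counted together.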
You're splitting the input xs into words by whitespace using words. You iterate over these in the list monad with the binding statement ys <- …. Then you split each of those words into subwords using the regular expression, of which there happens to be only one match per word in your example. So you sort and group the subwords of each word in a list by themselves, and every group ends up with exactly one element.
I believe you can essentially just delete the initial call to words:
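To see the effect in isolation, here is a small sketch that needs only base. The function matches is a hypothetical stand-in for the regex step (it just takes maximal runs of letters and digits, so contractions are not handled), and perWord and wholeInput are names made up for this illustration.

```haskell
import qualified Data.Char as C
import qualified Data.List as L

-- Hypothetical stand-in for getAllTextMatches: maximal runs of
-- alphanumeric characters (a simplification, not the original regex).
matches :: String -> [String]
matches s = case dropWhile (not . C.isAlphaNum) s of
  "" -> []
  s' -> let (w, rest) = span C.isAlphaNum s' in w : matches rest

-- Same shape as the code in the question: the bind over `words xs` runs
-- everything below it once per word, so each sort/group sees one word only.
perWord :: String -> [(String, Int)]
perWord xs = do
  ys <- words xs
  let zs = matches ys
  g <- L.group $ L.sort [map C.toLower w | w <- zs]
  return (head g, length g)

-- The bind over `words` removed: a single sort/group over all matches.
wholeInput :: String -> [(String, Int)]
wholeInput xs = do
  let zs = matches xs
  g <- L.group $ L.sort [map C.toLower w | w <- zs]
  return (head g, length g)
```

For the input "one fish two fish red fish blue fish", perWord yields eight pairs that each have count 1, while wholeInput yields a single ("fish", 4) pair alongside the other four words.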