Code Golf: Quickly build a keyword list from text, including the number of instances

Posted on 2024-07-25 13:14:43

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently, or even better. The two languages I'm primarily interested in are PHP and JavaScript, but I'd also be interested in seeing how quickly this could be done in any other major language today (mostly C#, Java, etc.).

  1. Return only words with an occurrence greater than X
  2. Return only words with a length greater than Y
  3. Ignore common terms such as "and", "is", "the", etc.
  4. Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John")
  5. Return results in a collection/array

Extra Credit

  1. Keep quoted statements together (e.g. "They were 'too good to be true' apparently"),
    where 'too good to be true' would be the actual statement

Extra-Extra Credit

  1. Can your script determine words that should be kept together, based on how frequently they are found together? This should be done without knowing the words beforehand. Example:

    *"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has led to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*

    Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?

Source text: http://sampsonresume.com/labs/c.txt

Answer Format

  1. It would be great to see the output of your code, in addition to how long the operation took.
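
As an illustration of the "fruit fly" requirement only, here is a minimal Python sketch that counts adjacent word pairs and keeps the ones that repeat. The function name `frequent_pairs`, the small stop-word set and the `min_count` threshold are assumptions made for this example; they are not part of the question.

```python
import re
from collections import Counter

# Hypothetical stop-word set, chosen only for this illustration.
STOP_WORDS = frozenset({"the", "a", "is", "to", "on", "in", "and", "has", "be", "it"})

def frequent_pairs(text, min_count=2, stop_words=STOP_WORDS):
    """Return adjacent word pairs that occur at least min_count times."""
    pairs = Counter()
    # Split into sentences first so a pair never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z']+", sentence)
        for first, second in zip(words, words[1:]):
            # Skip pairs that contain a stop word such as "the fruit".
            if first not in stop_words and second not in stop_words:
                pairs[(first, second)] += 1
    return {" ".join(p): n for p, n in pairs.items() if n >= min_count}

sample = ("The fruit fly is a great thing when it comes to medical research. "
          "Much study has been done on the fruit fly in the past, and has led "
          "to many breakthroughs. In the future, the fruit fly will continue "
          "to be studied, but our methods may change.")
print(frequent_pairs(sample))
```

On the sample paragraph, "fruit fly" is the only pair that repeats. Real text would likely need a larger stop-word list and a threshold scaled to the document length.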


和我恋爱吧 2024-08-01 13:14:43

GNU scripting

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr

Results:

  7 be
  6 to
[...]
  1 2.
  1 -

With occurrence greater than X:

sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'

Return only words with a length greater than Y (put Y+1 dots in second grep):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c

Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored'):

sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c

Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"). The sed script is double-quoted here because a single quote cannot be escaped inside a single-quoted shell string; note that it removes only the apostrophe itself, so "John's" actually becomes "Johns":

sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c

Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
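
For comparison, the stages of the shell pipeline (split into words, drop blanks, filter by length, drop ignored terms, count, sort by count) can be sketched in Python. This is only an illustration; the function name `word_counts` and its parameters are assumptions, not part of the answer.

```python
from collections import Counter

def word_counts(text, min_count=0, min_len=0, ignored=()):
    """Return (count, word) pairs, most frequent first."""
    # Like sed 's/ /\n/g' | grep -v '^ *$': one word per item, no blanks.
    words = text.split()
    # Like the grep with Y+1 dots: keep words longer than min_len.
    words = [w for w in words if len(w) > min_len]
    # Whole-word comparison; grep -vf without -x or -w also drops lines
    # that merely contain an ignored term as a substring.
    words = [w for w in words if w not in ignored]
    # Like sort | uniq -c | sort -nr, followed by awk '$1>X'.
    counts = Counter(words)
    return [(n, w) for w, n in counts.most_common() if n > min_count]

for count, word in word_counts("to be or not to be", min_count=1):
    print(count, word)
```

As in the shell version, the first column is the count and the second is the word.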

离去的眼神 2024-08-01 13:14:43

Perl in only 43 characters.

perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'

Here is an example of its use:

echo a a a b b c  d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'

---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1

If you need to list only the lowercase versions, it requires two more characters.

perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'

Making it work on the specified text requires 58 characters.

curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real    0m0.679s
user    0m0.304s
sys     0m0.084s

Here is the last example expanded a bit.

#! perl
use 5.010;
use YAML;

while( my $line = <> ){
  for my $elem ( split '\W+', $line ){
    $_{ lc $elem }++
  }
  END{
    say Dump \%_;
  }
}
晨光如昨 2024-08-01 13:14:43

F#: 304 chars

let f =
    let bad = Set.ofSeq ["and";"is";"the";"of";"are";"by";"it"]
    fun length occurrence msg ->
        System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
        |> Seq.countBy (fun a -> a)
        |> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
樱花坊 2024-08-01 13:14:43

Ruby

When "minified", this implementation becomes 165 characters long. It uses Array#inject with a starting value (a Hash object with a default of 0) and loops through the elements, accumulating a count for each word in the hash; the result is then filtered by the minimum frequency.

Note that I didn't count the size of the list of words to skip, since that is an external constant. When the constant is counted too, the solution is 244 characters long.

Apostrophes and dashes aren't stripped but kept as part of the word; they modify the word, so they can't simply be stripped without also discarding everything after the symbol.

Implementation

CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
  text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
    inject(Hash.new(0)) do |result,w|
      w.downcase!
      result[w] += 1 unless CommonWords.include?(w)
      result
    end.select { |k,n| n >= minFreq }
end

Test Rig

require 'net/http'

keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }

Test Results

code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
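
The choice to keep apostrophes and dashes inside words can be sketched in Python as follows. This is a loose illustration of the same idea rather than a port of the Ruby code; the name `keywords_with_symbols` is an assumption, and the common-word list is copied from the constant above.

```python
import re
from collections import Counter

COMMON_WORDS = frozenset(
    "the a an but and is not or as of to in for by be may has can its it's".split()
)

def keywords_with_symbols(text, min_freq=0, min_len=2):
    """Count words, treating apostrophes and hyphens as part of a word."""
    # Runs of letters, apostrophes and hyphens, at least min_len long,
    # delimited by word boundaries on both sides.
    pattern = re.compile(r"\b[a-z'-]{%d,}\b" % min_len, re.IGNORECASE)
    counts = Counter(w.lower() for w in pattern.findall(text))
    return {w: n for w, n in counts.items()
            if n >= min_freq and w not in COMMON_WORDS}

print(keywords_with_symbols("John's well-known code. John's code is the code.", 2))
```

Here "John's" stays one token instead of collapsing to "John" or "Johns", and "well-known" is counted as a single word.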
睡美人的小仙女 2024-08-01 13:14:43

C# 3.0 (with LINQ)

Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.

public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
    var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"};
    var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
    var occurrences = words.Distinct().Except(commonWords).Select(w =>
        new { Word = w, Count = words.Count(s => s == w) });
    return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
        .ToDictionary(wo => wo.Word, wo => wo.Count);
}

This is, however, far from the most efficient method: it is O(n^2) in the number of words, rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.

Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).

  3 x such
  4 x code
  4 x which
  4 x declarations
  5 x function
  4 x statements
  3 x new
  3 x types
  3 x keywords
  7 x statement
  3 x language
  3 x expression
  3 x execution
  3 x programming
  4 x operators
  3 x variables

And my test program:

static void Main(string[] args)
{
    string sampleText;
    using (var client = new WebClient())
        sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
    var keywords = GetKeywords(sampleText, 3, 2);
    foreach (var entry in keywords)
        Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
    Console.ReadKey(true);
}
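
The O(n) alternative mentioned in this answer is not shown there. One way it might look, sketched in Python rather than C#, is a single counting pass over the words using a hash map. The name `count_keywords_linear` is an assumption for this example; the common-word list and punctuation set are taken from the C# code.

```python
import re
from collections import Counter

COMMON_WORDS = frozenset({"and", "is", "the", "as", "of", "to", "or", "in",
                          "for", "by", "an", "be", "may", "has", "can", "its"})

def count_keywords_linear(text, min_count, min_length):
    """Count each word once in a single pass instead of rescanning per word."""
    # Same punctuation set as the C# regex; split() breaks on any
    # whitespace, whereas the C# version splits on spaces only.
    cleaned = re.sub(r"[,.?/;:()]", "", text.lower())
    counts = Counter(cleaned.split())  # one pass: O(n) in the number of words
    return {word: n for word, n in counts.items()
            if n >= min_count
            and len(word) >= min_length
            and word not in COMMON_WORDS}

print(count_keywords_linear("The code, the code; and more code.", 2, 2))
```

The quadratic cost in the LINQ version comes from calling Count over the whole word list once per distinct word; a dictionary of running totals avoids the rescans.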
美人骨 2024-08-01 13:14:43
#! perl
use strict;
use warnings;

my %words;  # word => number of occurrences

while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  print "$word occurred $words{$word} times.\n";
}

That's the simple form. If you want sorting, filtering, etc.:

# $MINLEN and $MIN_OCCURRENCE are thresholds assumed to be declared earlier
while (<>) {
  for my $word (split) {
    $words{$word}++;
  }
}
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    print "$word occurred $words{$word} times.\n";
  }
}

You can also sort the output pretty easily:

...
my @output;
for my $word (keys %words) {
  if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
    push @output, "$word occurred $words{$word} times.\n";
  }
}
my $re = qr/occurred (\d+) /;
print sort {
  # capture the counts in list context; never assign to $a or $b
  my ($x) = $a =~ $re;
  my ($y) = $b =~ $re;
  $x <=> $y
} @output;

A true Perl hacker will easily get these on one or two lines each, but I went for readability.


Brad

Edit: this is how I would rewrite this last example

...
for my $word (
  # highest counts first, so the loop can stop at the first count below the threshold
  sort { $words{$b} <=> $words{$a} } keys %words
){
  next unless length($word) >= $MINLEN;
  last unless $words{$word} >= $MIN_OCCURRENCE;

  print "$word occurred $words{$word} times.\n";
}

Or if I needed it to run faster I might even write it like this:

for my $word_data (
  sort {
    $a->[1] <=> $b->[1] # numerical sort on count
  } grep {
    # remove values that are out of bounds
    length($_->[0]) >= $MINLEN &&      # word length
    $_->[1] >= $MIN_OCCURRENCE # count
  } map {
    # [ word, count ]
    [ $_, $words{$_} ]
  } keys %words
){
  my( $word, $count ) = @$word_data;
  print "$word occurred $count times.\n";
}

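It uses map to look up each count once and pair it with its word,
grep to remove the out-of-bounds elements,
and sort to do the sorting, of course. (The chain reads bottom to top, so it runs in that order: map, then grep, then sort.)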

This is a slight variant of the Schwartzian transform.
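
For readers who do not read Perl, the same decorate, filter, sort pipeline can be sketched in Python. This is only an illustration of the idea, not the answer's code: the function name `frequent_words` and the sample counts are made up for the example.

```python
def frequent_words(counts, min_len, min_count):
    # Decorate: pair every word with its count once, up front.
    pairs = [(word, n) for word, n in counts.items()]
    # Filter: drop pairs that are out of bounds.
    kept = [(word, n) for word, n in pairs
            if len(word) >= min_len and n >= min_count]
    # Sort: order by the cached count (ties broken by word for stable output).
    return sorted(kept, key=lambda pair: (pair[1], pair[0]))


# Hypothetical counts, standing in for the %words hash.
counts = {"statement": 7, "function": 5, "be": 7, "code": 4, "goto": 1}
for word, n in frequent_words(counts, min_len=3, min_count=4):
    print("%s occurred %d times." % (word, n))
```

Filtering before sorting keeps the sort small, which is the same reason the Perl version puts grep between map and sort.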

不疑不惑不回忆 2024-08-01 13:14:43

Another Python solution, at 247 chars. The actual code is a single, highly dense 134-char line of Python that computes the whole thing in a single expression.

x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
    gb(sorted(open("c.txt").read().lower().split())))
    if l>x and len(w)>y and w not in W)

A much longer version with plenty of comments for your reading pleasure:

# Thresholds: keep words that occur more than x times and are longer than
# y characters.
x = 3
y = 2

# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()

# groupby collapses runs of equal adjacent items into (string, grouper) pairs.
# The grouper is an iterator over the occurrences (see below).
from itertools import groupby

# Reads the entire file, converts it to lower case, splits on whitespace
# to create a list of words, and sorts it so that equal words are adjacent.
sortedWords = sorted(open("c.txt").read().lower().split())

# Using the groupby function, groups equal words together.
# Since the grouper is an iterator over the occurrences, we use
# len(list(grouper)) to get the word count: convert the iterator to a list,
# then take the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))

# Filters the words by number of occurrences, length, and common words using
# yet another generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)

# Creates a dictionary from the (word, count) pairs.
result = dict(filteredWordCounts)

print(result)

The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.

Results:

{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
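
The groupby trick only works because the list is sorted first: groupby merges adjacent equal items, nothing more. A small sketch on a made-up sentence (not the c.txt source) shows that it produces the same counts as `collections.Counter`, which tallies in one pass without sorting:

```python
from collections import Counter
from itertools import groupby

# Made-up sample text, used only for illustration.
words = "the cat saw the dog and the cat ran".split()

# groupby only merges adjacent equal items, hence the sort first.
grouped = dict((w, len(list(g))) for w, g in groupby(sorted(words)))

# Counter tallies in a single pass and needs no sorting.
counted = Counter(words)

print(grouped == dict(counted))
print(grouped["the"], grouped["cat"])
```

Counter is the more conventional choice outside of golf; the groupby form is mainly interesting because it fits inside one expression.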
夜清冷一曲。 2024-08-01 13:14:43

C# code:

IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
    // common words, that will be ignored
    var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
    // regular expression to find quoted text
    var regex = new Regex("\"[^\"]+\"", RegexOptions.Compiled);

    return
        // remove quoted text (it will be processed later)
        regex.Replace(text, "")
        // remove case dependency
        .ToLower()
        // split text by all these chars
        .Split(".,'\\/[]{}()`~@#$%^&*-=+?!;:<>| \n\r".ToCharArray())
        // add quoted text
        .Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
        // group words by the word and count them
        .GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
        // apply filter(min word count and word length) and remove common words 
        .Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}

Output for ProcessText(text, 3, 2) call:

3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
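
The quoted-text handling above (pull the quoted phrases out, split the rest, then add the phrases back as single tokens) can be sketched in Python as follows. This is a simplified illustration under assumptions of its own: the helper name `tokens_with_phrases` is hypothetical, the phrases are lower-cased and returned without their quote marks, and the remainder is tokenized on ASCII letters only, none of which matches the C# code exactly.

```python
import re

# One or more non-quote characters between a pair of double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def tokens_with_phrases(text):
    # Collect the quoted phrases first so they survive as single tokens.
    phrases = QUOTED.findall(text)
    # Remove them from the text, then split what is left into plain words.
    remainder = QUOTED.sub(" ", text)
    words = re.findall(r"[a-z]+", remainder.lower())
    return words + [p.lower() for p in phrases]


sample = 'They were "too good to be true" apparently'
print(tokens_with_phrases(sample))
```

Note the `+` in the pattern: without it the expression would only match a single character between quotes.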
浅紫色的梦幻 2024-08-01 13:14:43

In C#:

  1. Use LINQ, specifically GroupBy, then filter by group count, and return a flattened (SelectMany) list.

  2. Use LINQ, filter by length.

  3. Use LINQ, filter with 'badwords'.Contains.
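
The three steps listed above map onto a count followed by three filters. A rough Python equivalent, offered as a sketch rather than a translation of any actual LINQ query (the function name `keywords` and its parameters are invented for the example):

```python
from collections import Counter


def keywords(words, min_count, min_len, bad_words):
    counts = Counter(words)        # step 1: group equal words and count them
    return dict(
        (w, n) for w, n in counts.items()
        if n >= min_count          # step 1: filter by group count
        and len(w) >= min_len      # step 2: filter by length
        and w not in bad_words     # step 3: filter out the common words
    )


sample = "the fruit fly is the fruit fly of the lab".split()
print(keywords(sample, 2, 3, {"the", "is", "of"}))
```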

好菇凉咱不稀罕他 2024-08-01 13:14:43

REBOL

Verbose, perhaps, so definitely not a winner, but gets the job done.

min-length: 0
min-count: 0

common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]

add-word: func [
    word [string!]
    /local
        count
        letter
        non-letter
        temp
        rules
        match
][    
    ; Strip out punctuation
    temp: copy {}
    letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
    non-letter: complement letter
    rules: [
        some [
            copy match letter (append temp match)
            |
            non-letter
        ]
    ]
    parse/all word rules
    word: temp

    ; If we end up with nothing, bail
    if 0 == length? word [
        exit
    ]

    ; Check length
    if min-length > length? word [
        exit
    ]

    ; Ignore common words
    ignore: 
    if find common-words word [
        exit
    ]

    ; OK, its good. Add it.
    either found? count: select words word [
        words/(word): count + 1
    ][
        repend words [word 1]
    ]
]

rules: [
    some [
        {"}
        copy word to {"} (add-word word)
        {"}
        |
        copy word to { } (add-word word)
        { }
    ]
    end
]

words: copy []
parse/all read %c.txt rules

result: copy []
foreach word words [
    if string? word [
        count: words/:word
        if count >= min-count [
            append result word
        ]
    ]
]

sort result
foreach word result [ print word ]

The output is:

act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
年少掌心 2024-08-01 13:14:43

Python 2 (258 chars as is, including 66 chars for the first line and 30 chars for punctuation removal):

W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
    for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
        if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
    if n>y and len(w)>x : print n,w

output :

4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
傲娇萝莉攻 2024-08-01 13:14:43

Here is my variant, in PHP:

$str = implode(file('c.txt'));

$res = array(); // token => number of occurrences
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );

foreach($array as $key) {
    // count each token once; isset avoids an undefined-index notice
    $res[$key] = isset($res[$key]) ? $res[$key] + 1 : 1;
}

unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);

arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
    echo $word . ' <b>x</b> ' . $rarity . '<br/>';

foreach ($array as $word) { // words longer than n (=5)
//    if(strlen($word) > 5)echo $word.'<br/>';
}

And output:

statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1

The commented-out var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.

For the supplied file this code finishes in 0.047 seconds, though a larger file will consume a lot of memory (because the file function reads the whole file at once).
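
The memory concern goes away if the words are counted while the file is read line by line instead of being loaded whole. A minimal sketch in Python, assuming plain whitespace splitting (no punctuation or quote handling) and using an in-memory `io.StringIO` as a stand-in for the real file; `count_words` is a name invented for the example:

```python
import io
from collections import Counter


def count_words(lines):
    counts = Counter()
    for line in lines:                 # only one line is held at a time
        counts.update(line.split())
    return counts


# io.StringIO stands in for open("c.txt"); any iterable of lines works.
sample = io.StringIO(u"a statement\nanother statement\n")
print(count_words(sample).most_common(1))
```

Only the table of distinct words grows with the input, so memory use depends on the vocabulary rather than on the file size.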

醉城メ夜风 2024-08-01 13:14:43

This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).

In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.

You might also want to look at Lingua::CollinsParser.

#!/usr/bin/perl

use strict; use warnings;

use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;

my $stop = getStopWords('en');

my %words;

while ( my $line = <> ) {
    chomp $line;
    next unless $line =~ /\S/;
    next unless my @words = parse_line(' ', 1, $line);

    ++ $words{to_S $_} for
        grep { length and not $stop->{$_} }
        map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
        @words;
}

print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;

print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
    $words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;

Output:

=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1