Code Golf: Quickly Build List of Keywords from Text, Including # of Instances
I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
- Return only words with an occurrence greater than X
- Return only words with a length greater than Y
- Ignore common terms like "and, is, the, etc"
- Feel free to strip punctuation prior to processing (ie. "John's" becomes "John")
- Return results in a collection/array
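The five requirements above can be sketched in a few lines of Python. This is only a minimal illustration, not the asker's PHP solution: the function name `keywords`, the `STOP_WORDS` set, and the sample sentence are assumptions made up for the example, and "greater than" is read as a strict comparison.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the question only names "and, is, the" as examples.
STOP_WORDS = {"and", "is", "the", "a", "of", "to", "in"}

def keywords(text, min_count, min_length, stop_words=STOP_WORDS):
    """Return {word: count} for words that pass all three filters."""
    # Drop possessive suffixes so "John's" becomes "John", then lowercase.
    text = re.sub(r"['’]s\b", "", text).lower()
    # Treat every run of letters as a word; all other punctuation is discarded.
    words = re.findall(r"[a-z]+", text)
    counts = Counter(w for w in words
                     if len(w) > min_length and w not in stop_words)
    # Keep only words that occur more than min_count times.
    return {w: n for w, n in counts.items() if n > min_count}

sample = "John's dog and John's cat. The dog is John's; the cat is not."
result = keywords(sample, min_count=1, min_length=2)
print(result)
```

The result is a plain dictionary, which covers the "collection/array" requirement; sorting by count can be layered on top if needed.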
Extra Credit
- Keep Quoted Statements together, (ie. "They were 'too good to be true' apparently")
Where 'too good to be true' would be the actual statement
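One possible way to keep quoted statements together is to let a regular expression match a whole single-quoted span before falling back to individual words. The sketch below assumes single quotes delimit statements, as in the example sentence, and that a quote opens only after whitespace or at the start of the text; `tokens_with_quotes` is a hypothetical name.

```python
import re

# First alternative: a quoted statement. Second alternative: a plain word,
# optionally with an inner apostrophe (so "John's" stays one token).
TOKEN = re.compile(
    r"(?:(?<=\s)|^)'([^']+)'(?=[\s.,;:!?]|$)|([A-Za-z]+(?:'[A-Za-z]+)?)"
)

def tokens_with_quotes(text):
    """Yield quoted statements whole and everything else word by word."""
    for quoted, word in TOKEN.findall(text):
        # Exactly one of the two groups is non-empty for each match.
        yield quoted if quoted else word.lower()

tokens = list(tokens_with_quotes("They were 'too good to be true' apparently"))
print(tokens)
```

The resulting tokens can be fed to any counting routine in place of single words.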
Extra-Extra Credit
- Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
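A simple, admittedly naive, way to approach this is to count adjacent word pairs and report the ones that recur. The sketch below assumes a small hypothetical stop-word set (otherwise "the fruit" would score as high as "fruit fly") and never pairs words across a sentence boundary; `frequent_pairs` is an illustrative name, and a real solution would likely need a statistical association measure rather than a raw count.

```python
import re
from collections import Counter

# Hypothetical stop words, chosen for the sample paragraph only.
STOP_WORDS = {"the", "is", "a", "and", "to", "in", "on", "it", "has", "be", "but", "our"}

def frequent_pairs(text, min_count):
    """Return adjacent word pairs that occur at least min_count times."""
    pairs = Counter()
    # Split into sentences first so a pair never spans a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z]+", sentence)
        for first, second in zip(words, words[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                pairs[(first, second)] += 1
    return {" ".join(p): n for p, n in pairs.items() if n >= min_count}

text = ("The fruit fly is a great thing when it comes to medical research. "
        "Much study has been done on the fruit fly in the past, and has lead "
        "to many breakthroughs. In the future, the fruit fly will continue to "
        "be studied, but our methods may change.")
found = frequent_pairs(text, 3)
print(found)
```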
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
- It would be great to see the results of your code, output, in addition to how long the operation lasted.
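For reporting how long the operation lasted, a small timing wrapper is usually enough. This is a sketch using Python's `time.perf_counter`; the helper name `timed` and the sample input are assumptions for illustration.

```python
import time
from collections import Counter

def timed(func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

result, seconds = timed(Counter, "the fruit fly the fruit".split())
print(result.most_common(2), f"{seconds:.6f}s")
```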
GNU scripting
| sort | uniq -c | sort -nr
Results:
With occurrence greater than X:
| sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
| grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored'):
| grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (ie. "John's" becomes "John"):
| sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
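For readers less familiar with the shell idiom, the counting stage `sort | uniq -c | sort -nr` can be mirrored in Python roughly as follows. This is a sketch that assumes the input has already been split to one word per line; `count_words` is a hypothetical helper, not part of the answer above.

```python
from collections import Counter

def count_words(lines):
    """Mimic `sort | uniq -c | sort -nr` for one word per input line."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # most_common() orders by descending count, like `sort -nr` on the count column.
    return [(n, word) for word, n in counts.most_common()]

rows = count_words(["fly", "fruit", "fly", "", "research", "fly", "fruit"])
for n, word in rows:
    print(f"{n:7d} {word}")

# Equivalent of the awk '$1>X' stage with X = 1.
above = [row for row in rows if row[0] > 1]
```

As in the shell version, each row carries the count first and the word second.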
Perl in only 43 characters.
Here is an example of its use:
If you need to list only the lowercase versions, it requires two more characters.
For it to work on the specified text requires 58 characters.
Here is the last example expanded a bit.
F#: 304 chars
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
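The inject pattern described above has a close analogue in Python's `functools.reduce`: a `defaultdict(int)` plays the role of the Hash with a default of 0. The sketch below is an illustration of that idea only, not a translation of the Ruby code; the names `tally` and `frequent`, and the reading of the minimum frequency as "at least", are assumptions.

```python
from collections import defaultdict
from functools import reduce

def tally(counts, word):
    counts[word] += 1
    return counts  # reduce needs the accumulator handed back on each step

def frequent(words, min_count, skip=frozenset()):
    """Fold the words into a count table, then select by minimum frequency."""
    counts = reduce(tally, (w for w in words if w not in skip), defaultdict(int))
    return {w: n for w, n in counts.items() if n >= min_count}

picked = frequent(["a", "b", "a", "the", "a", "b"], 2, skip={"the"})
print(picked)
```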
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
Test Rig
Test Results
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
This is however far from the most efficient method, being O(n^2) with the number of words, rather than O(n), which is optimal in this case I believe. I'll see if I can create a slightly longer method that is more efficient.
Here are the results of the function run on the sample text (min occurrences: 3, min length: 2).
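The difference between the two complexity classes can be illustrated with a short Python comparison. This is not the C# code from this answer, only a sketch of the general point: recounting the whole list for every distinct word is quadratic in the worst case, while a single pass with a hash map is linear in expected time. Both function names are made up for the example.

```python
from collections import Counter

def counts_quadratic(words):
    # list.count rescans the whole list for every distinct word: O(n^2) worst case.
    return {w: words.count(w) for w in set(words)}

def counts_linear(words):
    # One pass over the list with a hash map: O(n) expected time.
    return dict(Counter(words))

words = "the fruit fly and the other fruit fly".split()
print(counts_linear(words))
```

Both functions return the same table; only the amount of work differs as the input grows.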
And my test program:
That's the simple form. If you want sorting, filtering, etc.:
You can also sort the output pretty easily:
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
Or if I needed it to run faster I might even write it like this:
It uses map for efficiency,
grep to remove extra elements,
and sort to do the sorting, of course. (It does so in that order.)
This is a slight variant of the Schwartzian transform.
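The map, grep, sort sequence can be sketched in Python as decorate, filter, sort, undecorate. This is an illustration of the pattern only, not the Perl code from this answer; `top_words` and its parameters are hypothetical, and in Python the dictionary lookup is cheap enough that the decoration step is mostly for show.

```python
def top_words(counts, min_count, min_length):
    # Decorate: pair each word with its sort key (the count) once, up front.
    decorated = map(lambda w: (counts[w], w), counts)
    # Filter: drop pairs that fail the thresholds (the role grep plays in Perl).
    kept = filter(lambda p: p[0] > min_count and len(p[1]) > min_length, decorated)
    # Sort on the precomputed key: highest count first, ties alphabetical.
    ordered = sorted(kept, key=lambda p: (-p[0], p[1]))
    # Undecorate: strip the key again and return only the words.
    return [w for _, w in ordered]

counts = {"fly": 3, "fruit": 3, "research": 2, "the": 5, "is": 4}
ranked = top_words(counts, 1, 2)
print(ranked)
```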
Another Python solution, at 247 chars. The actual code is a single, highly dense line of Python of 134 chars that computes the whole thing in a single expression.
A much longer version with plenty of comments for your reading pleasure:
The main trick here is using the itertools.groupby function to count the occurrences on a sorted list. Don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
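Since the original code did not survive on this page, here is a minimal sketch of the groupby trick as described: sort the words so equal ones become adjacent, then count the length of each group. The function name and threshold semantics (strictly greater than) are assumptions.

```python
from itertools import groupby

def count_with_groupby(words, min_count):
    """Count occurrences via groupby on a sorted list, in one expression."""
    # groupby only merges adjacent equal items, hence the sort first.
    return [(w, n)
            for w, n in ((w, sum(1 for _ in g)) for w, g in groupby(sorted(words)))
            if n > min_count]

pairs = count_with_groupby(["b", "a", "b", "c", "a", "b"], 1)
print(pairs)
```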
Results:
C# code:
Output for ProcessText(text, 3, 2) call:
In C#:
Use LINQ, specifically groupby, then filter by group count, and return a flattened (selectmany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
The output is:
Python (258 chars as is, including 66 chars for the first line and 30 chars for punctuation removal):
Output:
Here is my variant, in PHP:
And output:
The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.
For the supplied file this code finishes in 0.047 seconds, though a larger file will consume lots of memory (because of the file function).
This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
Output: