Word counts for generating a dictionary
I have an indexing process which takes documents from a MySQL database and puts them into Solr.
I want to record the number of occurrences of each word for spelling correction.
I could obviously just dump all the words into one plaintext file as I go, then `sort | uniq -c`
that file and keep all words with counts above some N - is that the way to go, or is there some smarter way?
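The pipeline described above can be sketched as follows. The file name `words.txt`, the sample contents, and the threshold N=2 are all illustrative assumptions, not part of the original question:

```shell
# Illustrative input: one word per line, as dumped during indexing (assumed file name).
printf 'solr\nmysql\nsolr\nsolr\nmysql\nindex\n' > words.txt

# Count occurrences, keep words seen more than N times, most frequent first.
N=2
sort words.txt | uniq -c | awk -v n="$N" '$1 > n {print $2, $1}' | sort -k2 -rn
```

With the sample input this prints `solr 3`, since only `solr` occurs more than twice. For large dumps, `sort` spills to temporary files on disk, so this scales further than one might expect, though a streaming counter would avoid the full sort.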
1 Answer
In Solr you have facets. You can try executing a facet search with `facet.field` pointing to the field that stores the text data you are interested in. You can use `facet.mincount` to cut the output list at a certain frequency level; setting it to at least 1 also excludes zero-count terms from the resulting frequency list (older Solr versions used `facet.zeros=false` for this).
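A facet request along those lines might look like the sketch below. The host, the core name `mycore`, the field name `text`, and the mincount value are all assumptions to be replaced with your own setup; `facet.limit=-1` asks Solr to return every term rather than the default top 100:

```shell
# Hedged sketch of a field-facet query; core and field names are placeholders.
curl 'http://localhost:8983/solr/mycore/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'rows=0' \
  --data-urlencode 'facet=true' \
  --data-urlencode 'facet.field=text' \
  --data-urlencode 'facet.mincount=5' \
  --data-urlencode 'facet.limit=-1'
```

The response's `facet_counts` section then lists each indexed term in the `text` field alongside its document frequency, which you can feed directly into your dictionary instead of a `sort | uniq -c` pass. Note that field faceting counts documents containing a term, not total occurrences, so it approximates rather than replaces a true word count.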