Sentence analysis and tokenization algorithm
I need to analyze a document and compile statistics on how many times each sequence of words is used (so the analysis is not of single words but of batches of recurring words). I read that compression algorithms do something similar to what I want: they build dictionaries of blocks of text, each with a piece of information reporting its frequency.
It should be something similar to http://www.codeproject.com/KB/recipes/Patterns.aspx
Do you have anything written in C#?
Comments (1)
This is very simple to implement.
Use Split (a member function of the string class) to split the string into words. (You can use the delimiters from the CodeProject URL.)
Then use a for loop to enumerate all the n-grams, and a Dictionary<string, int> to keep the counts.
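Here is a minimal sketch of that approach, not a tuned implementation. The class name NGramCounter, the method name Count, the delimiter set, and the lower-casing step are all assumptions made for illustration; swap in whatever delimiters suit your text.

```csharp
using System;
using System.Collections.Generic;

public static class NGramCounter
{
    // Assumed delimiter set: whitespace plus common punctuation.
    private static readonly char[] Delimiters =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Returns how often each run of n consecutive words occurs in the text.
    public static Dictionary<string, int> Count(string text, int n)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        // Lower-casing is an assumption so that "The cat" and "the cat" match.
        string[] words = text.ToLowerInvariant()
            .Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);

        var counts = new Dictionary<string, int>();

        // Slide a window of n words across the array.
        for (int i = 0; i + n <= words.Length; i++)
        {
            string key = string.Join(" ", words, i, n);

            // TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }

        return counts;
    }
}
```

For example, NGramCounter.Count("the cat sat on the mat and the cat slept", 2) gives a count of 2 for "the cat" and 1 for every other pair. Note that because punctuation is treated as a plain delimiter, this sketch lets n-grams run across sentence boundaries; if that matters, split into sentences first and count each one separately.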