Generating N-grams from a sentence
How can I generate the n-grams of a string such as:
String Input="This is my car."
I want to generate n-grams from this input:
Input Ngram size = 3
Output should be:
This
is
my
car
This is
is my
my car
This is my
is my car
Any ideas on how to implement this in Java, or is there a library available for it?
I am trying to use this NGramTokenizer, but it gives n-grams of character sequences, and I want n-grams of word sequences.
Comments (7)
Check this out:
Simple recursive function, better running time.
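The code for this answer is not preserved in this copy. A minimal sketch of how a recursive version could look is given below; the class and method names are hypothetical, and stripping punctuation before splitting is an assumption made so that the result matches the output requested in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RecursiveNGrams {

    // Collects all word n-grams of sizes 1..maxN, smallest size first.
    static List<String> allNGrams(String sentence, int maxN) {
        // Assumption: drop punctuation, then split on whitespace.
        String cleaned = sentence.replaceAll("[^\\p{L}\\p{N}\\s]", "").trim();
        List<String> out = new ArrayList<>();
        if (cleaned.isEmpty()) {
            return out;
        }
        String[] words = cleaned.split("\\s+");
        for (int n = 1; n <= maxN; n++) {
            collect(words, n, 0, out);
        }
        return out;
    }

    // Recursive step: emit the n-gram starting at 'start', then recurse on start + 1.
    private static void collect(String[] words, int n, int start, List<String> out) {
        if (start + n > words.length) {
            return; // base case: not enough words left for a full n-gram
        }
        out.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
        collect(words, n, start + 1, out);
    }
}
```

Calling `RecursiveNGrams.allNGrams("This is my car.", 3)` returns the nine n-grams listed in the question, in the same order. The recursion depth grows with the number of words, so for very long inputs a loop may be the safer choice.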
I believe this would do what you want:
Output:
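Neither the code nor its output is preserved in this copy. A minimal loop-based sketch of the idea, assuming words are separated by whitespace (the class name `WordNGrams` and its methods are hypothetical), might look like this:

```java
import java.util.ArrayList;
import java.util.List;

class WordNGrams {

    // Returns every run of n consecutive words in str, in order of appearance.
    static List<String> ngrams(int n, String str) {
        List<String> result = new ArrayList<>();
        String trimmed = str.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            result.add(concat(words, i, i + n));
        }
        return result;
    }

    // Joins words[start..end) with single spaces.
    static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++) {
            if (i > start) {
                sb.append(' ');
            }
            sb.append(words[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Print unigrams, then bigrams, then trigrams.
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car")) {
                System.out.println(ngram);
            }
        }
    }
}
```

Run on "This is my car" (the trailing period removed beforehand), `main` prints the nine n-grams from the question, one per line. If the period is left in, the last word of each affected n-gram is "car." instead.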
An "on-demand" solution implemented as an Iterator:
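The iterator code is also missing here. As a sketch under the same whitespace-splitting assumption (the class name `NGramIterator` is hypothetical), an on-demand version could build each n-gram only when `next()` is called:

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class NGramIterator implements Iterator<String> {
    private final String[] words;
    private final int n;
    private int pos = 0; // index of the first word of the next n-gram

    NGramIterator(int n, String str) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
        String trimmed = str.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public boolean hasNext() {
        // Another n-gram exists while n words remain from pos onward.
        return pos + n <= words.length;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++) {
            if (i > pos) {
                sb.append(' ');
            }
            sb.append(words[i]);
        }
        pos++;
        return sb.toString();
    }
}
```

Because nothing is stored beyond the word array, the caller can stop early without paying for n-grams it never reads.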
You are looking for ShingleFilter.
Update: The link points to version 3.0.2. This class may be in a different package in newer versions of Lucene.
This code returns an array of all Strings of the given length:
E.g.
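The code and example for this answer did not survive in this copy. A sketch of an array-returning variant, assuming whitespace-separated words and `len` of at least 1 (the class name `NGramArray` is hypothetical), could be:

```java
import java.util.Arrays;

class NGramArray {

    // Returns all n-grams of exactly 'len' words as an array.
    static String[] ngrams(String s, int len) {
        String[] parts = s.trim().split("\\s+");
        // No n-grams exist when the sentence has fewer than 'len' words.
        int count = Math.max(0, parts.length - len + 1);
        String[] result = new String[count];
        for (int i = 0; i < count; i++) {
            result[i] = String.join(" ", Arrays.copyOfRange(parts, i, i + len));
        }
        return result;
    }
}
```

For instance, `Arrays.toString(NGramArray.ngrams("This is my car", 2))` evaluates to `[This is, is my, my car]`.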
Call:
Output:
Here is my code to create n-grams. In this case, n = 2, 3. Word n-grams that fall below the cutoff value are dropped from the result set. The input is a list of sentences, which are then parsed using an OpenNLP tool.
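The original code is not included in this copy. The sketch below illustrates the general idea only, with two loudly stated assumptions: the cutoff is read as a minimum occurrence count, and a plain whitespace split stands in for the OpenNLP tokenizer. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NGramCounter {

    // Counts word n-grams for minN <= n <= maxN across all sentences and
    // keeps only those that occur at least 'cutoff' times (assumed meaning).
    static Map<String, Integer> frequentNGrams(List<String> sentences,
                                               int minN, int maxN, int cutoff) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Stand-in for an OpenNLP tokenizer: plain whitespace split (assumption).
            String[] tokens = trimmed.split("\\s+");
            for (int n = minN; n <= maxN; n++) {
                for (int i = 0; i + n <= tokens.length; i++) {
                    String gram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                    counts.merge(gram, 1, Integer::sum);
                }
            }
        }
        // Drop n-grams whose count is below the cutoff.
        counts.values().removeIf(c -> c < cutoff);
        return counts;
    }
}
```

With the sentences "this is my car" and "this is my bike", n from 2 to 3, and a cutoff of 2, the surviving n-grams are "this is", "is my", and "this is my", each with a count of 2.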