分析文本(词形还原、编辑距离)
我需要分析文本中是否存在禁用词。假设黑名单是单词:“禁止”。这个词有多种形式。在文本中,该词可以是例如:“禁止”、“禁止”、“禁止”。为了将单词转化为初始形式,我使用了词形还原过程。你的建议?
拼写错误怎么办?
例如:“F0rb1d”。我认为使用damerau-Levenshtein 或其他。你的建议?
如果文本写成如下:
“禁止信息。公司的私人信件。”或者 “F0rb1dden1nformation。私人对应的公司。” (是的,没有空格)
如何解决这个问题?
最好是快速算法,因为文本是实时处理的。
也许还有一些提高性能的技巧(如何存储等)?
I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?
What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?
And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR
"F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)
How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
据我所知,有两种可能的解决方案。
您可以尝试使用动态规划,LCS(最长公共子序列)。它将在原始文本中搜索所需的单词作为模式,我相信它是 O(mn):
http://en .wikipedia.org/wiki/Longest_common_subsequence_problem
http://www.ics.uci.edu/~eppstein/161/960229 .html
虽然使用文本搜索算法更容易。我知道的最好的是KMP,它的复杂度为O(n)。为了进行字符比较,您可以将它们分组为 {i I l(L) 1}、{o O 0} 等集合。但是您可以修改它以不匹配所有字母(禁止 -> 禁止)。
http://en.wikipedia.org/wiki/Knuth -Morris-Pratt_algorithm
所以现在您可以比较这两个算法和您的建议的优点。
there're two possible solutions as far as I know algorithms.
You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):
http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
http://www.ics.uci.edu/~eppstein/161/960229.html
Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).
http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
So now you could compare benefits of these two and yours suggestion.
您还可以使用正则表达式匹配来检查单词。
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am /regexppsd.aspx
You could also use RegEx Matches to check for words.
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx