我正在进行大量数据的字符串匹配。
编辑:我将一个大列表中包含的单词与一些本体文本文件进行匹配。我从本体中获取每个文件,并搜索每个文件行的第三个字符串与列表中的任何单词之间的匹配项。
我在监督这样一个事实时犯了一个错误:我需要做的不是纯粹的匹配(结果很差),但我需要一些更宽松的匹配函数,当该字符串包含在另一个字符串中时,它也会返回结果。
我用 Radix 做到了这一点特里;它非常快并且工作得很好,但现在我想我的工作毫无用处,因为特里树只返回完全匹配的结果。 :/
- 执行此操作的算法类型是字符串搜索算法?
- 有人可以建议一些他有经验的 Java 实现吗?
该算法应该是快速的,但不是最优先考虑的,会损害速度和性能。复杂。
我非常感谢所有建议/示例/解释/链接!
谢谢你!
I am doing string matching with big amount of data.
EDIT: I am matching words contained in a big list with some ontology text files. I take each file from ontology, and search for a match between the third String of each file line and any word from the list.
I made a mistake in overseeing the fact that what I need to do is not pure matching (results are poor), but I need some looser matching function that will also return results when the string is contained inside another string.
I did this with a Radix Trie; it was very fast and works nice, but now I guess my work is useless because a trie returns only exact matches. :/
- Type of algorithms that do this are string searching algorithms?
- Can somebody suggest some Java implementations that he has experience with?
The algorithm should be fast, but is not top top priority, would compomise with speed & complexity.
I am very grateful for all advice/examples/explanations/links!
Thank you!
发布评论
评论(5)
您可能会发现后缀树很有用(它们在概念上与尝试类似)。
每个字符串都以 ^ 开头并以 $ 结尾,并创建所有附加字符串的后缀树。空间使用量将为 O(n),并且可能比 trie 的空间使用率更差。
如果你现在需要搜索字符串 s,你可以在 O(|s|) 时间内轻松完成,就像 trie 一样,你得到的匹配将是子字符串匹配(基本上,你将匹配某个字符串的某些后缀) )。
抱歉,我没有方便的 Java 实现参考。找到了一个有用的 stackoverflow 答案:通用后缀树 Java 实现
其中具有:
http://illya-keeplearning.blogspot。 com/2009/04/suffix-trees-java-ukkonens-algorithm.html
依次具有: 源代码:http://illya.yolasite.com/resources/suffix-tree.zip
You might find Suffix Trees useful (they are similar in concept to Tries).
Each string, you prepend with ^ and end with $ and create a suffix tree of all the strings appended. Space usage will be O(n) and will be probably worse than what you had for the trie.
If you now need to search for a string s, you can easily do in O(|s|) time, just like a trie and the match you get will be a substring match (basically, you will be matching some suffix of some string).
Sorry, I don't have a reference to a Java implementation handy.Found a useful stackoverflow answer: Generalized Suffix Tree Java Implementation
Which has:
http://illya-keeplearning.blogspot.com/2009/04/suffix-trees-java-ukkonens-algorithm.html
Which in turn has: Source Code: http://illya.yolasite.com/resources/suffix-tree.zip
您可以使用 BM 算法 在文本文件中搜索单一模式,并对列表中的所有模式重复此算法。
另一个最佳解决方案是使用多模式搜索算法,例如: Aho–Corasick string匹配算法
you can use BM algorithm for search in text files for single pattern, and repeat this algorithm for all the patterns you have in your list.
the other best solution is to use multi-pattern search algorithms like: Aho–Corasick string matching algorithm
正则表达式绝对是您最好的选择。它们写起来可能有点混乱,但它们是您可以进行更宽松的匹配而无需使用一系列难以理解的 if/else 或 switch 语句的唯一方法。
另外,它们会比替代方案快得多。
Regular expressions are definitely your best bet. They can be a little bit messy to write, but they're the only way that you can have a looser matching without having an incomprehensible series of if/else or switch statements.
Plus, they'll be a lot faster than the alternative.
我不完全确定我是否正确理解了这个问题,但听起来正则表达式可以完成这项工作
http://java.sun.com/developer/technicalArticles/releases/1.4regex/
I'm not entirely sure if I understood the question correctly, but it sounds like regular expressions would do the job
http://java.sun.com/developer/technicalArticles/releases/1.4regex/
为什么不使用java中的indexOf方法呢?根据内存的可用性,读取内容。执行一个indexOf并获取您需要的所有行。加载下一组内容。
如果从文件读取,请使用 nio 流。
也许这个想法很糟糕,但我相信java。它将使用最好的算法。
如果使用正则表达式会更好。
Why don't you use the indexOf method in java. As per the availability of memory, read the content. Do an indexOf and get all the lines you need. Load the next set of contents.
If reading from file use nio streams.
May be the idea is bad, But I belive in java. It will use the best algorithm.
Better if you use regular expression.