查找字符串列表中最长公共前缀/后缀的出现次数?
给定一个字符串列表:
ArrayList<String> strList = new ArrayList<String>();
strList.add("Mary had a little lamb named Willy");
strList.add("Mary had a little ham");
strList.add("Old McDonald had a farm named Willy");
strList.add("Willy had a little dog named ham");
strList.add("(abc)");
strList.add("(xyz)");
strList.add("Visit Target Store");
strList.add("Visit Walmart Store");
这应该以 HashMap
prefixMap
和 suffixMap
的形式生成输出:
前缀:
Mary had a -> 2
Mary had a little -> 2
( -> 2
Visit -> 2
后缀:
named Willy -> 2
ham -> 2
) -> 2
Store -> 2
到目前为止,我可以使用以下代码生成列表中所有项目中存在的前缀:
public static final int INDEX_NOT_FOUND = -1;
public static String getAllCommonPrefixesInList(final String... strs) {
if (strs == null || strs.length == 0) {
return EMPTY_STRING;
}
final int smallestIndexOfDiff = getIndexOfDifference(strs);
if (smallestIndexOfDiff == INDEX_NOT_FOUND) {
// All Strings are identical
if (strs[0] == null) {
return EMPTY_STRING;
}
return strs[0];
} else if (smallestIndexOfDiff == 0) {
// No common initial characters found, return empty String
return EMPTY_STRING;
} else {
// Common initial character sequence found, return sequence
return strs[0].substring(0, smallestIndexOfDiff);
}
}
public static int getIndexOfDifference(final CharSequence... charSequence) {
if (charSequence == null || charSequence.length <= 1) {
return INDEX_NOT_FOUND;
}
boolean isAnyStringNull = false;
boolean areAllStringsNull = true;
final int arrayLen = charSequence.length;
int shortestStrLen = Integer.MAX_VALUE;
int longestStrLen = 0;
// Find the min and max string lengths - avoids having to check that we are not exceeding the length of the string each time through the bottom loop.
for (int i = 0; i < arrayLen; i++) {
if (charSequence[i] == null) {
isAnyStringNull = true;
shortestStrLen = 0;
} else {
areAllStringsNull = false;
shortestStrLen = Math.min(charSequence[i].length(), shortestStrLen);
longestStrLen = Math.max(charSequence[i].length(), longestStrLen);
}
}
// Deals with lists containing all nulls or all empty strings
if (areAllStringsNull || longestStrLen == 0 && !isAnyStringNull) {
return INDEX_NOT_FOUND;
}
// Handle lists containing some nulls or some empty strings
if (shortestStrLen == 0) {
return 0;
}
// Find the position with the first difference across all strings
int firstDiff = -1;
for (int stringPos = 0; stringPos < shortestStrLen; stringPos++) {
final char comparisonChar = charSequence[0].charAt(stringPos);
for (int arrayPos = 1; arrayPos < arrayLen; arrayPos++) {
if (charSequence[arrayPos].charAt(stringPos) != comparisonChar) {
firstDiff = stringPos;
break;
}
}
if (firstDiff != -1) {
break;
}
}
if (firstDiff == -1 && shortestStrLen != longestStrLen) {
// We compared all of the characters up to the length of the
// shortest string and didn't find a match, but the string lengths
// vary, so return the length of the shortest string.
return shortestStrLen;
}
return firstDiff;
}
但是,我的目标是包含任何前缀/后缀至少包含2+
出现到结果地图中。
如何使用 Java
实现这一点?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
根据我对这个问题的理解,最适合解决它的数据结构是非循环脱节图。
一般情况下,一个图将由几个不相连的簇组成。每个簇都会有一个树状结构,在边缘情况下它将形成一个链表。
基本上,解决此问题的最简单的天真方法是根据每一行创建一堆链接列表,然后迭代它们。缺点是:节点重复(内存消耗更大)、时间复杂度更高(需要更多操作)并且更容易出错,因为需要更多的手动操作。
图的描述
因此,我将坚持使用图作为此问题的数据结构,并尝试使事情尽可能简单。
让我们考虑以下输入:
图形的图形表示如下所示;
前两行将构成一个由链表(头部)和树组成的簇 (尾部)。第二个簇将由一个链表表示,它的顶点不与其他字符串形成的顶点连接。
这不是构造顶点的唯一方式,头部可以生成一个 N 树,并且可以在中间的某个位置观察到链表。
主要要点是,为了解决问题,我们需要通过所有分支跟踪顶点链,直到顶点重叠。在图的这些部分中,两条或多条线中常见的每个前缀字符串和后缀字符串将由一个单个顶点(节点< /em>)。
为了维护映射到特定顶点的字符串数量,每个顶点应该有一个变量(
int groupCount
在下面的代码中),当创建顶点时,它被分配默认值1
,并且每次新字符串映射到该顶点时都会递增。每个顶点都包含一个映射,其中保存对其邻居的引用。添加新邻居顶点时,根据给定字符串创建新
顶点
或现有计数 >顶点 增加。为了符合此任务,图应维护对所有头顶点和尾顶点的引用。为了简单起见,在此解决方案中,图不是在每个顶点中维护两组对相邻节点的引用以及两个单独的计数变量(因为后缀计数和前缀计数会不同) /em> 实际上由两个图(后缀图和前缀图)组成。因此,实现类被命名为 MultiGraph。
为了用顶点填充后缀图和前缀图,方法
addCluster()
迭代通过Iterator
以正常或相反的顺序获取给定行的字符串,具体取决于正在填充的图。深度优先搜索
填充图形后的下一步是生成频率为
2
及以上的字符串映射。为此,经典的深度优先搜索算法正在被采用用过的。
为了实现 DFS,需要一个用作堆栈的可变容器(
ArrayDeque
正用于此目的)。从头/尾图中取出的第一个元素将被放置在堆栈的顶部,并且StringBuilder
的实例保存该元素的名称将被放置在地图中。然后,要恢复具有特定计数的字符串,将从堆栈顶部弹出顶点及其邻居 em> 与计数
> 1
依次将被放置在堆栈顶部。附加了分隔符和邻居名称的当前前缀副本将映射到邻居顶点。如果计数发生变化,则表明当前前缀表示至少两行之间的最长公共字符串。在这种情况下,前缀和计数将添加到结果映射中。
实现
以下实现由两个类组成,分别是狭义和自包含。 MultiGraph 类专门充当数据结构,维护两个图。管道代码,就像分割字符串一样,被提取到一个单独的类
GraphManager
中。Graph
下面的类处理处理输入数据的任务:它分割行,负责丢弃可能出现的空字符串,并将输入打包到
Deque 以方便的方式适应两个方向的迭代。
它还实例化图并管理它的工作。
GraphManager
负责向图表提供分隔符,以便在创建结果地图时恢复字符串的初始形状。这样,您可以在空格上分割给定的行,通过空字符串逐个字符地处理行,或者通过标点符号来分割,而无需更改这两个类中的任何一行。GraphManager
main()
输出
In my understanding of this problem the most suitable data structure for solving it is an acyclic disjointed Graph.
In general case, a graph will be comprised of several unconnected clusters. Each cluster will have a tree-like structure, in the edge case it'll form a linked list.
Basically, the most simple naive approach on how to solve this problem is to create a bunch of linked list based on each line, and iterate over them. The drawbacks are: duplication of nodes (greater memory consumption), greater time-complexity (more operations required) and it's more error-prone because more manual actions are needed.
The description of the Graph
So I'll stick with the graph as the data structure for this problem and try to keep things as simple as possible.
Let's consider the following input:
The graphical representation of the graph will look like this;
The two first lines will constitute a cluster formed from a linked list (the head part) and a tree (the tail part). The second cluster will be represented by a linked list, its vertices aren't connected with vertices formed from other strings.
It's not the only way the vertexes can be structured, the head could spawn an N-tree and a linked list could be observed somewhere in the middle.
The main takeaway is that in order to solve the problem, we need to track the chain of vertexes through all the branches until the vertexes overlap. In these parts of the graph, every prefix-strings and suffix-string that is common among two or more lines will be represented by a single vertex (node).
To maintain the number of strings that are mapped to a particular vertex, each vertex should have a variable (
int groupCount
in the code below), which is assigned with a default value of1
when a vertex is being created and incremented each time a new string gets mapped to this vertex.Each vertex contains a map that holds references to its neighbours. When a new neighbour-vertex is being added, either new
Vertex
in being created based on the given string or the count of an existing vertex gets incremented.In order to conform to this task, the graph should maintain references to all head-vertexes and tail-vertexes. For simplicity, instead of maintaining two groups of references to adjacent nodes, and two separate count variables (because suffix-count and prefix-count will differ) in each vertex, in this solution graph is actually comprised of two graph (suffix-graph and prefix-graph). And for that reason, the implementing class in named
MultiGraph
.In order to populate both suffix-graph and prefix-graph with vertexes, method
addCluster()
iterates over the string of the given line by the means ofIterator
in normal or reversed order, depending on which graph is being populated.Depth first search
The next step after the graphs are populated is to generate the maps of strings with the frequency of
2
and greater.For that, the classical depth first search algorithm is being used.
In order to implement the DFS, a mutable container that will be used as a stack is required (
ArrayDeque
is being used for that purpose). The first element that is taken from the map of heads/tails will be placed on the top of the stack and an instance ofStringBuilder
holding the name of this element will be placed in the map.Then, to restore a string with a particular count, vertexes will be popped from the top of the stack and their neighbours with count
> 1
in turn will be placed on top of the stack. A copy of the current prefix with the delimiter and the neighbour's name appended will get mapped to the neighbour-vertex.If a count changes, that indicates that the current prefix represents the longest common string between at least two lines. In this case, prefix and count are being added to the resulting map.
Implementation
The following implementation consists of two classes that are narrow-focused and self-contained. The
MultiGraph
class acts exclusively as data structure, maintaining two graphs. The pluming code, like splitting the lines of strings, is extracted into a separate classGraphManager
.Graph
The following class deals with the task of processing the input data: it splits the lines, takes care of discarding the empty string which might take place, and packs the input into a
Deque
to accommodate the iteration in both directions in a convenient way.It also instantiates the graph and governs it's work.
GraphManager
takes care of providing the delimiter to the graph in order to restore the initial shape of strings while the resulting maps are being created. With that you can split the given lines on a white space, by empty string to process lines character by character or by punctuation marks without changing a single line in these two classes.GraphManager
main()
Output
这个问题应该可以使用 特里。
trie 节点基本上应该跟踪两件事:
将所有字符串插入到 trie 中,这将在
O(字符串长度 * 字符串数量)
内完成。之后,只需遍历 trie,您就可以根据您的用例根据计数对前缀进行哈希处理。对于后缀,您可以使用相同的方法,只需开始以相反的顺序遍历字符串即可。编辑:
再想一想,trie 可能是最有效的方法,但简单的 hashmap 实现也应该在这里工作。下面是生成所有包含 count > 的前缀的示例。 1.
尽管如此,为了完成,我还将添加一个基本的
TrieNode
实现,因为我的答案首先提出了这种方法。TrieNode
:Trie
:遍历 trie 类似于简单的 DFS 场景,其中还维护到当前节点的“路径”(我们使用前缀切换路径)到目前为止)。
This problem should be solved easily using a trie.
The trie node should basically keep a track of 2 things:
Insert all strings in the trie, which will be done in
O(string length * number of strings)
. After that, simply traversing the trie, you can hash the prefixes based on the count as per your use case. For suffixes, you can use the same approach, just start traversing the strings in reverse order.Edit:
On second thought, trie might be the most efficient way, but a simple hashmap implementation should also work here. Here's an example to generate all prefixes with count > 1.
Nonetheless, I'll also add a basic
TrieNode
implementation for the sake of completion, since my answer proposed that approach in the first place.TrieNode
:Trie
:Traversing the trie would be analogous to a simple DFS scenario where the "path" upto current node is also maintained (we're switching the path with the prefix upto this point).
要实现 trie,首先需要一个节点。
在节点中,我们有实际值、对父节点的引用,以便更容易构建前缀、将值映射到相应节点的子节点,以及节点是否为结束节点的标志。
实际的
Trie
实现从某种意义上说,它有所简化,仅实现了
insert
,但我们现在不需要更新和删除。需要注意的几点:null
值和null
父级来表示。我们绝不能永远更新此节点。BiConsumer, T>
。这是我们在创建前缀时添加元素的顺序的抽象。这是必需的,因为在反转集合/句子时,计数后缀可以表示并实现为计数前缀。countPrefixes()
的操作countPrefixes()
的返回类型Map, Integer>
。这种实现更加通用,因此每个前缀都表示为节点值的集合。ArrayList
?首先我们需要保持插入顺序。其次,它的hashCode()
和equals()
实现使其成为映射键的良好候选者。引用 javadoc:This确保 list1.equals(list2) 意味着任意两个列表 list1 和 list2 的 list1.hashCode()==list2.hashCode(),正如 Object.hashCode 的一般契约所要求的。
- for哈希码。换句话说,如果两个列表包含相同顺序的相同元素,则它们被定义为相等。
- 对于 equals。主要要测试。
这里需要注意的是:
(
的输入拆分为字符而不是单词作为其余部分来实现输出,但示例有点令人困惑。对于句子前缀是单词的集合,而对于这两个单词,它们是字符。trie
用于后缀 - 否则我们可能将前缀算作后缀,反之亦然。Arrays.asList()
返回的列表由输入数组支持。反转输入数组本身 编辑:在使用字符进行测试时AbstractSet,或者任何精确的集合,对于前缀来说都是一个坏主意,因为它不允许重复的元素。更新了 trie 实现以使用
ArrayList
。我也添加还有 2 个主要方法,用于演示 trie 处理字符串和字符的方法:To implement a trie, first you need a node.
In the node we have actual value, reference to the parent node, to make it eaiser to build prefixes, child nodes mapped value to corresponding node, and a flag if node is end node.
Actual
Trie
implementationIt is somewhat simplified in the sense, that only
insert
is implemented, but we don't need update and delete now. Few things to note:null
value andnull
parent. We must never update this node.BiConsumer<Deque<T>, T> operation
ofcountPrefixes()
. This is abstraction for the order in which we want to add elements when creating a prefix. It's needed, because counting suffixes can be represented, and is implemented, as counting prefixes when reversing the collection/sentence.Map<Collection<T>, Integer>
ofcountPrefixes()
. This implementation is more generic, thus each prefix is represented as collection of node values.ArrayList
? First we need to keep insertion order. Second its' implementation ofhashCode()
andequals()
makes it good candidate for map key. To citate javadoc:This ensures that list1.equals(list2) implies that list1.hashCode()==list2.hashCode() for any two lists, list1 and list2, as required by the general contract of Object.hashCode.
- for hashCode.In other words, two lists are defined to be equal if they contain the same elements in the same order.
- for equals.A main to test.
Things to note here:
(
into characters and not words as the rest to achieve your output, but example is somewhat confusing. For sentences prefixes are collection of words, while for those two they are characters.trie
for suffixes - otherwise we might count prefix as suffix and vice versa.Arrays.asList()
returned list is backed by input array. I am taking advantage of that to reverse the input array itself for suffixes.EDIT: While testing with characters realized AbstractSet, or any set to be precise, is a bad idea for prefix, because it does not allow duplicated elements. Updated trie implementation to use
ArrayList
. I'm also adding 2 more main methods, to demonstrate the trie working with strings and with characters:我认为@Abhinav 提供的解决方案应该使用 HashMap 来工作。在这里,我将使用 java 中的简单 Trie 实现发布解决方案(进行一些自定义,例如将 freq 添加到 Trie 节点)。
以下是该算法所需的其他 2 个类:
输出是前缀:(对于后缀应该类似,只需向后构建 Trie)
这里我使用 BFS 来检索 Trie 中频率等于 maxFreq 的所有子字符串。我们可以根据需要调整过滤条件。你也可以在这里做DFS。
其他考虑因素是我们可以将前缀添加到 TNode 类本身中,我更喜欢将其单独放在另一个类中。
I think the solution provided by @Abhinav should work using HashMap. Here I will post the solution using a simple Trie implementation in java (with some customization, such as adding freq into the Trie Node).
Here are 2 other classes required for this algorithm:
The output is for prefix: (for postfix should be similar, just need to build the Trie backward)
Here I am using a BFS to retrieve all sub strings having frequency equal maxFreq in the Trie. We can adjust the filter condition based on the need. You can do the DFS here also.
Other consideration is we can add prefix into the TNode class itself, I prefer to keep it separate in another class.