What is the best data structure and algorithm for comparing a list of strings?
I want to find the longest possible sequence of words that satisfies the following rules:
- Each word can be used at most once
- All words are strings
- Two strings sa and sb can be concatenated if the last two characters of sa match the first two characters of sb.
Concatenation is performed by overlapping those characters. For example:
- sa = "torino"
- sb = "novara"
- sa concat sb = "torinovara"
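A minimal sketch of that rule in C++ (the names canConcat and merge are just illustrative, not from any library) could look like this:

```cpp
#include <iostream>
#include <string>

// True if the last two characters of a equal the first two characters of b.
bool canConcat(const std::string& a, const std::string& b) {
    return a.size() >= 2 && b.size() >= 2 &&
           a.compare(a.size() - 2, 2, b, 0, 2) == 0;
}

// Concatenate by overlapping the shared two characters:
// drop the first two characters of b, which already end a.
std::string merge(const std::string& a, const std::string& b) {
    return a + b.substr(2);
}

int main() {
    std::string sa = "torino", sb = "novara";
    if (canConcat(sa, sb))
        std::cout << merge(sa, sb) << "\n";   // prints "torinovara"
}
```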
For example, I have the following input file, "input.txt":
novara
torino
vercelli
ravenna
napoli
livorno
messania
noviligure
roma
And the output for the above file, according to the above rules, should be:
torino
novara
ravenna
napoli
livorno
noviligure
since the longest possible concatenation is:
torinovaravennapolivornoviligure
Can anyone please help me out with this? What would be the best data structure for this?
2 Answers
This can be modeled as a directed graph problem: the nodes are words, and two nodes are connected by an edge if the words overlap (choosing the smallest overlap to get the longest length); then find the highest-weight non-intersecting path.
(Well, actually, you want to expand the graph a bit to handle beginning and ending at a word. Adjoin a "starting node" with an edge to each word of weight (length of word) / 2.
Between two words the edge weight is (length of first word + length of second word) / 2 - overlap, and between each word and an "ending node" it is (length of word) / 2. It might be easier to double all the weights to avoid fractions.)
https://cstheory.stackexchange.com/questions/3684/max-non-overlapping-path-in-weighted-graph
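A very rough sketch of that graph idea in C++ (not the weighted-path reduction from the link above, just a brute-force DFS over the same overlap graph, so it is exponential and only practical for small inputs like the nine-word example) might look like this:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Edge u -> v exists when the last two characters of word u match the
// first two characters of word v.
bool canConcat(const std::string& a, const std::string& b) {
    return a.size() >= 2 && b.size() >= 2 &&
           a.compare(a.size() - 2, 2, b, 0, 2) == 0;
}

// Exhaustive DFS: extend the current chain with every unused neighbour
// and remember the chain whose merged string is longest.
void dfs(const std::vector<std::vector<int>>& adj,
         const std::vector<std::string>& words,
         std::vector<bool>& used, std::vector<int>& path, int length,
         std::vector<int>& best, int& bestLength) {
    if (length > bestLength) { bestLength = length; best = path; }
    for (int v : adj[path.back()]) {
        if (used[v]) continue;
        used[v] = true;
        path.push_back(v);
        dfs(adj, words, used, path, length + (int)words[v].size() - 2,
            best, bestLength);
        path.pop_back();
        used[v] = false;
    }
}

int main() {
    std::vector<std::string> words = {"novara", "torino", "vercelli",
        "ravenna", "napoli", "livorno", "messania", "noviligure", "roma"};
    int n = (int)words.size();

    // Build the adjacency lists of the directed overlap graph.
    std::vector<std::vector<int>> adj(n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j && canConcat(words[i], words[j]))
                adj[i].push_back(j);

    // Try every word as the start of the chain.
    std::vector<int> best;
    int bestLength = 0;
    for (int s = 0; s < n; ++s) {
        std::vector<bool> used(n, false);
        std::vector<int> path = {s};
        used[s] = true;
        dfs(adj, words, used, path, (int)words[s].size(), best, bestLength);
    }

    for (int i : best) std::cout << words[i] << "\n";   // the chosen sequence
}
```

For the nine-word example this reproduces the expected six-word chain (torino, novara, ravenna, napoli, livorno, noviligure).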
I would start really simple. Make 2 vectors of strings, one sorted normally and one sorted by the last 2 letters. Create an index (a vector of ints) for the second vector that points out each word's position in the first.
To find the longest sequence, first remove the orphans: words that don't match anything at either end. Then build a neighbor-joining tree; this is where you determine which words can possibly reach each other. If you have 2 or more trees, you should try the largest tree first.
Now, within a tree, your job is to find ends that are rare, bind them to other ends, and repeat. This should get you a pretty nice solution; if it uses all the words, you're golden and can skip the other trees. If it doesn't, then you're into a whole slew of algorithms to make this efficient.
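A sketch of that setup, assuming every word has at least two characters (WordIndex, buildIndex and removeOrphans are made-up names, and this only covers the indexing and orphan-pruning steps, not the tree building or the binding of rare ends):

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Helpers; every word is assumed to have at least two characters.
static std::string firstTwo(const std::string& w) { return w.substr(0, 2); }
static std::string lastTwo(const std::string& w)  { return w.substr(w.size() - 2); }

// Two copies of the word list, one sorted normally (by prefix) and one
// sorted by the last two letters, plus an index mapping the second
// ordering back into the first.
struct WordIndex {
    std::vector<std::string> byPrefix;
    std::vector<std::string> bySuffix;
    std::vector<int> suffixToPrefix;   // bySuffix[i] == byPrefix[suffixToPrefix[i]]
};

WordIndex buildIndex(std::vector<std::string> words) {
    WordIndex idx;
    std::sort(words.begin(), words.end());
    idx.byPrefix = words;
    idx.bySuffix = words;
    std::sort(idx.bySuffix.begin(), idx.bySuffix.end(),
              [](const std::string& a, const std::string& b) {
                  return lastTwo(a) < lastTwo(b);
              });
    for (const std::string& w : idx.bySuffix) {
        auto it = std::lower_bound(idx.byPrefix.begin(), idx.byPrefix.end(), w);
        idx.suffixToPrefix.push_back((int)(it - idx.byPrefix.begin()));
    }
    return idx;
}

// A word is an orphan if no other word ends with its first two letters
// and no other word starts with its last two letters; it can never be
// part of a chain of length > 1, so drop it up front.
std::vector<std::string> removeOrphans(const std::vector<std::string>& words) {
    std::vector<std::string> kept;
    for (std::size_t i = 0; i < words.size(); ++i) {
        bool hasPredecessor = false, hasSuccessor = false;
        for (std::size_t j = 0; j < words.size(); ++j) {
            if (i == j) continue;
            if (lastTwo(words[j]) == firstTwo(words[i])) hasPredecessor = true;
            if (firstTwo(words[j]) == lastTwo(words[i])) hasSuccessor = true;
        }
        if (hasPredecessor || hasSuccessor) kept.push_back(words[i]);
    }
    return kept;
}

int main() {
    std::vector<std::string> words = {"novara", "torino", "vercelli",
        "ravenna", "napoli", "livorno", "messania", "noviligure", "roma"};
    // For this list, "messania" and "roma" are orphans and get dropped.
    std::vector<std::string> usable = removeOrphans(words);
    WordIndex idx = buildIndex(usable);
    for (const std::string& w : idx.bySuffix) std::cout << w << "\n";
}
```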
Some items to consider:
If you have 3+ unique ends you are guaranteed to drop 1+ words.
This can be used to prune your tries down while hunting for an answer.
recalculate unique ends often.
Odd numbers of a given end ensure that one must be dropped(you get 2 freebies at the ends).
Segregate words that can self match , you can always throw them in last, and they muck up the math otherwise.
You may be able to create small self matching rings, you can treat these like self matching words, as long as you don't orphan them when you create them. This can make the performance fantastic, but no guarantees on a perfect solution.
The search space is on the order of N!, so for a list of millions of elements it may be hard to prove an exact answer. Of course, I could be overlooking something.
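To make the unique-end and parity bookkeeping above concrete, here is a small counting sketch (just an illustration of the bookkeeping, not a solver):

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// For every two-letter label, count how many words start with it and how
// many end with it, and print the imbalance. Labels with a large imbalance
// force words to be dropped, apart from the "freebies" at the two ends of
// the chain.
int main() {
    std::vector<std::string> words = {"novara", "torino", "vercelli",
        "ravenna", "napoli", "livorno", "messania", "noviligure", "roma"};

    std::map<std::string, int> balance;          // starts minus ends per label
    for (const std::string& w : words) {         // assumes length >= 2
        ++balance[w.substr(0, 2)];
        --balance[w.substr(w.size() - 2)];
    }

    for (const auto& p : balance)
        std::cout << p.first << ": starts - ends = " << p.second << "\n";
}
```

Roughly, a label whose imbalance exceeds one in absolute value forces words to be dropped, which is the point of the "freebies" remark above.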