字符串数组仅包含字谜?
有人给我做了一个关于字谜的练习,它看起来非常简单,我怀疑我错过了什么。 我实施的解决方案是我很快就会介绍的解决方案,我想问您是否可以想到我的解决方案有任何优化、方法改变或问题。 我用Java实现了该算法。
现在,练习。 作为输入,我有一个文本,作为输出,我应该返回该文本的每一行是否是每一行的字谜。 也就是说,对于输入:
A Cab Deed Huffiest Minnows Loll
出租车契约哈菲斯特米诺洛尔
出租车契约洗牌了数百万美元
出租车契约洗牌了百万镇
程序应该返回 True。 输入:
出租车契约 Huffiest Minnows Loll
出租车契约哈菲斯特米诺闲逛
出租车契约洗牌了数百万美元
A Cab Deed Shuffles Million Town 的
输出必须为 False(当然是因为第二行)。
现在,我的想法非常简单:
- 我创建 2 个 HashMap:ref 和 cur。
- 我解析文本的第一行,填充 ref。我只会计算字母。
- 对于每一行,我将该行解析为 cur 并检查是否 cur.equals(ref): if so return false
- 如果我到达文本末尾,这意味着每一行都是每一行的字谜,所以我返回真。
而且……就是这样。 我尝试输入 88000 行的文本,速度相当快。
有什么意见吗?建议?优化?
非常感谢您的帮助。
I have been given an exercise about anagrams and it looked so really easy that I have a doubt I am missing something.
The solution I implemented is the one I will present shortly, and I wanted to ask you if you could think of any optimization, change of approach or problem with my solution.
I implemented the algorithm in Java.
Now, the exercise.
As input I have a text and as output I should return whether each line of this text is an anagram of each other line.
That is, for input:
A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town
The program should return True.
For input:
A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls hi
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town
the output will have to be False (because of the second line, of course).
Now, what I thought is pretty straightforward:
- I create 2 HashMap: ref and cur.
- I parse the first line of the text, filling ref. I will count only alphabetical letters.
- for each other line, I parse the line into cur and check if cur.equals(ref): if so return false
- if I get to the end of the text, it means that each line is an anagram of each other line, so I return true.
And...this would be it.
I tried it with an input text of 88000 lines, and it works pretty fast.
Any comments? Suggestions? Optimizations?
Thank you very much for the help.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
另一种选择是:
.equals
)我怀疑您的方法更快尽管。
编辑:
由于@nibot 不同意我的建议,而且我不是一个在没有证据的情况下来回争论的人,这里有三个解决方案。
它们的实现都非常相似:
?部分是以下之一:
HashMap
对我用以下
命令 运行它们:文件与 OP 发布的文件类似,但明显更长,并在末尾添加了约 20 行的非字谜,以确保算法全部正常工作。
我一直得到这样的结果:
如果满足以下条件,则
HashMap
可以得到显着改进:HashMap
HashMap
以及一种获取和递增的方法(因此只有一次查找而不是 2 次)但是,这些不在标准库中,所以我忽略它们(就像大多数使用 Java 的程序员会)。
这个故事的寓意是大O并不代表一切。您需要考虑开销和n的大小。在这种情况下,n 相当小,并且
HashMap
的开销非常大。如果线路更长,情况可能会改变,但不幸的是,我不想弄清楚盈亏平衡点在哪里。如果您仍然不相信我,请考虑 GCC 使用 在某些情况下,插入排序在其 C++ 标准库中。
Another option is:
.equals
)I suspect your way is faster though.
EDIT:
Since @nibot disagrees with my even suggesting this, and I'm not one to argue back and forth without proof, here's three solutions.
They're all implemented very similarly:
The ? part is one of:
HashMap
of character countsI ran them all with this:
My file is similar to the one the OP posted, but made significantly longer, with a non-anagram about 20 lines from the end to ensure that the algorithms all work.
I consistently get results like this:
The
HashMap
one could be significantly improved if:HashMap<char, int>
HashMap
and a way to get-and-increment (so there would only be one lookup instead of 2)But, these aren't in the standard library, so I'm ignoring them (just like most programmers using Java would).
The moral of the story is that big O isn't everything. You need to consider the overhead and the size of n. In this case, n is fairly small, and the overhead of a
HashMap
is significant. With longer lines, that would likely change, but unfortunately I don't feel like figuring out where the break-even point is.And if you still don't believe me, consider that GCC uses insertion sort in some cases in its C++ standard library.
假设你的 HashMap 是从 (character) -> 的映射(字符串中出现的次数),你几乎已经拥有了。
我认为您应该忽略空格和标点符号,并将大写字母和小写字母视为相同。如果您不使用英语以外的任何语言,那么 HashMap 就有点过分了:您可以简单地使用代表 A..Z 的 26 个计数的数组。如果你需要支持 Unicode 那么问题当然要复杂得多,因为你不仅需要处理可能数千种不同类型的字母,而且还必须定义“字母”(幸运的是存在字符属性数据对此有帮助)和“小写/大写”(请注意,有些语言没有大小写,有些可以将两个小写字母映射为一个大写字母,反之亦然......)。更不用说标准化了:)
Assuming that your HashMap is a mapping from (character) -> (number of occurrences in the string), you pretty much have it.
I assume that you're supposed to ignore whitespace and punctuation, and treat uppercase and lowercase letters as the same. If you're not using any languages other than English, then a HashMap is overkill: you can simply use an array of 26 counts representing A..Z. If you need to support Unicode then the problem is of course much more complicated, as not only do you need to deal with possibly thousands of different kinds of letters, but you also have to define 'letter' (fortunately there exists character property data that helps with this) and 'lowercase/uppercase' (note that some languages don't have case, some can map two lowercase letters into a single uppercase one or vice-versa...). To say nothing of normalization :)
构建 @Karl Knechtel 的答案(并解决您对支持多个字母表的担忧):
创建接口(例如)AnagramKey 和 AnagramKeyFactory。将应用程序的其余部分设计为与所使用的密钥类型无关。
创建 AnagramKey 接口的一种实现,在内部使用
int[]
来表示字符计数。创建 AnagramKey 接口的第二个实现,它使用
HashMap
来表示字符计数。创建对应的工厂接口。
在使用命令行参数、区域设置或其他参数表示键的两种方式之间进行选择。
注:
目前尚不清楚“字谜”在非字母语言的上下文中是否有意义,或者对于将多种语言混合成“句子”的话语是否有意义。另外,我不知道(例如)法语中的字谜是否会忽略字符上的重音。无论如何,我很想将所有这些情况都视为“超出范围”......除非您有明确的要求来支持它们。
目前
int[]
使用的空间少于HashMap
时的盈亏平衡密度在 15 分之一的范围内渐近地约为 1 个字符计数数组中的字符。 (具有这些键/值类型的 HashMap 中的每个条目占用 15 个 32 位字的区域。)并且这没有考虑HashMap
节点和哈希数组节点的开销...如果对字谜词的长度进行限制,则可以通过使用
short[]
甚至byte[]
来节省更多空间字符数。Building in @Karl Knechtel's answer (and addressing your concern about supporting multiple alphabets):
Create interfaces (say) AnagramKey and AnagramKeyFactory. Design the rest of the application to be agnostic of the type of key used.
Create one implementation of the AnagramKey interface that internally uses an
int[]
to represent the character counts.Create a second implementation of the AnagramKey interface that uses a
HashMap<Character, Integer>
to represent the character counts.Create the corresponding factory interfaces.
Choose between the two ways of representing the keys using a command line parameter, the Locale, or something else.
Notes:
It is not clear that "anagrams" make sense in the context of non-alphabetic languages, or for utterances that mix multiple languages into a "sentence". Also, I don't know whether anagrams in (say) French ignore the accents on characters. At any rate, I would be tempted to rule all of these cases as "out of scope" ... unless you have an explicit requirement to support them.
The break-even density at which an
int[]
uses less space than aHashMap<Character, Integer>
is asymptotically around 1 character in 15 across the range of characters in your count array. (Each entry in a HashMap with those key/value types occupies in the region of 15 32-bit words.) And that doesn't take account of the overheads of theHashMap
node and the hash array node ...If you place limits on the length of the anagrams, you can save more space by using a
short[]
or even abyte[]
for the character counts.