字符串数组仅包含字谜?

发布于 2024-12-08 02:05:20 字数 745 浏览 2 评论 0原文

有人给我做了一个关于字谜的练习,它看起来非常简单,我怀疑我错过了什么。 我实施的解决方案是我很快就会介绍的解决方案,我想问您是否可以想到我的解决方案有任何优化、方法改变或问题。 我用Java实现了该算法。

现在,练习。 作为输入,我有一个文本,作为输出,我应该返回该文本的每一行是否是每一行的字谜。 也就是说,对于输入:

A Cab Deed Huffiest Minnows Loll
出租车契​​约哈菲斯特米诺洛尔
出租车契​​约洗牌了数百万美元
出租车契​​约洗牌了百万镇

程序应该返回 True。 输入:

出租车契约 Huffiest Minnows Loll
出租车契​​约哈菲斯特米诺闲逛
出租车契​​约洗牌了数百万美元
A Cab Deed Shuffles Million Town 的

输出必须为 False(当然是因为第二行)。

现在,我的想法非常简单:

  • 我创建 2 个 HashMap:ref 和 cur。
  • 我解析文本的第一行,填充 ref。我只会计算字母。
  • 对于每一行,我将该行解析为 cur 并检查是否 cur.equals(ref): if so return false
  • 如果我到达文本末尾,这意味着每一行都是每一行的字谜,所以我返回真。

而且……就是这样。 我尝试输入 88000 行的文本,速度相当快。

有什么意见吗?建议?优化?

非常感谢您的帮助。

I have been given an exercise about anagrams and it looked so really easy that I have a doubt I am missing something.
The solution I implemented is the one I will present shortly, and I wanted to ask you if you could think of any optimization, change of approach or problem with my solution.
I implemented the algorithm in Java.

Now, the exercise.
As input I have a text and as output I should return whether each line of this text is an anagram of each other line.
That is, for input:

A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town

The program should return True.
For input:

A Cab Deed Huffiest Minnows Loll
A Cab Deed Huffiest Minnow Lolls hi
A Cab Deed Shuffles Million Wont
A Cab Deed Shuffles Million Town

the output will have to be False (because of the second line, of course).

Now, what I thought is pretty straightforward:

  • I create 2 HashMap: ref and cur.
  • I parse the first line of the text, filling ref. I will count only alphabetical letters.
  • for each other line, I parse the line into cur and check if cur.equals(ref): if so return false
  • if I get to the end of the text, it means that each line is an anagram of each other line, so I return true.

And...this would be it.
I tried it with an input text of 88000 lines, and it works pretty fast.

Any comments? Suggestions? Optimizations?

Thank you very much for the help.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

左耳近心 2024-12-15 02:05:20

另一种选择是:

  1. 从字符串中删除您不关心的所有字符(标点符号、空格)
  2. 将其设为小写
  3. 对字符串进行排序
  4. 与参考字符串进行比较(使用 .equals

我怀疑您的方法更快尽管。

编辑:

由于@nibot 不同意我的建议,而且我不是一个在没有证据的情况下来回争论的人,这里有三个解决方案

它们的实现都非常相似:

  1. 将行转换为小写
  2. 忽略非字母字符
  3. 检查 3. 的结果是否与第一行的结果匹配

?部分是以下之一:

  • 制作字符计数的 HashMap
  • 字符进行排序
  • 制作一个 26-int 数组(终极哈希表解决方案,但仅适用于拉丁字母)

我用以下

public static void time(String name, int repetitions, Function function,
        int expectedResult) throws Exception {
    long total = 0;
    for (int i = 0; i < repetitions; i++) {
        System.gc();
        long start = System.currentTimeMillis();
        int result = function.call();
        long end = System.currentTimeMillis();
        if (result != expectedResult) {
            System.out.println("Oops, " + name + " is broken");
            return;
        }
        total += end - start;
    }
    System.out.println("Executution of " + name + " took "
            + (total / repetitions) + " ms on average");
}

命令 运行它们:文件与 OP 发布的文件类似,但明显更长,并在末尾添加了约 20 行的非字谜,以确保算法全部正常工作。

我一直得到这样的结果:

Execution of testWithHashMap took 158 ms on average
Execution of testWithSorting took 76 ms on average
Execution of testWithArray took 56 ms on average

如果满足以下条件,则 HashMap 可以得到显着改进:

但是,这些不在标准库中,所以我忽略它们(就像大多数使用 Java 的程序员会)。

这个故事的寓意是大O并不代表一切。您需要考虑开销和n的大小。在这种情况下,n 相当小,并且 HashMap 的开销非常大。如果线路更长,情况可能会改变,但不幸的是,我不想弄清楚盈亏平衡点在哪里。

如果您仍然不相信我,请考虑 GCC 使用 在某些情况下,插入排序在其 C++ 标准库中。

Another option is:

  1. Strip all characters you don't care about from the string (punctuation, whitespace)
  2. Make it lowercase
  3. Sort the string
  4. Compare to the reference string (with .equals)

I suspect your way is faster though.

EDIT:

Since @nibot disagrees with my even suggesting this, and I'm not one to argue back and forth without proof, here's three solutions.

They're all implemented very similarly:

  1. Convert line to lowercase
  2. Ignore non-alphabetic characters
  3. ?
  4. Check of the result of 3. matches the result from the first line

The ? part is one of:

  • Make a HashMap of character counts
  • Sorting the characters
  • Making a 26-int array (the ultimate hash table solution, but only works for the Latin alphabet)

I ran them all with this:

public static void time(String name, int repetitions, Function function,
        int expectedResult) throws Exception {
    long total = 0;
    for (int i = 0; i < repetitions; i++) {
        System.gc();
        long start = System.currentTimeMillis();
        int result = function.call();
        long end = System.currentTimeMillis();
        if (result != expectedResult) {
            System.out.println("Oops, " + name + " is broken");
            return;
        }
        total += end - start;
    }
    System.out.println("Executution of " + name + " took "
            + (total / repetitions) + " ms on average");
}

My file is similar to the one the OP posted, but made significantly longer, with a non-anagram about 20 lines from the end to ensure that the algorithms all work.

I consistently get results like this:

Execution of testWithHashMap took 158 ms on average
Execution of testWithSorting took 76 ms on average
Execution of testWithArray took 56 ms on average

The HashMap one could be significantly improved if:

But, these aren't in the standard library, so I'm ignoring them (just like most programmers using Java would).

The moral of the story is that big O isn't everything. You need to consider the overhead and the size of n. In this case, n is fairly small, and the overhead of a HashMap is significant. With longer lines, that would likely change, but unfortunately I don't feel like figuring out where the break-even point is.

And if you still don't believe me, consider that GCC uses insertion sort in some cases in its C++ standard library.

烟雨凡馨 2024-12-15 02:05:20

假设你的 HashMap 是从 (character) -> 的映射(字符串中出现的次数),你几乎已经拥有了。

我认为您应该忽略空格和标点符号,并将大写字母和小写字母视为相同。如果您不使用英语以外的任何语言,那么 HashMap 就有点过分了:您可以简单地使用代表 A..Z 的 26 个计数的数组。如果你需要支持 Unicode 那么问题当然要复杂得多,因为你不仅需要处理可能数千种不同类型的字母,而且还必须定义“字母”(幸运的是存在字符属性数据对此有帮助)和“小写/大写”(请注意,有些语言没有大小写,有些可以将两个小写字母映射为一个大写字母,反之亦然......)。更不用说标准化了:)

Assuming that your HashMap is a mapping from (character) -> (number of occurrences in the string), you pretty much have it.

I assume that you're supposed to ignore whitespace and punctuation, and treat uppercase and lowercase letters as the same. If you're not using any languages other than English, then a HashMap is overkill: you can simply use an array of 26 counts representing A..Z. If you need to support Unicode then the problem is of course much more complicated, as not only do you need to deal with possibly thousands of different kinds of letters, but you also have to define 'letter' (fortunately there exists character property data that helps with this) and 'lowercase/uppercase' (note that some languages don't have case, some can map two lowercase letters into a single uppercase one or vice-versa...). To say nothing of normalization :)

冰魂雪魄 2024-12-15 02:05:20

构建 @Karl Knechtel 的答案(并解决您对支持多个字母表的担忧):

  • 创建接口(例如)AnagramKey 和 AnagramKeyFactory。将应用程序的其余部分设计为与所使用的密钥类型无关。

  • 创建 AnagramKey 接口的一种实现,在内部使用 int[] 来表示字符计数。

  • 创建 AnagramKey 接口的第二个实现,它使用 HashMap 来表示字符计数。

  • 创建对应的工厂接口。

  • 在使用命令行参数、区域设置或其他参数表示键的两种方式之间进行选择。

注:

  1. 目前尚不清楚“字谜”在非字母语言的上下文中是否有意义,或者对于将多种语言混合成“句子”的话语是否有意义。另外,我不知道(例如)法语中的字谜是否会忽略字符上的重音。无论如何,我很想将所有这些情况都视为“超出范围”......除非您有明确的要求来支持它们。

    目前

  2. int[] 使用的空间少于 HashMap 时的盈亏平衡密度在 15 分之一的范围内渐近地约为 1 个字符计数数组中的字符。 (具有这些键/值类型的 HashMap 中的每个条目占用 15 个 32 位字的区域。)并且这没有考虑 HashMap 节点和哈希数组节点的开销...

  3. 如果对字谜词的长度进行限制,则可以通过使用 short[] 甚至 byte[] 来节省更多空间字符数。

Building in @Karl Knechtel's answer (and addressing your concern about supporting multiple alphabets):

  • Create interfaces (say) AnagramKey and AnagramKeyFactory. Design the rest of the application to be agnostic of the type of key used.

  • Create one implementation of the AnagramKey interface that internally uses an int[] to represent the character counts.

  • Create a second implementation of the AnagramKey interface that uses a HashMap<Character, Integer> to represent the character counts.

  • Create the corresponding factory interfaces.

  • Choose between the two ways of representing the keys using a command line parameter, the Locale, or something else.

Notes:

  1. It is not clear that "anagrams" make sense in the context of non-alphabetic languages, or for utterances that mix multiple languages into a "sentence". Also, I don't know whether anagrams in (say) French ignore the accents on characters. At any rate, I would be tempted to rule all of these cases as "out of scope" ... unless you have an explicit requirement to support them.

  2. The break-even density at which an int[] uses less space than a HashMap<Character, Integer> is asymptotically around 1 character in 15 across the range of characters in your count array. (Each entry in a HashMap with those key/value types occupies in the region of 15 32-bit words.) And that doesn't take account of the overheads of the HashMap node and the hash array node ...

  3. If you place limits on the length of the anagrams, you can save more space by using a short[] or even a byte[] for the character counts.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文