当前位置：文江博客话题详情

字符串数组仅包含字谜？

发布于 2024-12-08 02:05:20 字数 745 浏览 2 评论 0原文

有人给我做了一个关于字谜的练习，它看起来非常简单，我怀疑我错过了什么。我实施的解决方案是我很快就会介绍的解决方案，我想问您是否可以想到我的解决方案有任何优化、方法改变或问题。我用Java实现了该算法。

现在，练习。作为输入，我有一个文本，作为输出，我应该返回该文本的每一行是否是每一行的字谜。也就是说，对于输入：

A Cab Deed Huffiest Minnows Loll
出租车契约哈菲斯特米诺洛尔
出租车契约洗牌了数百万美元
出租车契约洗牌了百万镇

程序应该返回 True。输入：

出租车契约 Huffiest Minnows Loll
出租车契约哈菲斯特米诺闲逛
出租车契约洗牌了数百万美元
A Cab Deed Shuffles Million Town 的

输出必须为 False（当然是因为第二行）。

现在，我的想法非常简单：

我创建 2 个 HashMap：ref 和 cur。
我解析文本的第一行，填充 ref。我只会计算字母。
对于每一行，我将该行解析为 cur 并检查是否 cur.equals(ref): if so return false
如果我到达文本末尾，这意味着每一行都是每一行的字谜，所以我返回真。

而且……就是这样。我尝试输入 88000 行的文本，速度相当快。

有什么意见吗？建议？优化？

非常感谢您的帮助。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

左耳近心 2024-12-15 02:05:20

另一种选择是：

从字符串中删除您不关心的所有字符（标点符号、空格）
将其设为小写
对字符串进行排序
与参考字符串进行比较（使用 .equals）

我怀疑您的方法更快尽管。

编辑：

由于@nibot 不同意我的建议，而且我不是一个在没有证据的情况下来回争论的人，这里有三个解决方案。

它们的实现都非常相似：

将行转换为小写
忽略非字母字符
？
检查 3. 的结果是否与第一行的结果匹配

？部分是以下之一：

制作字符计数的 HashMap 对
字符进行排序
制作一个 26-int 数组（终极哈希表解决方案，但仅适用于拉丁字母）

我用以下

public static void time(String name, int repetitions, Function function,
        int expectedResult) throws Exception {
    long total = 0;
    for (int i = 0; i < repetitions; i++) {
        System.gc();
        long start = System.currentTimeMillis();
        int result = function.call();
        long end = System.currentTimeMillis();
        if (result != expectedResult) {
            System.out.println("Oops, " + name + " is broken");
            return;
        }
        total += end - start;
    }
    System.out.println("Executution of " + name + " took "
            + (total / repetitions) + " ms on average");
}

命令运行它们：文件与 OP 发布的文件类似，但明显更长，并在末尾添加了约 20 行的非字谜，以确保算法全部正常工作。

我一直得到这样的结果：

Execution of testWithHashMap took 158 ms on average
Execution of testWithSorting took 76 ms on average
Execution of testWithArray took 56 ms on average

如果满足以下条件，则 HashMap 可以得到显着改进：

有一种方法可以制作 HashMap
有一种方法可以指定 a 中的默认值HashMap 以及一种获取和递增的方法（因此只有一次查找而不是 2 次）

但是，这些不在标准库中，所以我忽略它们（就像大多数使用 Java 的程序员会）。

这个故事的寓意是大O并不代表一切。您需要考虑开销和n的大小。在这种情况下，n 相当小，并且 HashMap 的开销非常大。如果线路更长，情况可能会改变，但不幸的是，我不想弄清楚盈亏平衡点在哪里。

如果您仍然不相信我，请考虑 GCC 使用 在某些情况下，插入排序在其 C++ 标准库中。

Another option is:

Strip all characters you don't care about from the string (punctuation, whitespace)
Make it lowercase
Sort the string
Compare to the reference string (with .equals)

I suspect your way is faster though.

EDIT:

Since @nibot disagrees with my even suggesting this, and I'm not one to argue back and forth without proof, here's three solutions.

They're all implemented very similarly:

Convert line to lowercase
Ignore non-alphabetic characters
?
Check of the result of 3. matches the result from the first line

The ? part is one of:

Make a HashMap of character counts
Sorting the characters
Making a 26-int array (the ultimate hash table solution, but only works for the Latin alphabet)

I ran them all with this:

public static void time(String name, int repetitions, Function function,
        int expectedResult) throws Exception {
    long total = 0;
    for (int i = 0; i < repetitions; i++) {
        System.gc();
        long start = System.currentTimeMillis();
        int result = function.call();
        long end = System.currentTimeMillis();
        if (result != expectedResult) {
            System.out.println("Oops, " + name + " is broken");
            return;
        }
        total += end - start;
    }
    System.out.println("Executution of " + name + " took "
            + (total / repetitions) + " ms on average");
}

My file is similar to the one the OP posted, but made significantly longer, with a non-anagram about 20 lines from the end to ensure that the algorithms all work.

I consistently get results like this:

Execution of testWithHashMap took 158 ms on average
Execution of testWithSorting took 76 ms on average
Execution of testWithArray took 56 ms on average

The HashMap one could be significantly improved if:

There was a way to make a HashMap<char, int>
There was a way to specify the default value of in a HashMap and a way to get-and-increment (so there would only be one lookup instead of 2)

But, these aren't in the standard library, so I'm ignoring them (just like most programmers using Java would).

The moral of the story is that big O isn't everything. You need to consider the overhead and the size of n. In this case, n is fairly small, and the overhead of a HashMap is significant. With longer lines, that would likely change, but unfortunately I don't feel like figuring out where the break-even point is.

And if you still don't believe me, consider that GCC uses insertion sort in some cases in its C++ standard library.

回复收藏 0 原文

烟雨凡馨 2024-12-15 02:05:20

假设你的 HashMap 是从 (character) -> 的映射（字符串中出现的次数），你几乎已经拥有了。

我认为您应该忽略空格和标点符号，并将大写字母和小写字母视为相同。如果您不使用英语以外的任何语言，那么 HashMap 就有点过分了：您可以简单地使用代表 A..Z 的 26 个计数的数组。如果你需要支持 Unicode 那么问题当然要复杂得多，因为你不仅需要处理可能数千种不同类型的字母，而且还必须定义“字母”（幸运的是存在字符属性数据对此有帮助）和“小写/大写”（请注意，有些语言没有大小写，有些可以将两个小写字母映射为一个大写字母，反之亦然......）。更不用说标准化了:)

回复收藏 0 原文

冰魂雪魄 2024-12-15 02:05:20

构建 @Karl Knechtel 的答案（并解决您对支持多个字母表的担忧）：

创建接口（例如）AnagramKey 和 AnagramKeyFactory。将应用程序的其余部分设计为与所使用的密钥类型无关。
创建 AnagramKey 接口的一种实现，在内部使用 int[] 来表示字符计数。
创建 AnagramKey 接口的第二个实现，它使用 HashMap 来表示字符计数。
创建对应的工厂接口。
在使用命令行参数、区域设置或其他参数表示键的两种方式之间进行选择。

注：

目前尚不清楚“字谜”在非字母语言的上下文中是否有意义，或者对于将多种语言混合成“句子”的话语是否有意义。另外，我不知道（例如）法语中的字谜是否会忽略字符上的重音。无论如何，我很想将所有这些情况都视为“超出范围”......除非您有明确的要求来支持它们。
目前
int[] 使用的空间少于 HashMap 时的盈亏平衡密度在 15 分之一的范围内渐近地约为 1 个字符计数数组中的字符。（具有这些键/值类型的 HashMap 中的每个条目占用 15 个 32 位字的区域。）并且这没有考虑 HashMap 节点和哈希数组节点的开销...
如果对字谜词的长度进行限制，则可以通过使用 short[] 甚至 byte[] 来节省更多空间字符数。