当前位置：文江博客话题详情

高级 Java 优化

发布于 2024-11-29 03:54:29 字数 419 浏览 3 评论 0原文

关于如何使用 for、while 和 do-while 循环进行低级 Java 优化，以及是否有必要，存在许多问题、答案和意见。

我的问题更多的是基于高级的设计优化。假设我必须执行以下操作：

对于给定的字符串输入，计算字符串中每个字母的出现次数。

当字符串是几个句子时，这不是一个主要问题，但如果我们想要计算 900,000 个单词的文件中每个单词的出现次数，该怎么办？构建循环只会浪费时间。

那么可以应用于此类问题的高级设计模式是什么？

我想我的主要观点是我倾向于使用循环来解决很多问题，并且我想改掉使用循环的习惯。

预先感谢

Sam

p.s.如果可能的话，你能生成一些伪代码来解决 900,000 字的文件问题，我倾向于理解代码比理解英语更好，我认为这对于本网站的大多数访问者来说都是一样的

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

披肩女神 2024-12-06 03:54:29

字数统计问题是大数据世界中覆盖最广泛的问题之一；它有点像 Hadoop 等框架的 Hello World。您可以在网络上找到有关此问题的大量信息。

无论如何，我会给你一些想法。

首先，900000 个单词可能仍然足够小，无法为其构建哈希图，因此不要忽视明显的内存中方法。你说伪代码很好，所以：

h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
  h[w] = w in h ? h[w]++ : 1
}

现在，一旦你的数据集太大而无法构建内存中的哈希图，你可以像这样进行计数：

Tokenize into words writing each word to a single line in a file
Use the Unix sort command to produce the next file
Count as you traverse the sorted file

这三个步骤在 Unix 管道中进行。让操作系统在这里为您完成工作。

现在，当您获得更多数据时，您希望引入 hadoop 等 Map-Reduce 框架来对机器集群进行字数统计。

现在，我听说当你进入非常大的数据集时，在分布式环境中做事不再有帮助，因为传输时间压倒了计数时间，并且在字数计数的情况下，所有内容都必须“重新组合在一起”无论如何”，所以你必须使用一些非常复杂的技术，我怀疑你可以在研究论文中找到这些技术。

ADDENDUM

OP 要求提供一个在 Java 中标记输入的示例。这是最简单的方法：

import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * on per line.  Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    } 
}

现在这是一个使用它的示例：

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator

此输出

hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.

您可以将此分词器与排序和 uniq 结合起来，如下所示：

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq

Yielding

hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo

Now 如果您只想保留字母并扔掉所有标点符号、数字和其他字符，请更改您的扫描仪定义行为：

Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));

现在

echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq

产量

hey
moe
nyuk
soitenly
why
woo

输出中有一个空行；我会让你弄清楚如何打击它。 :)

The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.

I'll give you some thoughts on it anyway.

First, 900000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:

h = new HashMap<String, Integer>();
for each word w picked up while tokenizing the file {
  h[w] = w in h ? h[w]++ : 1
}

Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:

Tokenize into words writing each word to a single line in a file
Use the Unix sort command to produce the next file
Count as you traverse the sorted file

These three steps go in a Unix pipeline. Let the OS do the work for you here.

Now, as you get even more data, you want to bring in map-reduce frameworks like hadoop to do the word counting on clusters of machines.

Now, I've heard when you get into obscenely large datasets, doing things in a distributed enviornment does not help anymore, because the transmission time overwhelms the counting time, and in your case of word counting, everything has to "be put back together anyway" so then you have to use some very sophisticated techniques that I suspect you can find in research papers.

ADDENDUM

The OP asked for an example of tokenizing the input in Java. Here is the easiest way:

import java.util.Scanner;
public class WordGenerator {
    /**
     * Tokenizes standard input into words, writing each word to standard output,
     * on per line.  Because it reads from standard input and writes to standard
     * output, it can easily be used in a pipeline combined with sort, uniq, and
     * any other such application.
     */
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next().toLowerCase());
        }
    } 
}

Now here is an example of using it:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator

This outputs

hey
moe!
woo
woo
woo
nyuk-nyuk
why
soitenly.
hey.

You can combine this tokenizer with sort and uniq like so:

echo -e "Hey Moe! Woo\nwoo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | sort | uniq

Yielding

hey
hey.
moe!
nyuk-nyuk
soitenly.
why
woo

Now if you only want to keep letters and throw away all punctuation, digits and other characters, change your scanner definition line to:

Scanner input = new Scanner(System.in).useDelimiter(Pattern.compile("\\P{L}"));

And now

echo -e "Hey Moe! Woo\nwoo woo^nyuk-nyuk why#2soitenly. Hey." | java WordGenerator | sort | uniq

Yields

hey
moe
nyuk
soitenly
why
woo

There is a blank line in the output; I'll let you figure out how to whack it. :)

回复收藏 0 原文

翻身的咸鱼 2024-12-06 03:54:29

最快的解决方案是 O(n) AFAIK 使用循环来迭代字符串，获取字符并相应地更新 HashMap 中的计数。最后，HashMap 包含所有出现的字符以及所有出现次数的计数。

一些伪代码（可能无法编译）

HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}

The fastest solution to this is O(n) AFAIK use a loop to iterate the string, get the character and update the count in a HashMap accordingly. At the end the HashMap contains all the characters that occurred and a count of all the occurrences.

Some pseduo-code (may not compile)

HashMap<Character, Integer> map = new HashMap<Character, Integer>();
for (int i = 0; i < str.length(); i++)
{
    char c = str.charAt(i);
    if (map.containsKey(c)) map.put(c, map.get(c) + 1);
    else map.put(c, 1);
}

回复收藏 0 原文

流年已逝 2024-12-06 03:54:29

对于您来说，很难找到比使用循环更好的方法来解决这个问题。 IMO，加速此类操作的最佳方法是将工作负载拆分为不同的工作单元，并使用不同的处理器处理工作单元（例如，如果您有多处理器计算机，则使用线程）。

回复收藏 0 原文

以可爱出名 2024-12-06 03:54:29

您不应该认为 900,000 个字数很多。如果您的 CPU 有 8 个线程和 3 GHZ，则每秒有 240 亿个时钟周期。 ;)

但是，使用 int[] 来计算字符会快得多。仅有 65,536 个可能的字符。

StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE];
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());

打印

Took 111 ms to count 139,715,647 characters

即使是字数的 11 倍也只需要几分之一秒的时间。

更长的并行版本会更快一些。

public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE];
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}

但对于一个少于一百万字的字符串

Took 45 ms to count 139,715,537 characters

来说，它可能不值得。

You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads and 3 GHZ that's 24 billion clock cycles per second. ;)

However for counting characters using an int[] will be much faster. There is only 65,536 possible characters.

StringBuilder words = new StringBuilder();
Random rand = new Random();
for (int i = 0; i < 10 * 1000 * 1000; i++)
    words.append(Long.toString(rand.nextLong(), 36)).append(' ');
String text = words.toString();

long start = System.nanoTime();
int[] charCount = new int[Character.MAX_VALUE];
for (int i = 0; i < text.length(); i++)
    charCount[text.charAt(i)]++;
long time = System.nanoTime() - start;
System.out.printf("Took %,d ms to count %,d characters%n", time / 1000/1000, text.length());

prints

Took 111 ms to count 139,715,647 characters

Even 11x times the number of words takes a fraction of a second.

A much longer parallel version is a little faster.

public static void main(String... args) throws InterruptedException, ExecutionException {
    StringBuilder words = new StringBuilder();
    Random rand = new Random();
    for (int i = 0; i < 10 * 1000 * 1000; i++)
        words.append(Long.toString(rand.nextLong(), 36)).append(' ');
    final String text = words.toString();

    long start = System.nanoTime();
    // start a thread pool to generate 4 tasks to count sections of the text.
    final int nThreads = 4;
    ExecutorService es = Executors.newFixedThreadPool(nThreads);
    List<Future<int[]>> results = new ArrayList<Future<int[]>>();
    int blockSize = (text.length() + nThreads - 1) / nThreads;
    for (int i = 0; i < nThreads; i++) {
        final int min = i * blockSize;
        final int max = Math.min(min + blockSize, text.length());
        results.add(es.submit(new Callable<int[]>() {
            @Override
            public int[] call() throws Exception {
                int[] charCount = new int[Character.MAX_VALUE];
                for (int j = min; j < max; j++)
                    charCount[text.charAt(j)]++;
                return charCount;
            }
        }));
    }
    es.shutdown();
    // combine the results.
    int[] charCount = new int[Character.MAX_VALUE];
    for (Future<int[]> resultFuture : results) {
        int[] result = resultFuture.get();
        for (int i = 0, resultLength = result.length; i < resultLength; i++) {
            charCount[i] += result[i];
        }
    }
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ms to count %,d characters%n", time / 1000 / 1000, text.length());
}

prints

Took 45 ms to count 139,715,537 characters

But for a String with less than a million words its not likely to be worth it.

回复收藏 0 原文

倾城泪 2024-12-06 03:54:29

作为一般规则，您应该以简单的方式编写内容，然后进行性能调整以使其尽可能快。
如果这意味着采用更快的算法，那就这样做，但首先要保持简单。
对于这样的小程序来说，并不会太难。

性能调优的基本技巧是而不是猜测。
相反，让程序本身告诉您要修复什么。
这是我的方法。

如需了解更多相关计划，像这个，经验会告诉你如何避免过度 -认为这最终会导致许多它试图避免的糟糕表现。

回复收藏 0 原文

滿滿的愛 2024-12-06 03:54:29

您必须使用分而治之的方法并避免资源竞争。对此有不同的方法和/或实现。想法是相同的 - 分割工作并并行处理。

在单台机器上，您可以在单独的线程中处理数据块，尽管将数据块放在同一磁盘上会大大减慢速度。 H 拥有更多的线程意味着有更多的上下文切换，恕我直言，吞吐量最好是拥有更少的线程并让它们保持忙碌。

您可以将处理分为多个阶段，并使用SEDA或类似的东西，并且对于map-reduce所做的真正大数据 - 只需计算跨集群分发数据的费用即可。

我会很高兴有人指出另一个广泛使用的 API。

回复收藏 0 原文

~没有更多了~

关于作者

迟到的我

暂无简介

0 文章

0 评论

24 人气

关注发私信

友情链接

文江博客

高级 Java 优化

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

高级 Java 优化

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（6）

关于作者

相关话题

热门标签

推荐作者

胡图图

zt006

z祗昰~

冰葑

野の

天空

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。