High-level Java optimization
There are many questions, answers, and opinions about how to do low-level Java optimization with for, while, and do-while loops, and whether it's even necessary.
My question is more about high-level optimization in design. Let's assume I have to do the following:
For a given string input, count the occurrence of each letter in the string.
This is not a major problem when the string is a few sentences, but what if instead we want to count the occurrence of each word in a 900,000-word file? Building loops just wastes time.
So what is the high-level design pattern that can be applied to this type of problem?
I guess my major point is that I tend to use loops to solve many problems, and I would like to get out of the habit of using loops.
Thanks in advance,
Sam
p.s. If possible, can you produce some pseudocode for solving the 900,000-word file problem? I tend to understand code better than I can understand English, which I assume is the same for most visitors of this site.
Comments (6)
The word count problem is one of the most widely covered problems in the Big Data world; it's kind of the Hello World of frameworks like Hadoop. You can find ample information throughout the web on this problem.
I'll give you some thoughts on it anyway.
First, 900,000 words might still be small enough to build a hashmap for, so don't discount the obvious in-memory approach. You said pseudocode is fine, so:
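(The original pseudocode block was not preserved; the following is a minimal sketch of the in-memory approach, with class and variable names of my own choosing.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

public class InMemoryWordCount {
    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            // normalize case so "Word" and "word" count as the same word
            counts.merge(input.next().toLowerCase(), 1, Integer::sum);
        }
        counts.forEach((word, n) -> System.out.println(n + " " + word));
    }
}
```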
Now once your dataset is too large to build an in-memory hashmap, you can do your counting like so:
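(The original list of steps was not preserved; here is a plausible reconstruction using standard Unix tools, with a hypothetical file name.)

```sh
# 1) tokenize: emit one word per line   2) sort: group identical words
# 3) uniq -c: count adjacent duplicates
tr -cs '[:alpha:]' '\n' < corpus.txt | sort | uniq -c
```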
These three steps go in a Unix pipeline. Let the OS do the work for you here.
Now, as you get even more data, you want to bring in map-reduce frameworks like Hadoop to do the word counting on clusters of machines.
Now, I've heard that when you get into obscenely large datasets, doing things in a distributed environment does not help anymore, because the transmission time overwhelms the counting time; and in your case of word counting, everything has to "be put back together anyway", so then you have to use some very sophisticated techniques that I suspect you can find in research papers.
ADDENDUM
The OP asked for an example of tokenizing the input in Java. Here is the easiest way:
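(The original listing is not shown here; this reconstruction is consistent with the surrounding text, which later changes the Scanner definition line. The class name WordGenerator is assumed.)

```java
import java.util.Scanner;

// Tokenizes standard input into words, writing each word
// to standard output on its own line.
public class WordGenerator {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        while (input.hasNext()) {
            System.out.println(input.next());
        }
    }
}
```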
Now here is an example of using it:
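(The original example was not preserved; here is one with a made-up input line.)

```sh
echo "Hey Moe! Woo woo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator
```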
With the sample input above, this outputs:
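```
Hey
Moe!
Woo
woo
woo
nyuk-nyuk
why
soitenly.
Hey.
```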
You can combine this tokenizer with sort and uniq like so:
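```sh
# LC_ALL=C pins a predictable sort order
echo "Hey Moe! Woo woo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | LC_ALL=C sort | uniq -c
```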
Yielding:
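```
      1 Hey
      1 Hey.
      1 Moe!
      1 Woo
      1 nyuk-nyuk
      1 soitenly.
      1 why
      2 woo
```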
Now if you only want to keep letters and throw away all punctuation, digits and other characters, change your scanner definition line to:
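(The original pattern is not shown; one delimiter that matches the description, and also explains the blank line mentioned at the end, is a single-character pattern, since consecutive non-letters then produce empty tokens.)

```java
// treat every single non-letter character as a delimiter
Scanner input = new Scanner(System.in).useDelimiter("\\P{L}");
```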
And now
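```sh
echo "Hey Moe! Woo woo woo nyuk-nyuk why soitenly. Hey." | java WordGenerator | LC_ALL=C sort | uniq -c
```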
Yields:
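```
      3 
      2 Hey
      1 Moe
      1 Woo
      2 nyuk
      1 soitenly
      1 why
      2 woo
```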
There is a blank line in the output; I'll let you figure out how to whack it. :)
The fastest solution to this is O(n), AFAIK: use a loop to iterate over the string, get each character, and update its count in a HashMap accordingly. At the end, the HashMap contains all the characters that occurred and a count of their occurrences.
Some pseudo-code (may not compile):
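(The snippet itself was not preserved; here is a sketch along those lines, mine rather than the answerer's original.)

```java
import java.util.HashMap;
import java.util.Map;

public class CharCounter {
    public static void main(String[] args) {
        String input = "count the occurrence of each letter"; // any input string
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < input.length(); i++) {
            // single pass over the string: O(n)
            counts.merge(input.charAt(i), 1, Integer::sum);
        }
        System.out.println(counts);
    }
}
```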
It's hard to do much better than using a loop to solve this problem. IMO, the best way to speed up this sort of operation is to split the workload into different units of work and process those units with different processors (using threads, for example, if you have a multiprocessor computer).
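(As an illustration, mine rather than the answerer's: a parallel stream splits the characters into units of work across cores, and ConcurrentHashMap.merge keeps the updates thread-safe.)

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ParallelCharCounter {
    public static void main(String[] args) {
        String input = "count the occurrence of each letter".repeat(10_000); // Java 11+
        Map<Character, Long> counts = new ConcurrentHashMap<>();
        // The parallel stream divides the characters into batches,
        // one unit of work per core in the common fork-join pool.
        input.chars().parallel()
             .forEach(c -> counts.merge((char) c, 1L, Long::sum));
        System.out.println(counts.get('e'));
    }
}
```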
You shouldn't assume 900,000 is a lot of words. If you have a CPU with 8 threads at 3 GHz, that's 24 billion clock cycles per second. ;)
However, for counting characters, using an int[] will be much faster: there are only 65,536 possible char values. A minimal sketch of the idea (the original code and the output it prints were not preserved):
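```java
public class ArrayCharCounter {
    public static void main(String[] args) {
        String input = "count the occurrence of each letter"; // any input string
        int[] counts = new int[65536];           // one slot per possible char value
        for (int i = 0; i < input.length(); i++)
            counts[input.charAt(i)]++;
        for (int c = 0; c < counts.length; c++)  // print only chars that occurred
            if (counts[c] > 0)
                System.out.println((char) c + "=" + counts[c]);
    }
}
```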
Even at 11 times the number of words, this takes only a fraction of a second.
A much longer parallel version is a little faster. One way to parallelize the array-based count (my sketch; the original code and its printed timings were not preserved):
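```java
import java.util.stream.IntStream;

public class ParallelArrayCharCounter {
    static int[] countRange(String s, int from, int to) {
        int[] counts = new int[65536];
        for (int i = from; i < to; i++)
            counts[s.charAt(i)]++;
        return counts;
    }

    public static void main(String[] args) {
        String input = "count the occurrence of each letter".repeat(100_000); // Java 11+
        int chunks = Runtime.getRuntime().availableProcessors();
        int len = input.length();

        // Count each chunk on its own core, then merge the per-chunk arrays.
        int[][] partials = IntStream.range(0, chunks).parallel()
                .mapToObj(c -> countRange(input, c * len / chunks, (c + 1) * len / chunks))
                .toArray(int[][]::new);

        int[] total = new int[65536];
        for (int[] part : partials)
            for (int i = 0; i < total.length; i++)
                total[i] += part[i];

        System.out.println("'e' occurs " + total['e'] + " times");
    }
}
```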
But for a String with fewer than a million words, it's not likely to be worth it.
As a general rule, you should just write things in a straightforward way, and then do performance tuning to make it as fast as possible.
If that means putting in a faster algorithm, do so, but at first, keep it simple.
For a small program like this, it won't be too hard.
The essential skill in performance tuning is not guessing.
Instead, let the program itself tell you what to fix.
This is my method.
For more involved programs, like this one, experience will show you how to avoid the over-thinking that ends up causing a lot of the poor performance it is trying to avoid.
You have to use a divide-and-conquer approach and avoid races for resources. There are different approaches and/or implementations for that; the idea is the same: split the work and parallelize the processing.
On a single machine you can process chunks of the data in separate threads, although having the chunks on the same disk will slow things down considerably. Having more threads also means more context switching; for throughput it is IMHO better to have a smaller number of threads and keep them busy.
You can split the processing into stages and use SEDA or something similar, and for really big data use map-reduce, just factoring in the expense of distributing the data across the cluster.
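(To make the staging idea concrete, here is a minimal two-stage sketch of my own; the queue capacity and the end-of-input marker are arbitrary. A reader stage tokenizes input onto a queue, and a counter stage drains it.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class StagedWordCount {
    // arbitrary sentinel, assumed never to appear as a real word
    private static final String EOF_MARKER = "__EOF__";

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        Map<String, Integer> counts = new HashMap<>();

        // Stage 1: tokenize standard input and hand words to the next stage.
        Thread reader = new Thread(() -> {
            try {
                Scanner in = new Scanner(System.in);
                while (in.hasNext())
                    queue.put(in.next());
                queue.put(EOF_MARKER);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();

        // Stage 2: drain the queue and count.
        for (String word; !(word = queue.take()).equals(EOF_MARKER); )
            counts.merge(word, 1, Integer::sum);

        reader.join();
        System.out.println(counts);
    }
}
```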
I'll be glad if somebody can point to another widely used API.