Reading a long string into memory

Posted 2024-09-25 19:40:39

I have a very large string, and when I read it in Java I get an out-of-memory error. What I actually need to do is read the whole string into memory, split it into individual strings, and sort them by value. What is the best way to do this?

Thanks

Comments (4)

征﹌骨岁月お 2024-10-02 19:40:39

Where does your large String come from? Since you say you read it, I assume it comes from a file. Do you need the whole String in order to know where to split it? If not, you could read the file char by char until you hit a split marker, put the chars read so far into a String, and start reading the next one. Do you also know roughly where each String you just read belongs in the sorted order? If so, in a first pass you could write the partial Strings to separate files (e.g. when sorting alphabetically, all Strings starting with A go to A.tmp). After that, you can sort the contents of each created file (hopefully now small enough to fit in memory) and finally append them, in order, to a new output file.
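The two passes described above could look roughly like the following. This is only a sketch under some assumptions: the class and method names (`BucketSort`, `partition`, `sortBuckets`) are made up for illustration, the separator is a single char, tokens contain no line breaks (each bucket file stores one token per line), and ordering is plain `String` natural order. Buckets are keyed by the first char, so an input with many distinct first chars would open many files at once.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

class BucketSort {
    /** Pass 1: stream tokens char by char into one temp file per leading character. */
    static SortedMap<Character, Path> partition(Reader in, char delimiter, Path tmpDir)
            throws IOException {
        SortedMap<Character, Path> files = new TreeMap<>();
        Map<Character, BufferedWriter> writers = new HashMap<>();
        try {
            StringBuilder token = new StringBuilder();
            while (true) {
                int c = in.read();
                if (c == delimiter || c == -1) {
                    if (token.length() > 0) {          // empty tokens are skipped
                        char bucket = token.charAt(0);
                        BufferedWriter w = writers.get(bucket);
                        if (w == null) {
                            // Name the file by code point to avoid illegal file names.
                            Path p = tmpDir.resolve((int) bucket + ".tmp");
                            files.put(bucket, p);
                            w = Files.newBufferedWriter(p, StandardCharsets.UTF_8);
                            writers.put(bucket, w);
                        }
                        w.write(token.toString());
                        w.newLine();
                        token.setLength(0);
                    }
                    if (c == -1) break;                // end of input
                } else {
                    token.append((char) c);
                }
            }
        } finally {
            for (BufferedWriter w : writers.values()) w.close();
        }
        return files;
    }

    /** Pass 2: sort each bucket in memory and append it to the output, in bucket order. */
    static void sortBuckets(SortedMap<Character, Path> buckets, Path output)
            throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path p : buckets.values()) {
                List<String> lines = new ArrayList<>(Files.readAllLines(p, StandardCharsets.UTF_8));
                Collections.sort(lines);
                for (String s : lines) {
                    out.write(s);
                    out.newLine();
                }
                Files.delete(p);                       // bucket no longer needed
            }
        }
    }
}
```

When reading from a real file, wrapping the source in a `BufferedReader` keeps the char-by-char loop from hitting the disk on every call. If a single bucket is still too big for the heap, it would have to be split further, for example on the first two chars.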

謸气贵蔟 2024-10-02 19:40:39

If you are limited by memory, you could try applying a merge sort; otherwise, increase the heap size using the JVM options -Xmx and -Xms.
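A minimal sketch of the merge sort idea, assuming what is meant is an external merge sort: sort chunks that fit in memory, write each one to disk as a sorted run, then merge the runs. The names (`ExternalSort`, `writeSortedRuns`, `merge`) and the `maxTokens` limit are hypothetical. Tokens are assumed to contain no line breaks and are compared in plain `String` order.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

class ExternalSort {
    /** Phase 1: cut the input into sorted runs of at most maxTokens tokens each. */
    static List<Path> writeSortedRuns(Reader in, String delimiterRegex, int maxTokens, Path tmpDir)
            throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> chunk = new ArrayList<>();
        try (Scanner sc = new Scanner(in)) {
            sc.useDelimiter(delimiterRegex);
            while (sc.hasNext()) {
                chunk.add(sc.next());
                if (chunk.size() == maxTokens) runs.add(flush(chunk, tmpDir));
            }
        }
        if (!chunk.isEmpty()) runs.add(flush(chunk, tmpDir));
        return runs;
    }

    /** Sorts the chunk in memory, writes it one token per line, and empties it. */
    private static Path flush(List<String> chunk, Path tmpDir) throws IOException {
        Collections.sort(chunk);
        Path run = Files.createTempFile(tmpDir, "run", ".txt");
        Files.write(run, chunk, StandardCharsets.UTF_8);
        chunk.clear();
        return run;
    }

    /** Phase 2: k-way merge of the runs using a min-heap of each run's current line. */
    static void merge(List<Path> runs, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        // Heap entries: (current line, index of the run it came from).
        Comparator<Map.Entry<String, Integer>> byLine = Map.Entry.comparingByKey();
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byLine);
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (int i = 0; i < runs.size(); i++) {
                BufferedReader r = Files.newBufferedReader(runs.get(i), StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new AbstractMap.SimpleEntry<>(first, i));
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, Integer> head = heap.poll();
                out.write(head.getKey());
                out.newLine();
                // Refill the heap from the run that just supplied the smallest line.
                String next = readers.get(head.getValue()).readLine();
                if (next != null) heap.add(new AbstractMap.SimpleEntry<>(next, head.getValue()));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }
}
```

Only one chunk (phase 1) or one line per run (phase 2) is held in memory at a time. The heap-size route needs no code: it is a launch option, along the lines of `java -Xms512m -Xmx4g YourMainClass`, where the sizes and the class name are placeholders.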

芸娘子的小脾气 2024-10-02 19:40:39

If you want Hadoop to process a 100 GiB Apache logfile "line by line", you are essentially doing the same thing you want here: splitting a large body of text into pieces.

The normal way to do that in Hadoop (since you tagged the question with it) is to use TextInputFormat, which uses LineRecordReader, which in turn uses LineReader to split the text file on the "end-of-line" separator. What you want is essentially the same with one difference: you split on something else.

Sorting the resulting values (in Hadoop) is essentially done by employing what is called a "Secondary Sort" (see the Hadoop example and the explanation in Tom's book).

So what I would recommend doing is

  1. Make your own variation on TextInputFormat/LineRecordReader/LineReader that reads and extracts the individual parts of your String based on your separator.
  2. Create a map that rewrites the information so it can be used for Secondary Sort.
  3. Create the correct partition, group and key comparator classes/methods to do the sorting.
  4. Create a reduce where you receive the sorted information, which you can then process further.
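To make steps 2 to 4 a bit more concrete, here is a plain-Java, single-process imitation of the Secondary Sort idea. It deliberately uses no Hadoop classes, so none of this is Hadoop API. All names (`SecondarySortSketch`, `CompositeKey`, `map`, `partition`, `shuffleAndGroup`) are invented for illustration, and using the first character as the natural key is just an assumption to have something to group on.

```java
import java.util.*;

class SecondarySortSketch {
    /** Composite key: naturalKey decides the group, value decides the order inside it. */
    static final class CompositeKey {
        final String naturalKey;
        final String value;

        CompositeKey(String naturalKey, String value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    /** "Map" step: rewrite a non-empty token into a composite key. */
    static CompositeKey map(String token) {
        return new CompositeKey(token.substring(0, 1), token);
    }

    /** Key comparator: natural key first, then value. */
    static final Comparator<CompositeKey> SORT = Comparator
            .comparing((CompositeKey k) -> k.naturalKey)
            .thenComparing(k -> k.value);

    /** Partitioner: looks only at the natural key, so a whole group lands on one reducer. */
    static int partition(CompositeKey k, int numReducers) {
        return (k.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Imitates shuffle + grouping: sort by composite key, then group by natural key. */
    static Map<String, List<String>> shuffleAndGroup(List<CompositeKey> mapOutput) {
        List<CompositeKey> sorted = new ArrayList<>(mapOutput);
        sorted.sort(SORT);
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.naturalKey, x -> new ArrayList<>()).add(k.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> mapped = new ArrayList<>();
        for (String token : new String[] {"banana", "apple", "blueberry", "avocado"}) {
            mapped.add(map(token));
        }
        // Each "reduce call" sees one natural key with its values already sorted.
        shuffleAndGroup(mapped).forEach((key, values) -> System.out.println(key + " -> " + values));
    }
}
```

The point of the composite key is that the framework's sort does the ordering work, so the reduce side never has to hold a whole group in memory to sort it. In a real job the comparator, partitioner and grouping logic would be separate classes registered on the job.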

HTH
