Storing numbers as ASCII text in Java?

Posted on 2024-11-05 17:35:50

This is probably a stupid question, but here's the thing. I was reading this question:

Storing 1 million phone numbers

and the accepted answer was what I had in mind: using a trie. In the comments, Matt Ball suggested:

I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion

Problem: how do I do that in Java? And does "ASCII text" mean a String?

Comments (4)

辞慾 2024-11-12 17:35:51

Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.

是你 2024-11-12 17:35:51

In answer to an implied, but different, question:

Q: You have 1 billion phone numbers and you need to send them over a low-bandwidth connection. You only need to send whether each phone number is in the collection or not. (No other information is required.)

A: This is the general approach:

  • First sort the list if it's not sorted already.
  • Starting from the lowest number, find regions of contiguous numbers. For each region, send its starting number and which numbers in it are taken. The taken numbers can be stored as a BitSet (1 bit per possible number). Start a new region, sending a new starting number and a new BitSet, whenever the gap exceeds some threshold.
  • Write the stream to a compressed data set.
  • Test this and compare it with simply sending all the numbers.
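The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `PhoneRangeEncoder`, the `maxGap` threshold parameter, and the wire layout (an 8-byte start, a 4-byte length, then the BitSet bytes) are made up for the example and do not come from the answer. It also assumes a region never spans more than `Integer.MAX_VALUE` numbers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class PhoneRangeEncoder {
    /** Encodes the numbers as (start, BitSet) regions and deflates the result. */
    static byte[] encode(long[] numbers, long maxGap) throws IOException {
        long[] sorted = numbers.clone();
        Arrays.sort(sorted);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new DeflaterOutputStream(baos))) {
            int i = 0;
            while (i < sorted.length) {
                long start = sorted[i];
                BitSet taken = new BitSet();
                taken.set(0);
                i++;
                // Extend the region while the next number is within maxGap of the previous one.
                while (i < sorted.length && sorted[i] - sorted[i - 1] <= maxGap) {
                    taken.set((int) (sorted[i] - start));
                    i++;
                }
                byte[] bits = taken.toByteArray();
                out.writeLong(start);       // first number of the region
                out.writeInt(bits.length);  // hypothetical framing: length prefix
                out.write(bits);            // 1 bit per possible number in the region
            }
        }
        return baos.toByteArray();
    }

    /** Reverses encode(): returns the sorted numbers. */
    static long[] decode(byte[] data) throws IOException {
        List<Long> result = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(data)))) {
            while (true) {
                long start;
                try {
                    start = in.readLong();
                } catch (EOFException endOfData) {
                    break;
                }
                byte[] bits = new byte[in.readInt()];
                in.readFully(bits);
                BitSet taken = BitSet.valueOf(bits);
                for (int b = taken.nextSetBit(0); b >= 0; b = taken.nextSetBit(b + 1)) {
                    result.add(start + b);
                }
            }
        }
        return result.stream().mapToLong(Long::longValue).toArray();
    }
}
```

Whether this beats plain compressed text depends on how dense the numbers are, which is why the last step (measuring against the simple approach) matters.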

You can use Strings in a sorted TreeMap. One million numbers is not very many and will use about 64 MB. I don't see the need for a more complex solution.

Recent versions of Java (9 and later) can store ASCII text efficiently by backing a String with a byte[] instead of a char[]; however, the overhead of your data structure is likely to be larger than the text itself.

If you need to store phone numbers as keys, you could store them on the assumption that large ranges will be contiguous. You could then store them like this:

NavigableMap<String, PhoneDetails[]>

In this structure, the key defines the start of a range, and the array holds the phone details for each number in that range. This need not be much bigger than one reference to a PhoneDetails per number (which is the minimum).
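A minimal sketch of how a lookup in such a map might work. The `PhoneRanges` wrapper, the `owner` field of `PhoneDetails`, and the method names are hypothetical; the answer only names the map type. The sketch also assumes every number is a digits-only String of the same length, so that the lexicographic order of the keys matches numeric order.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class PhoneRanges {
    /** Placeholder for whatever is stored per number (hypothetical field). */
    static final class PhoneDetails {
        final String owner;
        PhoneDetails(String owner) { this.owner = owner; }
    }

    // Key: first number of a contiguous range. Value: details for each number in it.
    private final NavigableMap<String, PhoneDetails[]> ranges = new TreeMap<>();

    void putRange(String firstNumber, PhoneDetails[] details) {
        ranges.put(firstNumber, details);
    }

    /** Returns the details for a number, or null if it is not stored. */
    PhoneDetails lookup(String number) {
        // floorEntry finds the range that starts at or before the number.
        Map.Entry<String, PhoneDetails[]> entry = ranges.floorEntry(number);
        if (entry == null) {
            return null;
        }
        long offset = Long.parseLong(number) - Long.parseLong(entry.getKey());
        PhoneDetails[] block = entry.getValue();
        return (offset >= 0 && offset < block.length) ? block[(int) offset] : null;
    }
}
```

A null slot in the array stands for a number inside the range that is not in use, so sparse ranges waste one reference per missing number.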

BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory; in fact, you can just discard it, as it won't ever be needed.

A lot depends on what you want to do with the data and why you have it in memory at all.

You can use a DeflaterOutputStream writing to a ByteArrayOutputStream; the result will be very small, but not very useful.

I suggest using DeflaterOutputStream, as it is more lightweight, faster and smaller than GZIPOutputStream.
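For reference, the class in the JDK is spelled `java.util.zip.DeflaterOutputStream`. The sketch below writes the same newline-separated ASCII numbers through both stream types so the sizes can be compared; the class name `CompressionComparison` and the sample numbers are made up for illustration. GZIP wraps the same deflate data in a larger header and trailer, so its output should come out slightly bigger.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

class CompressionComparison {
    /** Compresses the numbers, one per line, with either GZIP or plain deflate. */
    static byte[] compress(List<String> numbers, boolean gzip) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Closing the stream finishes the compressed data before toByteArray() is called.
        try (OutputStream out = gzip
                ? new GZIPOutputStream(baos)
                : new DeflaterOutputStream(baos)) {
            for (String number : numbers) {
                out.write((number + "\n").getBytes(StandardCharsets.US_ASCII));
            }
        }
        return baos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> numbers = Arrays.asList("5551000", "5551001", "5551002");
        System.out.println("deflate: " + compress(numbers, false).length + " bytes");
        System.out.println("gzip:    " + compress(numbers, true).length + " bytes");
    }
}
```

For a million numbers the fixed header difference is negligible, so the choice mostly matters when many small blocks are compressed separately.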

归途 2024-11-12 17:35:51

Java Strings are UTF-16 internally, not ASCII, so you have to specify the encoding (for example US-ASCII) when converting to or from bytes if you want to manipulate ASCII text.
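Whatever the internal representation, the encoding only comes into play when a String is turned into bytes or back. A small sketch, assuming phone numbers held as Strings; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

class AsciiBytes {
    /** Converts a String to ASCII bytes, rejecting anything outside ASCII. */
    static byte[] toAscii(String number) {
        if (!StandardCharsets.US_ASCII.newEncoder().canEncode(number)) {
            throw new IllegalArgumentException("not pure ASCII: " + number);
        }
        // One byte per character for pure ASCII text.
        return number.getBytes(StandardCharsets.US_ASCII);
    }

    /** Converts ASCII bytes back to a String. */
    static String fromAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```

Passing the charset explicitly avoids depending on the platform default encoding.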

恋竹姑娘 2024-11-12 17:35:50

For in-memory storage as indicated in the question:

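import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

ByteArrayOutputStream baos = new ByteArrayOutputStream();
// Closing the writer finishes the GZIP stream; without it the data is truncated.
try (Writer out = new OutputStreamWriter(
        new GZIPOutputStream(baos), StandardCharsets.US_ASCII)) {
    for (String number : numbers) {
        out.write(number);
        out.write('\n');
    }
}
byte[] data = baos.toByteArray();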

But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.
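To illustrate the limitation: about the only way to query data stored like this is to decompress it and scan it from the start. The sketch below assumes the same layout as above (newline-separated ASCII numbers, GZIP-compressed); the class `GzipNumberStore` and its methods are hypothetical names for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipNumberStore {
    /** Compresses the numbers as newline-separated ASCII text. */
    static byte[] compress(List<String> numbers) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(baos), StandardCharsets.US_ASCII)) {
            for (String number : numbers) {
                out.write(number);
                out.write('\n');
            }
        }
        return baos.toByteArray();
    }

    /** Linear scan: decompresses line by line until the number is found. */
    static boolean contains(byte[] data, String wanted) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.US_ASCII))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(wanted)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Every lookup costs a pass over the compressed data, which is the trade-off against the memory saved.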
