Is Cassandra a good fit for storing logs, in terms of disk space usage?

Posted 2024-09-07 00:04:49

I have a problem storing 50 GB of logs each day in a distributed environment. I looked at Hadoop HDFS, but because it has problems running on Windows infrastructure and lacks a multi-language filesystem API, it doesn't suit me very well. Cassandra, on the other hand, is very easy to deploy on any platform. The only big problem I'm facing is disk space usage. Here are the figures:

  • Original log size is 224 MB
  • Cassandra data file is 557 MB
  • Cassandra index file is 109 MB

So I got almost 2x overhead when storing log lines from a log file.

Is it possible to tune Cassandra in some way so it won't eat so much disk space for very simple scenarios?
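
For reference, a quick sketch of the arithmetic behind the figures above, where "overhead" means extra space beyond the original logs:

    # Disk usage figures from the question, in MB.
    original = 224
    data_file = 557
    index_file = 109

    total_on_disk = data_file + index_file            # 666 MB
    total_ratio = total_on_disk / original            # ~2.97x the raw log size
    overhead = (total_on_disk - original) / original  # ~1.97x extra, i.e. "almost 2x overhead"

    print(f"total on disk: {total_on_disk} MB ({total_ratio:.2f}x the raw logs)")
    print(f"overhead beyond the raw logs: {overhead:.2f}x")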

Comments (1)

自在安然 2024-09-14 00:04:49

I guess you mean one row (with four columns) inside your column family? The "overhead" associated with each column is a long (timestamp, 64 bits) and a byte[] (column name, max 64 KB). So 4x disk usage seems a little bit weird. Are you doing any deletes? Be sure to understand how deletes are done in a distributed, eventually consistent system.
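
To make that per-column overhead concrete, here is a rough back-of-the-envelope sketch. The 8-byte timestamp and the column-name bytes come from the description above; the small per-column and per-row constants, and the example field names, are assumptions for illustration only (the real SSTable layout differs between Cassandra versions):

    # Rough estimate of on-disk size for one log line stored as one row
    # with several columns. Only the 8-byte timestamp and the column-name
    # bytes are taken from the answer above; the misc constants are guesses.

    TIMESTAMP_BYTES = 8  # 64-bit timestamp stored with every column

    def estimated_column_size(name: bytes, value: bytes) -> int:
        PER_COLUMN_MISC = 7  # assumed: length/flag fields around each column
        return TIMESTAMP_BYTES + len(name) + len(value) + PER_COLUMN_MISC

    def estimated_row_size(row_key: bytes, columns: dict) -> int:
        PER_ROW_MISC = 32  # assumed: row header, index and bloom-filter share
        return len(row_key) + PER_ROW_MISC + sum(
            estimated_column_size(n, v) for n, v in columns.items()
        )

    # Hypothetical log line split into four columns.
    row = {
        b"ts": b"2010-07-01T12:00:00Z",
        b"host": b"web-01",
        b"level": b"INFO",
        b"msg": b"GET /index.html 200 123ms",
    }
    print(estimated_row_size(b"web-01:2010-07-01T12:00:00Z", row), "bytes (rough estimate)")

With short fields like these, the repeated column names and the fixed per-column bytes can rival the payload itself, which is one way a 2-3x blow-up relative to the raw log text can appear.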

Be sure to read about "compactions" also. ("Once compaction is finished, the old SSTable files may be deleted")
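
As a side note, a major compaction can also be kicked off by hand. A minimal sketch, assuming nodetool is on the PATH and a reasonably recent Cassandra where the syntax is nodetool compact <keyspace> <table>; the keyspace and table names here are placeholders:

    import subprocess

    # Trigger a major compaction for one table so obsolete SSTables can be
    # merged and their old files removed. Names below are placeholders.
    subprocess.run(["nodetool", "compact", "logs_keyspace", "log_lines"], check=True)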

I'd also like to remind you of a Thrift limitation regarding how streaming is done.

Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values. (From the 'Cassandra Limitations' page on the wiki)
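
A minimal sketch of that chunking workaround, assuming the pycassa Thrift client, a keyspace called LogKeyspace and a column family called FileChunks that already exist (none of these names come from the original post):

    import pycassa

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned in the quote

    # Connection details and schema names are assumptions for illustration.
    pool = pycassa.ConnectionPool('LogKeyspace', ['localhost:9160'])
    chunks_cf = pycassa.ColumnFamily(pool, 'FileChunks')

    def store_file(path):
        """Store one file as a single row: row key = file path,
        column name = zero-padded chunk index, column value = chunk bytes."""
        with open(path, 'rb') as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                # NOTE: values this large may require raising Thrift
                # frame/message size limits on both client and server.
                chunks_cf.insert(path, {'%010d' % index: chunk})
                index += 1

    store_file('/var/log/app/2010-07-01.log')

Reading the file back is the mirror image: fetch the row's columns in order and concatenate the values.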
