Storing millions of log files - about 25 TB per year
As part of my work we get approximately 25 TB worth of log files annually; currently they are saved on an NFS-based filesystem. Some are archived as zipped/tar.gz while others reside in plain text format.
I am looking for alternatives to using an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be changed to JSON to be stored in the DB, something I am not willing to do. I need to retain the log file content as is.
As for usage, we intend to put a small REST API in front (sketched below) and allow people to get a file listing, the latest files, and the ability to fetch a file.
The proposed solution/idea needs to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally and effectively by adding more machines.
Ankur
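For context, a minimal sketch of the kind of REST API described above, assuming the logs stay on a single directory of the current NFS mount. The mount path and the Flask framework are illustrative choices, not part of the original setup:

```python
# Sketch of the small REST API described above, assuming the log files
# live under one directory on the NFS mount. Path and framework (Flask)
# are illustrative, not part of the original setup.
import os
from flask import Flask, jsonify, send_from_directory

LOG_ROOT = "/mnt/nfs/logs"  # hypothetical mount point
app = Flask(__name__)

@app.route("/files")
def list_files():
    # Return every file name under the log root.
    return jsonify({"files": sorted(os.listdir(LOG_ROOT))})

@app.route("/files/latest")
def latest_file():
    # Pick the most recently modified file.
    names = os.listdir(LOG_ROOT)
    latest = max(names, key=lambda n: os.path.getmtime(os.path.join(LOG_ROOT, n)))
    return jsonify({"latest": latest})

@app.route("/files/<path:name>")
def get_file(name):
    # Stream the file back unchanged (works for .gz/.tar.gz and plain text alike).
    return send_from_directory(LOG_ROOT, name)

if __name__ == "__main__":
    app.run()
```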
Comments (5)
Since you don't want querying features, you can use Apache Hadoop.
I believe HDFS and HBase will be a nice fit for this.
You can see a lot of huge-storage stories on the Hadoop "Powered By" page.
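A hedged sketch of what this buys you: HDFS already exposes file listing and retrieval over plain HTTP via WebHDFS, so very little code is needed on top of it. The namenode host/port and the /logs path below are assumptions:

```python
# Minimal sketch of listing and reading logs out of HDFS over WebHDFS.
# The namenode host, port (50070 on older Hadoop releases, 9870 on
# Hadoop 3) and the /logs path are assumptions.
import requests

WEBHDFS = "http://namenode:50070/webhdfs/v1"

def list_logs(path="/logs"):
    # LISTSTATUS returns one FileStatus entry per file or directory.
    r = requests.get(f"{WEBHDFS}{path}", params={"op": "LISTSTATUS"})
    r.raise_for_status()
    return [f["pathSuffix"] for f in r.json()["FileStatuses"]["FileStatus"]]

def read_log(path):
    # OPEN redirects to a datanode and streams the file bytes back as-is.
    r = requests.get(f"{WEBHDFS}{path}", params={"op": "OPEN"})
    r.raise_for_status()
    return r.content

if __name__ == "__main__":
    for name in list_logs():
        print(name)
```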
Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "K-safety" redundancy, so they could dispense with a SAN.
Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real-time querying of the logs. This might be really valuable for your ops team.
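As a hedged illustration of the kind of quasi-real-time query this enables (the table name, columns, connection details and the vertica_python client are assumptions, not details from the Comcast setup):

```python
# Illustrative query against a hypothetical log table in Vertica, using
# the vertica_python client. Connection details and schema are assumptions.
import vertica_python

conn_info = {
    "host": "vertica-host",
    "port": 5433,
    "user": "dbadmin",
    "password": "secret",
    "database": "logs",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # e.g. error counts per host over the last hour
    cur.execute(
        "SELECT host, COUNT(*) FROM log_events "
        "WHERE severity = 'ERROR' AND ts > NOW() - INTERVAL '1 hour' "
        "GROUP BY host ORDER BY 2 DESC"
    )
    for host, errors in cur.fetchall():
        print(host, errors)
```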
Have you tried looking at Gluster (illustrated below)? It is scalable, provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.
http://www.gluster.org/
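To illustrate the "standard file operations" point: once a Gluster volume is mounted, ordinary filesystem calls are all that is needed. The mount point and file names below are hypothetical:

```python
# Once the Gluster volume is mounted (mount point is hypothetical),
# the logs are reachable with ordinary filesystem calls - no extra API layer.
import gzip
import os

MOUNT = "/mnt/gluster/logs"

# List what's there.
for name in sorted(os.listdir(MOUNT)):
    print(name)

# Read an archived file and a plain-text file as-is (names are made up).
with gzip.open(os.path.join(MOUNT, "app-2011-01-01.log.gz"), "rt") as f:
    print(f.readline())

with open(os.path.join(MOUNT, "app-2011-01-02.log")) as f:
    print(f.readline())
```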
I would strongly recommend against using a key/value or document-based store for this data (Mongo, Cassandra, etc.). Use a file system. This is because the files are so large, and the access pattern is going to be a linear scan. One problem that you will run into is retention. Most of the "NoSQL" storage systems use logical deletes, which means that you have to compact your database to remove deleted rows. You'll also have a problem if your individual log records are small and you have to index each one of them - your index will be very large.
Put your data in HDFS with 2-3 way replication in 64 MB blocks, in the same format that it's in now.
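A minimal sketch of loading the files unchanged with the stock `hdfs dfs` CLI driven from Python. The local and HDFS paths and the replication factor are assumptions; the block size is normally set cluster-wide via the dfs.blocksize configuration rather than per command:

```python
# Load log files into HDFS unchanged, using the stock `hdfs dfs` CLI.
# Local/HDFS paths and the replication factor are illustrative.
import subprocess

LOCAL_DIR = "/mnt/nfs/logs"
HDFS_DIR = "/logs/2011"

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
# Copy everything as-is (gz, tar.gz and plain text alike).
subprocess.run(["hdfs", "dfs", "-put", LOCAL_DIR, HDFS_DIR], check=True)
# Ask for 3-way replication on the uploaded tree.
subprocess.run(["hdfs", "dfs", "-setrep", "-R", "3", HDFS_DIR], check=True)
```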
If you are going to choose a document database:
On CouchDB you can use the _attachments API to attach the file as-is to a document (sketched below); the document itself could contain only metadata (like timestamp, location, etc.) for indexing. Then you have a REST API for both the documents and the attachments.
A similar approach is possible with Mongo's GridFS, but you would have to build the API yourself.
HDFS is also a very nice choice.
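A hedged sketch of the CouchDB route using its plain HTTP API (server URL, database name, document ID and file name below are made up, and the `logs` database is assumed to exist already). The log file is stored byte-for-byte as an attachment, and the document holds only metadata:

```python
# Store a log file as-is as a CouchDB attachment, with metadata kept in
# the document itself. Server URL, database and file names are illustrative.
import requests

COUCH = "http://localhost:5984"
DB = "logs"          # assumed to exist (a PUT to /logs creates it)
doc_id = "app-2011-01-01"

# 1. Create the metadata document.
meta = {"timestamp": "2011-01-01T00:00:00Z", "host": "app01", "format": "gzip"}
r = requests.put(f"{COUCH}/{DB}/{doc_id}", json=meta)
r.raise_for_status()
rev = r.json()["rev"]

# 2. Attach the raw file bytes under the current revision.
with open("app-2011-01-01.log.gz", "rb") as f:
    r = requests.put(
        f"{COUCH}/{DB}/{doc_id}/app-2011-01-01.log.gz",
        params={"rev": rev},
        data=f,
        headers={"Content-Type": "application/octet-stream"},
    )
r.raise_for_status()

# 3. The file can now be fetched back unchanged over the same REST API:
#    GET http://localhost:5984/logs/app-2011-01-01/app-2011-01-01.log.gz
```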