Scalable Image Storage
I'm currently designing an architecture for a web-based application that should also provide some kind of image storage. Users will be able to upload photos as one of the key features of the service. Viewing these images (via the web) will also be one of the primary uses.
However, I'm not sure how to build such a scalable image storage component into my application. I have already thought about different solutions, but since I lack experience, I look forward to hearing your suggestions. Aside from the images, metadata must also be saved.
Here are my initial thoughts:
1. Use a (distributed) filesystem like HDFS and set up dedicated web servers as "filesystem clients" to save uploaded images and serve requests. Image metadata is saved in an additional database, including the file path for each image.
2. Use a BigTable-oriented system like HBase on top of HDFS and save images and metadata together. Again, web servers bridge image uploads and requests.
3. Use a completely schemaless database like CouchDB for storing both images and metadata. Additionally, use the database itself for upload and delivery via its HTTP-based RESTful API. (Additional question: CouchDB saves blobs via Base64. Can it nevertheless return data as image/jpeg etc.?)
Comments (11)
We have been using CouchDB for that, saving images as "attachments". But after a year, the multi-dozen-GB CouchDB database files turned out to be a headache. For example, CouchDB replication still has issues if you use it with very large documents.
So we just rewrote our software to use CouchDB for image information and Amazon S3 for the actual image storage. The code is available at http://github.com/hudora/huImages
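To make that split concrete, here is a minimal sketch of keeping the image bytes in one store and the metadata in another. It is not taken from huImages: the two dictionaries are hypothetical in-memory stand-ins for an S3 bucket and a CouchDB database, and using a SHA-256 content hash as the shared key is my own assumption.

```python
import hashlib

# Hypothetical in-memory stand-ins: in a real deployment `blob_store` would be
# an S3 bucket and `metadata_db` a CouchDB database.
blob_store = {}
metadata_db = {}


def save_image(data, content_type, title):
    """Store the bytes under a content-derived key and the metadata separately."""
    key = hashlib.sha256(data).hexdigest()
    blob_store[key] = data
    metadata_db[key] = {
        "_id": key,
        "title": title,
        "content_type": content_type,
        "size": len(data),
    }
    return key


def load_image(key):
    """Return (bytes, metadata) for a previously stored image."""
    return blob_store[key], metadata_db[key]
```

Because the metadata document only carries the key, the database stays small and replicates quickly, while the large binaries live wherever bulk storage is cheapest.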
You might want to set up an Amazon S3-compatible storage service on-site for your project. This keeps you flexible and leaves the Amazon option open without requiring external services for now. Walrus seems to be becoming the most popular and scalable S3 clone.
I also urge you to look into the design of LiveJournal, with their excellent open-source MogileFS and Perlbal offerings. This combination is probably the most famous image-serving setup.
The Flickr architecture can also be an inspiration, although they don't offer open-source software to the public like LiveJournal does.
"Additional question: CouchDB does save blobs via Base64."
CouchDB does not save blobs as Base64; they are stored as straight binary. When retrieving a JSON document with ?attachments=true, we do convert the on-disk binary to Base64 in order to add it safely to JSON, but that's just a presentation-level thing. See Standalone Attachments.
CouchDB serves attachments with the content type they were stored with. It's possible, in fact common, to serve HTML, CSS and GIF/PNG/JPEG attachments directly to browsers.
Attachments can be streamed and, in CouchDB 1.1, even support the Range header (for media streaming and/or resuming an interrupted download).
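As a rough illustration of the two forms described above, the sketch below builds (without sending) a standalone attachment upload that carries raw bytes, and a JSON document body that carries the same bytes inline as Base64. The server URL, database name, document ID and revision are placeholders I made up; only the standard library is used.

```python
import base64
import json
import urllib.request

COUCH_URL = "http://localhost:5984"  # assumption: a local CouchDB instance


def standalone_attachment_request(db, doc_id, rev, name, data, content_type):
    """Build (but do not send) a PUT that uploads raw bytes as an attachment."""
    url = f"{COUCH_URL}/{db}/{doc_id}/{name}?rev={rev}"
    return urllib.request.Request(
        url, data=data, method="PUT", headers={"Content-Type": content_type}
    )


def inline_attachment_doc(name, data, content_type):
    """Build a JSON document body that carries the attachment inline as Base64."""
    return json.dumps({
        "_attachments": {
            name: {
                "content_type": content_type,
                "data": base64.b64encode(data).decode("ascii"),
            }
        }
    })
```

Passing the request to urllib.request.urlopen would perform the upload against a running server. A later GET on the same attachment URL should then return the raw bytes with the stored content type, with no Base64 involved.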
Use Seaweed-FS (formerly called Weed-FS), an implementation of Facebook's Haystack paper.
Seaweed-FS is very flexible and pared down to the basics. It was created to store billions of images and serve them fast.
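For readers unfamiliar with the Haystack idea, here is a toy sketch of its core trick: many small images are appended to one large volume file, and an in-memory index maps each image ID to an offset and size, so a read is one seek instead of a per-file filesystem lookup. This is only an illustration; the class name is hypothetical and it is not Seaweed-FS's actual on-disk format or API.

```python
import io


class ToyVolume:
    """Append-only volume: many images in one file, located via an in-memory index."""

    def __init__(self):
        self._file = io.BytesIO()  # stand-in for one large file on disk
        self._index = {}           # image id -> (offset, size)

    def put(self, image_id, data):
        """Append the bytes to the end of the volume and remember where they are."""
        offset = self._file.seek(0, io.SEEK_END)
        self._file.write(data)
        self._index[image_id] = (offset, len(data))

    def get(self, image_id):
        """Read the bytes back with a single seek and read."""
        offset, size = self._index[image_id]
        self._file.seek(offset)
        return self._file.read(size)
```

A real system also has to deal with deletes, compaction, replication and crash recovery, all of which this sketch ignores.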
Have you considered Amazon Web Services? S3 is web-based file storage, and SimpleDB is a key->attribute store. Both are performant and highly scalable. It's more expensive than maintaining your own servers and setups (assuming you are going to do it yourself and not hire people), but you get up and running much more quickly.
Edit: I take that back. It's more expensive in the long run at high volumes, but at low volume it beats the initial cost of buying hardware.
S3: http://aws.amazon.com/s3/ (you could store your image files here, and for performance maybe have an image cache on your server, or maybe not)
SimpleDB: http://aws.amazon.com/simpledb/ (metadata could go here: image id mapping to whatever data you want to store)
Edit 2: I didn't even know about this, but there is a new web service called Amazon CloudFront (http://aws.amazon.com/cloudfront/). It is for fast web content delivery, and it integrates well with S3. Kind of like Akamai for your images. You could use this instead of the image cache.
We use MogileFS. We're small scale users with less than 8TB and some 50 million files. We switched from storing in Amazon S3 some years ago to get better control of file names and performance.
It's not the prettiest software, but it's very "field tested" and basically all users are using it the same way you will be.
As part of Cloudant, I don't want to push product... but BigCouch solves this problem in my science application stack (physics, nothing to do with Cloudant, and certainly nothing to do with profit!). It marries the simplicity of the CouchDB design with the auto-sharding and scalability that is missing in single-server CouchDB. I generally use it to store a smaller number of big files (multi-GB) and a large number of small files (100MB or less). I was using S3, but the GET costs actually start to add up for small files that are repeatedly accessed.
Maybe have a look at the description of Facebook's Haystack:
Needle in a haystack: efficient storage of billions of photos
Ok, if all that AWS stuff isn't going to work, here are a couple of thoughts.
As far as (3) goes, if you put binary data into a database, the same data is going to come out. What makes it a JPEG is the format of the data, not what the database thinks it is. What makes the client (web browser) think it's a JPEG is setting the Content-Type header to image/jpeg. You could also set it to something else (not recommended), like text, and that's how the browser would try to interpret it.
For on-disk storage, I like CouchDB for its simplicity, but HDFS would certainly work. Here's a link to a post about serving image content from CouchDB: http://japhr.blogspot.com/2009/04/render-couchdb-images-via-sinatra.html
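Since the format lives in the bytes themselves, a server can work out which Content-Type to send by looking at the leading "magic" bytes. The helper below is a minimal sketch of that idea; the function name is hypothetical and it only recognises JPEG, PNG and GIF.

```python
def sniff_content_type(data):
    """Guess a Content-Type from the leading magic bytes of an image."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return "application/octet-stream"  # unknown: let the client download it
```

In practice you would probably detect the type once at upload time and store it with the metadata, rather than sniffing on every request.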
Edit: here's a link to a useful discussion about caching images in memcached vs serving them from disk under linux/apache.
I've been experimenting with some of the _update functionality available to CouchDB view servers in my Python view server.
One really cool thing I did was an update function for image uploads so that I could use PIL to create thumbnails and other related images and attach them to the document when they get pushed to CouchDB.
This might be useful if you need image manipulation and want to cut down on the amount of code and infrastructure you need to keep up.
I've written an image store on top of Cassandra. We have a lot of writes and random reads, and our read/write ratio is low. For a high read/write ratio, I suggest MongoDB (GridFS).
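For context, GridFS works by splitting each file into fixed-size chunk documents and reassembling them on read. The sketch below imitates that with plain dictionaries and no MongoDB driver; the function names are hypothetical, and the 255 KiB chunk size is what I understand the current default to be, so treat it as an assumption.

```python
CHUNK_SIZE = 255 * 1024  # assumed default GridFS chunk size (255 KiB)


def split_into_chunks(file_id, data, chunk_size=CHUNK_SIZE):
    """Split the bytes into numbered chunk documents, GridFS-style."""
    return [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]


def join_chunks(chunks):
    """Reassemble the original bytes from chunk documents in any order."""
    ordered = sorted(chunks, key=lambda chunk: chunk["n"])
    return b"".join(chunk["data"] for chunk in ordered)
```

Most web images fit into one or two chunks at this size, so a read is usually a single document fetch.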
Here is an example of storing blob images in CouchDB using PHP Laravel.
In this example, I am storing three images based on user requirements.
Establishing the connection to CouchDB.
You can store a single image in the same way.