I'm trying to figure out the best way to store user-uploaded files in a file system. The files range from personal files to wiki files. Of course, the DB will point to those files in some way that I have yet to figure out.
Basic requirements:
- Fairly decent security, so people can't guess filenames (Picture001.jpg, Picture002.jpg, Music001.mp3 is a big no-no)
- Easily backed up and mirrorable (I'd prefer not to have to copy the entire HDD every single time I want to back up. I like the idea of backing up just the newest items, but I'm flexible with the options here.)
- Scalable to millions of files on multiple servers if needed.
One technique is to store the data in files named after the hash (SHA1) of their contents. Such names are not easily guessable, any backup program should be able to handle them, and the files are easily sharded (by storing hashes starting with 0 on one machine, hashes starting with 1 on the next, etc.).
The database would contain a mapping between the user's assigned name and the SHA1 hash of the contents.
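As a rough illustration of the idea, here is a minimal Python sketch. The function names, the in-memory `name_to_hash` dict (standing in for the database table), and the server list are all assumptions made for the example, not part of any particular system.

```python
import hashlib
import os
import tempfile


def store_by_content_hash(data, root):
    """Write data to a file named after its SHA-1 hex digest; return the digest."""
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(root, digest)
    if not os.path.exists(path):  # identical content is only stored once
        with open(path, "wb") as f:
            f.write(data)
    return digest


def shard_for(digest, servers):
    """Pick a server from the first hex digit of the hash (hypothetical scheme)."""
    return servers[int(digest[0], 16) % len(servers)]


# Stand-in for the database mapping: user-assigned name -> content hash.
name_to_hash = {}

root = tempfile.mkdtemp()
name_to_hash["holiday.jpg"] = store_by_content_hash(b"example image bytes", root)
```

Because the file name depends only on the contents, two users uploading the same file end up sharing one stored copy, and the database keeps their separate names.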
GUIDs for filenames, with an automatically expanding folder hierarchy that holds no more than a couple of thousand files/folders in each folder. Backing up new files is done by backing up new folders.
You haven't indicated what environment and/or programming language you are using, but here's a C# / .NET / Windows example:
Use a SHA1 hash of the filename plus a salt (or, if you want, of the file contents; that makes detecting duplicate files easier, but also puts a LOT more stress on the server). This may need some tweaking to be unique (e.g. add the uploader's user ID or a timestamp), and the salt is there to make it not guessable.
The folder structure is then built from parts of the hash.
For example, if the hash is "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12" then the folders could be:
This is to prevent large folders (some operating systems have trouble enumerating folders with a million files), hence making a few subfolders from parts of the hash. How many levels? That depends on how many files you expect, but 2 or 3 is usually reasonable.
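One possible way to derive such a name and folder path is sketched below in Python. The helper names, the way the salt, user ID and timestamp are joined, and the choice of two levels of two-character folders are assumptions for illustration only, not necessarily the layout intended above.

```python
import hashlib
import os


def salted_name(filename, user_id, timestamp, salt):
    """SHA-1 of the filename mixed with a secret salt, user ID and timestamp."""
    material = f"{salt}:{user_id}:{timestamp}:{filename}".encode("utf-8")
    return hashlib.sha1(material).hexdigest()


def hashed_path(root, digest, levels=2, width=2):
    """Build root/<part1>/<part2>/.../<digest> from leading slices of the hash."""
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, digest)


path = hashed_path("uploads", "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12")
# With the defaults, the file lands under the folders "2f" and then "d4".
```

With two hex characters per level, each folder has at most 256 subfolders, so two levels spread the files over 65,536 leaf folders; adding a third level multiplies that by another 256.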
Just in terms of one aspect of your question (security): the best way to safely store uploaded files in a filesystem is to ensure the uploaded files are out of the webroot (i.e., you can't access them directly via a URL - you have to go through a script).
This gives you complete control over what people can download (security) and allows for things such as logging. Of course, you have to ensure the script itself is secure, but it means only the people you allow will be able to download certain files.
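The core of such a download script might look roughly like the Python sketch below. The `records` dict (standing in for a database lookup), its `stored_name` and `allowed_users` fields, and the `/srv/uploads` directory are hypothetical names chosen for the example.

```python
import os

STORAGE_ROOT = "/srv/uploads"  # hypothetical directory outside the webroot


def resolve_download(user, file_id, records, storage_root=STORAGE_ROOT):
    """Return the on-disk path if `user` may download `file_id`, else None."""
    record = records.get(file_id)
    if record is None or user not in record["allowed_users"]:
        return None  # unknown file or user not permitted
    root = os.path.realpath(storage_root)
    path = os.path.realpath(os.path.join(root, record["stored_name"]))
    # Refuse anything that resolves outside the storage directory.
    if os.path.commonpath([root, path]) != root:
        return None
    print(f"download: user={user} file={file_id}")  # simple logging hook
    return path
```

The web framework's handler would call this, then stream the file at the returned path or answer with an error when it gets `None`.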
Expanding on Phill Sacre's answer, another aspect of security is to use a separate domain name for uploaded files (for instance, Wikipedia uses upload.wikimedia.org), and make sure that domain cannot read any of your site's cookies. This prevents people from uploading an HTML file with a script to steal your users' session cookies. Simply setting the Content-Type header isn't enough, because some browsers are known to ignore it and guess based on the file's contents; HTML can also be embedded in other kinds of files, so it's not trivial to check for it and disallow it.