Linux data warehouse system for user files?

Posted on 2024-07-30 17:23:24

Comments (3)

放飞的风筝 2024-08-06 17:23:25

S3 is an interesting idea here. Use cron to sync files that have not been accessed for over a month up to Amazon S3, then create a web interface that lets users restore the synced files back to the server. Send emails before you move files to S3 and after they are restored.

Limitless storage, only pay for what you use. Not quite an existing open-source project, but not too tough to assemble.
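The selection step of that cron job could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the function names `list_stale_files` and `archive_stale_files` are hypothetical, it assumes GNU `find` and a filesystem that records access times, and the uploader is passed in as a command so the job can be dry-run with `echo` before it is pointed at a real uploader.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the names are illustrative, not from an existing project.

# Print NUL-delimited paths of regular files under DIR whose last access
# is more than DAYS days old (find counts age in whole 24-hour periods).
list_stale_files() {
    local dir=$1 days=$2
    find "$dir" -type f -atime +"$days" -print0
}

# Run UPLOADER (a command plus any leading arguments) once per stale file,
# with the file path appended as the last argument.
archive_stale_files() {
    local dir=$1 days=$2 file
    shift 2
    list_stale_files "$dir" "$days" |
    while IFS= read -r -d '' file; do
        "$@" "$file" || echo "upload failed: $file" >&2
    done
}
```

A dry run such as `archive_stale_files /srv/userfiles 30 echo` only prints the candidates (`/srv/userfiles` is a placeholder path). For a real run, the uploader would probably be a small wrapper function around something like `aws s3 cp`, since that command expects the destination after the source.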

If you need good security, wrap the files in GPG encryption before pushing them to Amazon. GPG is very, very safe.
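As a rough illustration of that encryption step, the sketch below wraps a single file with public-key encryption before it is handed to the uploader. The function name `encrypt_for_archive` is hypothetical, and it assumes the recipient's public key is already imported and trusted in the keyring of the user running the cron job; otherwise `gpg` in batch mode will refuse to encrypt.

```shell
#!/usr/bin/env bash
# Hypothetical helper: encrypt SRC for RECIPIENT and write SRC.gpg next to it.
# RECIPIENT is a key ID or email address whose public key is in the keyring.
encrypt_for_archive() {
    local recipient=$1 src=$2
    # --batch and --yes keep gpg non-interactive, which a cron job needs.
    gpg --batch --yes --recipient "$recipient" \
        --output "${src}.gpg" --encrypt "$src"
}
```

Only the `.gpg` file would then be pushed to Amazon. The private key needed for decryption should stay off the file server if the goal is to protect the data even when that server is compromised.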

A more expensive alternative is to store all the data locally. If you don't want to buy a large disk cluster or a big NAS, you could use HDFS and sync to a cluster that behaves similarly to S3.

You can scale HDFS with commodity hardware. Especially if you already have a couple of old machines and a fast network lying around, this could be much cheaper than a serious NAS, as well as more scalable in size.

Good luck! I look forward to seeing more answers on this.

逐鹿 2024-08-06 17:23:25

-Please- do not upload patient data to S3 (at least not mine).

半仙 2024-08-06 17:23:25

Google 'open source "file lifecycle management"'. I'm sorry, I'm only aware of commercial SAN apps; I don't know whether there are F/OSS alternatives.

The way the commercial apps work is the filesystem appears normal -- all files are present. However, if the file has not been accessed in a certain period (for us, this is 90 days), the file is moved to secondary storage. That is, all but the first 4094 bytes are moved. After a file is archived, if you seek (read) past byte 4094 there is a slight delay while the file is pulled back in from secondary storage. I'm guessing files smaller than 4094 bytes are never sent to secondary storage, but I'd never thought about it.

The only problem with this scheme is if you happen to have something that tries to scan all of your files (a web search index, for example). That tends to pull everything back from secondary storage and fill up primary storage, and the IT folks start giving you the hairy eyeball. (I'm, ahem, speaking from some slight experience.)

You might try asking this over on ServerFault.com.

If you're handy, you might be able to come up with a similar approach using cron and shell scripts. You'd have to replace the 4094-byte stuff with symlinks (and note, the below is not tested).

#!/usr/bin/env bash

# This is the server's local storage, available via network
SOURCE_STORAGE_PATH=/opt/network/mounted/path

# This is the remote big backup mount
TARGET_STORAGE_PATH=/mnt/remote/drive

# This is the number of days after which files are archived
DAYS_TO_ARCHIVE=90

# Find old regular files (-type f skips directories and existing symlinks).
# NUL-delimited output keeps file names with spaces or newlines intact, and
# streaming through the pipe avoids a temp file and a huge word list.
# NOTE: -atime relies on access times; on a noatime mount it will not
# reflect reads.
find "${SOURCE_STORAGE_PATH}" -type f -atime +"${DAYS_TO_ARCHIVE}" -print0 |
while IFS= read -r -d '' FILE; do
    # split source into path and file name
    BASE_PATH=$(dirname "${FILE}")
    FILE_NAME=$(basename "${FILE}")

    # path to target (BASE_PATH is absolute, so it already starts with /)
    TARGET_PATH="${TARGET_STORAGE_PATH}${BASE_PATH}"
    # make sure target exists (-p creates parents, no error if present)
    mkdir -p "${TARGET_PATH}"
    # move source to target; only if that worked,
    # replace source with symlink to target
    mv "${FILE}" "${TARGET_PATH}/" &&
        ln -s "${TARGET_PATH}/${FILE_NAME}" "${FILE}"
done
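
The script above only moves files out; users would also need a way to bring one back. A possible counterpart is sketched below. The function name `restore_archived_file` is hypothetical, it is as untested in production as the archive loop, and it assumes GNU `readlink` and `mv`.

```shell
#!/usr/bin/env bash
# Hypothetical counterpart to the archive loop: if LINK is a symlink left
# behind by archiving, move the file it points at back into its place.
restore_archived_file() {
    local link=$1 target
    [ -L "$link" ] || { echo "not archived: $link" >&2; return 1; }
    target=$(readlink -f -- "$link") || return 1
    [ -f "$target" ] || { echo "archive copy missing: $target" >&2; return 1; }
    # Copy back under a temporary name first, so the symlink survives if the
    # (possibly cross-filesystem) move fails part-way, then swap it in.
    mv -- "$target" "${link}.restoring" &&
        mv -f -- "${link}.restoring" "$link"
}
```

Calling it with the path of the symlink, for example from a web front end or a user-facing command, would leave a regular file where the symlink was and remove the copy on the backup mount.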