Linux data warehousing system for user files?

Posted 2024-07-30 17:23:24

Comments (3)

放飞的风筝 2024-08-06 17:23:25

S3 is an interesting idea here. Use cron to sync files that are not accessed for over 1 month up to Amazon's S3, then create a web interface for users to restore the sync'd files back to the server. Send emails before you move files to S3 and after they are restored.
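
A rough sketch of the cron half, assuming the AWS CLI is installed and using placeholder bucket and path names (the web restore interface and the notification emails are left out):

# Hypothetical nightly cron job -- sketch only, not a tested setup.
# BUCKET and ARCHIVE_ROOT are placeholders; adjust to your environment.
BUCKET=s3://my-archive-bucket
ARCHIVE_ROOT=/home

# Copy files not accessed in the last 30 days up to S3, then remove the local copy.
find "${ARCHIVE_ROOT}" -type f -atime +30 -print0 |
while IFS= read -r -d '' FILE; do
    # (send the "about to archive" email to the owner here, e.g. with mailx)
    aws s3 cp "${FILE}" "${BUCKET}${FILE}" && rm "${FILE}"
done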

Limitless storage, only pay for what you use. Not quite an existing open-source project, but not too tough to assemble.

If you need good security, wrap the files in GPG encryption before pushing them to Amazon. GPG is very, very safe.
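
For example, a minimal sketch (the key ID, file names, and bucket are placeholders) that encrypts to a public key so only the .gpg file ever leaves the server:

# Encrypt to a recipient's public key before the upload (names are placeholders)
gpg --encrypt --recipient admin@example.com bigfile.tar
aws s3 cp bigfile.tar.gpg s3://my-archive-bucket/bigfile.tar.gpg

# Later, after pulling it back down from S3:
gpg --decrypt --output bigfile.tar bigfile.tar.gpg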

A more expensive alternative is to store all the data locally. If you don't want to buy a large disk cluster or big NAS, you could use HDFS:

And sync to a cluster that behaves similarly to S3. You can scale HDFS with commodity hardware. Especially if you already have a couple of old machines and a fast network lying around, this could be much cheaper than a serious NAS, as well as more scalable in size.
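
Roughly, moving files in and out of HDFS looks like this (a sketch using the standard hdfs dfs commands; all paths are placeholders):

# Push an aged file into HDFS, mirroring the source directory layout
hdfs dfs -mkdir -p /archive/home/alice
hdfs dfs -put /home/alice/old-report.pdf /archive/home/alice/

# Later, to restore it:
hdfs dfs -get /archive/home/alice/old-report.pdf /home/alice/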

Good luck! I look forward to seeing more answers on this.

逐鹿 2024-08-06 17:23:25

-Please- do not upload patient data to S3 (at least not mine).

半仙 2024-08-06 17:23:25

Google 'open source "file lifecycle management"'. I'm sorry, I'm only aware of commercial SAN apps; I don't know whether there are F/OSS alternatives.

The way the commercial apps work is the filesystem appears normal -- all files are present. However, if the file has not been accessed in a certain period (for us, this is 90 days), the file is moved to secondary storage. That is, all but the first 4094 bytes are moved. After a file is archived, if you seek (read) past byte 4094 there is a slight delay while the file is pulled back in from secondary storage. I'm guessing files smaller than 4094 bytes are never sent to secondary storage, but I'd never thought about it.

The only problem with this scheme is if you happen to have something that tries to scan all of your files (a web search index, for example). That tends to pull everything back from secondary storage, fills up primary, and the IT folks start giving you the hairy eyeball. (I'm, ahem, speaking from some slight experience.)

You might try asking this over on ServerFault.com.

If you're handy, you might be able to come up with a similar approach using cron and shell scripts. You'd have to replace the 4094-byte stuff with symlinks (and note, the below is not tested).

#!/bin/bash
# This is the server's local storage, available via network
SOURCE_STORAGE_PATH=/opt/network/mounted/path

# This is the remote big backup mount
TARGET_STORAGE_PATH=/mnt/remote/drive

# This is the number of days to start archiving files
DAYS_TO_ARCHIVE=90

# Find old regular files (symlinks are excluded by -type f), using a temp file
TEMP_FILE=$(mktemp)
find "${SOURCE_STORAGE_PATH}" -type f -atime +${DAYS_TO_ARCHIVE} > "${TEMP_FILE}"

# This probably needs to change, if too many files in TEMP_FILE...
# this would be a good point to drop into something like Perl.
# The read loop and quoting handle spaces in file names.
while IFS= read -r FILE; do
    # split source into path and file name
    BASE_PATH=$(dirname "${FILE}")
    FILE_NAME=$(basename "${FILE}")

    # path to target, mirroring the source directory layout
    TARGET_PATH="${TARGET_STORAGE_PATH}${BASE_PATH}"
    # make sure target exists (note -p option to mkdir)
    [ -d "${TARGET_PATH}" ] || mkdir -p "${TARGET_PATH}"
    # move source to target
    mv "${FILE}" "${TARGET_PATH}/"
    # replace source with a symlink to the target
    ln -s "${TARGET_PATH}/${FILE_NAME}" "${FILE}"
done < "${TEMP_FILE}"
rm -f "${TEMP_FILE}"
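
Restoring would be roughly the reverse (again untested, same placeholder paths as above): check that the path is a symlink, remove it, and move the archived copy back.

# Restore a single archived file -- untested sketch
FILE=/opt/network/mounted/path/some/old/file
if [ -L "${FILE}" ]; then
    ARCHIVED=$(readlink "${FILE}")
    rm "${FILE}"
    mv "${ARCHIVED}" "${FILE}"
fi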