Website snapshots over time
I'm a developer for a marketing team, and one of the features that often gets requested is: can we go back to see what our site (or what page X) looked like back in X?
Are there any good solutions for this request?
8 Answers
Source control should be able to solve your request in house. Label things appropriately and have an internal server to deploy that label to, and you should have no issue. If you have an automated deployment tool and choose your labels wisely, it should be relatively simple to write an app that will check out your source at label X and deploy it, with the user only having to enter the label. Now if your labels were something like the date, they would just have to enter the date in the correct format and wait 5 minutes for the deploy.
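As a rough illustration of that label-driven deploy (a sketch only: it assumes a Subversion repository whose tags are named by date and a deploy done with rsync; the repository URL and paths are placeholders, not anything from the original setup):

    #!/bin/sh
    # Deploy the site exactly as it was tagged at label $1 (e.g. 2009-06-01).
    LABEL="$1"
    REPO="http://svn.example.com/site/tags/$LABEL"   # hypothetical repository layout
    WORK="/tmp/site-$LABEL"
    WEBROOT="/var/www/snapshot"                      # the internal "time machine" server

    svn checkout --quiet "$REPO" "$WORK"             # pull the source at that label
    rsync -a --delete "$WORK"/ "$WEBROOT"/           # push it to the internal server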
Have a look at the Wayback Machine. It's not perfect, but there are some embarrassing old sites still in there that I worked on :)
Have you looked at the Wayback Machine at archive.org?
http://www.archive.org/web/web.php
If that doesn't meet your needs, maybe you could automate something with your source control repository that could pull a version for a specific date.
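One hedged way to do that automation, assuming the repository is Subversion (which accepts a date in braces as a revision specifier); the URL and date below are placeholders:

    DATE="2009-06-01"
    # Check out the site as it stood at the last commit on or before $DATE.
    svn checkout -r "{$DATE}" http://svn.example.com/site/trunk "site-as-of-$DATE"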
Similar to what others have suggested (assuming a dynamic website), I would use output caching to generate the web pages' code, and then use Subversion to track the changes.
Using the Wayback Machine is probably only a last resort, such as if someone asks to see a webpage from before you set this system up. One cannot rely on the Wayback Machine to contain everything that one needs.
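A minimal sketch of the Subversion side of this, assuming the output cache is written to disk as files and that directory is already an SVN working copy; both paths are made up for illustration:

    cd /var/cache/site-html || exit 1
    svn add --force . > /dev/null          # register any newly cached pages
    svn commit -m "Nightly output-cache snapshot $(date +%F)"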
My suggestion would be to simply run wget over the site every night and store that on archive.yourdomain.com. Add a control to each page, for those with the appropriate permissions, that passes the URL of the current page to a date picker. Once a date is chosen, load archive.yourdomain.com/YYYYMMDD/original_url.
Letting users browse the entire site without broken links on archive.yourdomain.com might require some URL rewriting, or copying the archived copy of the site from some repository to the root of archive.yourdomain.com. To save disk space, that might be the best option: store the wget copies zipped, then extract the date the user requests. There are some issues with this, such as how you deal with multiple users wanting to view multiple archived pages from different dates at the same time, etc.
I'd suggest that running wget over your site each night is superior to retrieving it from source control, since you would obtain the page as it was shown to WWW visitors, complete with any dynamically served content, errors, omissions, random rotated ads, etc.
EDIT: You could store the wget output in source control; I'm not sure what that would buy you over zipping it up on a file system somewhere outside source control. Also note this plan would use up large amounts of disk space over time, assuming a website of any size.
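Concretely, the nightly job could be something along these lines (a sketch only; www.yourdomain.com and the archive path stand in for the real site and the archive.yourdomain.com document root):

    STAMP=$(date +%Y%m%d)
    DEST="/var/www/archive.yourdomain.com/$STAMP"

    mkdir -p "$DEST"
    # Mirror the site as served to visitors, including page requisites (CSS, images, ...).
    wget --mirror --page-requisites --convert-links \
         --no-host-directories --directory-prefix="$DEST" http://www.yourdomain.com/
    # Keep only a zipped copy to save disk; extract on demand when a date is requested.
    zip -qr "$DEST.zip" "$DEST" && rm -rf "$DEST"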
As Grant says, you could combine wget with revision control for space savings. I am actually trying to write a script to do this for my usual browsing, since I don't trust the Internet Archive or WebCite to be around indefinitely (and they are not very searchable).
The script would go something like this: cd to the directory; invoke the correct wget --mirror command or whatever; run darcs add $(find .) to check any new files into the repository; then darcs record --all. Wget ought to overwrite any changed files with the updated version; darcs add will record any new files/directories; darcs record will save the changes.
To get the view as of date X, you simply pull from your repo all patches up to date X.
You don't store indefinitely many duplicate copies, because DVCSs don't save history unless there are actual changes to file content. You will get 'garbage', in the sense of pages changing so that they no longer require CSS or JS or images you previously downloaded, but you could just periodically delete everything and record that as a patch, and the next wget invocation will only pull in what is needed for the latest version of a webpage. (And you can still do full-text search; it's just that now you search the history rather than the files on disk.)
(If there are big media files being downloaded, you can toss in something like rm $(find . -size +2M) to delete them before they get darcs added.)
EDIT: I wound up not bothering with explicit version control, but letting wget create duplicates and occasionally weeding them with fdupes. See http://www.gwern.net/Archiving%20URLs
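For reference, the darcs-based script described earlier in this answer might look roughly like this (the directory and URL are placeholders, and the repository is assumed to have been created once beforehand with darcs init):

    cd /srv/browsing-archive || exit 1
    wget --mirror --page-requisites http://www.example.com/   # overwrites changed files in place
    rm -f $(find . -size +2M)                     # optional: drop big media files before recording
    darcs add $(find .)                           # register any new files/directories
    darcs record --all -m "snapshot $(date +%F)"  # save the changes as a patch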
The Wayback Machine might be able to help.
Depending on your pages and exactly what you are asking for, you might consider putting copies of the pages in source control.
This probably won't work if your content is in a database, but if they are just HTML pages that you are changing over time, then SCM would be the normal way to do this. The Wayback Machine that everyone mentions is great, but this solution is more company specific, allowing you to capture every nuance of changes over time. You have no control over the Wayback Machine (to my knowledge).
In Subversion, you can set up hooks and automate this. In fact, this might even work if you are using content from a database...
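For instance, a post-commit hook could export each new revision of the pages into a dated snapshot directory. This is only a sketch; the repository layout (trunk/pages) and the snapshot path are assumptions:

    #!/bin/sh
    # hooks/post-commit: Subversion passes the repository path and the new revision number.
    REPOS="$1"
    REV="$2"
    SNAPDIR="/srv/page-history/$(date +%Y%m%d)-r$REV"

    # Export the HTML pages as they stand at this revision (no .svn metadata).
    svn export --quiet -r "$REV" "file://$REPOS/trunk/pages" "$SNAPDIR"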