Find and delete orphaned web pages, images, and other related files
I am working on a number of websites with files dating back to 2000. These sites have grown organically over time resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc... These orphaned files cause a number of problems including poor maintainability, possible security holes, poor customer experience, and driving OCD/GTD freaks like myself crazy.
These files number in the thousands so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort in order to ensure we have not inadvertently deleted needed files but I am hoping to develop a technological solution to help speed the manual effort. Additionally, I hope to put processes/utilities in place to help prevent this state of disorganization from happening in the future.
Environment Considerations:
- Classic ASP and .Net
- Windows servers running IIS 6 and IIS 7
- Multiple environments (Dev, Integration, QA, Stage, Production)
- TFS for source control
Before I start I would like to get some feedback from others who have successfully navigated a similar process.
Specifically, I am looking for:
- Process for identifying and cleaning up orphaned files
- Process for keeping environments clean from orphaned files
- Utilities that help identify orphaned files
- Utilities that help identify broken links (once files have been removed)
I am not looking for:
- Solutions to my organizational OCD...I like how I am.
- Snide comments about us still using classic ASP. I already feel the pain. There is no need to rub it in.
4 Answers
At first I thought you could get away with scanning files for links and then doing a diff against your folder structure - but this only identifies simple orphans, not collections of orphaned files that reference each other. So, using grep probably won't get you all the way there.
This isn't a trivial solution, but would make an excellent utility for keeping your environment clean (and therefore, worth the effort). Plus, you can re-use it across all environments (and share it with others!)
The basic idea is to set up and populate a directed graph where each node's key is an absolute path. This is done by scanning all the files and adding their dependencies as edges; see the sketch below.
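A minimal sketch in Python of what that scan could look like. The web root, extension list, and reference regex are assumptions you would tune to your own site; a real scanner would also need to handle virtual directories and query strings:

```python
import os
import re
from collections import defaultdict

WEB_ROOT = r"C:\inetpub\wwwroot"  # assumption: adjust to your site's root
SCANNED_EXTS = {".asp", ".aspx", ".html", ".htm", ".css", ".js", ".inc"}

# Matches src=/href= attributes and Classic ASP #include directives.
REF_PATTERN = re.compile(
    r'(?:src|href)\s*=\s*["\']([^"\']+)["\']'
    r'|<!--\s*#include\s+(?:file|virtual)\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE)

graph = defaultdict(set)  # absolute file path -> absolute paths it references

for dirpath, _, filenames in os.walk(WEB_ROOT):
    for name in filenames:
        if os.path.splitext(name)[1].lower() not in SCANNED_EXTS:
            continue
        path = os.path.join(dirpath, name)
        with open(path, encoding="utf-8", errors="ignore") as f:
            text = f.read()
        for match in REF_PATTERN.finditer(text):
            target = match.group(1) or match.group(2)
            if target.startswith(("http:", "https:", "mailto:", "#", "javascript:")):
                continue  # external references don't affect orphan detection
            if target.startswith("/"):  # root-relative: resolve against WEB_ROOT
                resolved = os.path.normpath(os.path.join(WEB_ROOT, target.lstrip("/")))
            else:                       # relative: resolve against the current file
                resolved = os.path.normpath(os.path.join(dirpath, target))
            graph[path].add(resolved)
```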
Then, you can identify all your "reachable" files by doing a BFS starting from your root page.
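For example, a sketch of that reachability pass over an adjacency dict (the toy paths are made up):

```python
from collections import deque

def reachable(graph, root):
    """Breadth-first search: return every node reachable from the root page."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

# Toy adjacency dict for illustration:
graph = {
    "/default.asp":   {"/about.asp", "/css/main.css"},
    "/about.asp":     {"/css/main.css", "/error.asp"},
    "/old_promo.asp": {"/images/promo.gif"},  # nothing links here: orphaned pair
}
all_files = set(graph) | {t for targets in graph.values() for t in targets}
print(all_files - reachable(graph, "/default.asp"))
# -> {'/old_promo.asp', '/images/promo.gif'}
```

Note that the second file in that pair is exactly the case grep misses: it has an incoming link, but only from another orphan.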
With the directional graph, you can also classify files by their in and out degree. In the example above:
So, you're basically looking for files that have in = 0 that are abandoned.
Additionally, files that have out = 0 are going to be terminal pages; which may or may not be desirable on your site (as error suggests, it's an error page).
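A sketch of that degree classification, using the same kind of adjacency dict as above (toy paths again):

```python
from collections import Counter

def degrees(graph):
    """Return {node: (in_degree, out_degree)} for an adjacency-dict graph."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = Counter(t for targets in graph.values() for t in targets)
    nodes = set(graph) | set(in_deg)
    return {n: (in_deg.get(n, 0), out_deg.get(n, 0)) for n in nodes}

graph = {
    "/default.asp":   {"/about.asp", "/css/main.css"},
    "/about.asp":     {"/css/main.css", "/error.asp"},
    "/old_promo.asp": {"/images/promo.gif"},
}
for node, (i, o) in sorted(degrees(graph).items()):
    note = "  <- in = 0: orphan candidate" if i == 0 and node != "/default.asp" else ""
    print(f"{node}: in={i} out={o}{note}")
# /error.asp and /images/promo.gif have out = 0: terminal pages
```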
Step 1: Establish a list of pages on your site which are definitely visible. One intelligent way to create this list is to parse your log files for pages people visit.
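If your logs are in the IIS W3C extended format, a sketch of that parse might look like this. The log directory is an assumption (IIS 7 defaults to inetpub\logs; IIS 6 defaults to C:\WINDOWS\system32\LogFiles instead):

```python
import glob

LOG_GLOB = r"C:\inetpub\logs\LogFiles\W3SVC1\*.log"  # adjust per server/site

visited = set()
uri_index = None
for log_path in glob.glob(LOG_GLOB):
    with open(log_path, encoding="ascii", errors="ignore") as f:
        for line in f:
            if line.startswith("#Fields:"):
                # The #Fields directive names the columns used on data lines.
                uri_index = line.split()[1:].index("cs-uri-stem")
            elif not line.startswith("#") and uri_index is not None:
                parts = line.split()
                if len(parts) > uri_index:
                    visited.add(parts[uri_index].lower())

print(f"{len(visited)} distinct URLs appear in the logs")
```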
Step 2: Run a tool that recursively finds site topology, starting from a specially written page (that you will make on your site) which has a link to each page in step 1. One tool which can do this is Xenu's Link Sleuth. It's intended for finding dead links, but it will list live links as well. This can be run externally, so there are no security concerns with installing 'weird' software onto your server. You'll need to watch over this occasionally since your site may have infinite pages and the like if you have bugs or whatever.
Step 3: Run a tool that recursively maps your hard disk, starting from your site web directory. I can't think of any of these off the top of my head, but writing one should be trivial, and is safer since this will be run on your server.
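A sketch of such a tool, assuming a conventional wwwroot location; it emits each file as a root-relative URL path so the output can be matched against the crawl in step 4:

```python
import os

WEB_ROOT = r"C:\inetpub\wwwroot"  # assumption: your site's web directory

def files_on_disk(web_root):
    """Yield every file under the web root as a root-relative URL path."""
    for dirpath, _, filenames in os.walk(web_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), web_root)
            yield "/" + rel.replace(os.sep, "/").lower()

for path in sorted(files_on_disk(WEB_ROOT)):
    print(path)
```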
Step 4: Take the results of steps 2 and 3 and programmatically match #2 against #3. Anything in #3 that is not in #2 is potentially an orphan page.
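A sketch of that match, assuming the crawl was exported to a plain text file of URLs and the disk map to a file of paths (both file names are hypothetical):

```python
from urllib.parse import urlparse

# xenu_report.txt: one crawled URL per line (step 2)
# disk_files.txt: one root-relative path per line (step 3)
with open("xenu_report.txt", encoding="utf-8") as f:
    crawled = {urlparse(line.strip()).path.lower() for line in f if line.strip()}

with open("disk_files.txt", encoding="utf-8") as f:
    on_disk = {line.strip().lower() for line in f if line.strip()}

for path in sorted(on_disk - crawled):  # in #3 but not in #2
    print("potential orphan:", path)
```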
Note: This technique works poorly with password-protected stuff, and also works poorly with sites relying heavily on dynamically generated links (dynamic content is fine if the links are consistent).
No snide comments here... I feel your pain as a large portion of our site is still in classic ASP.
I don't know of any fully automated system that would be a magic bullet, but I do have a couple of ideas for what could help. At least, it's how we cleaned up our site.
First, although it hardly seems like the tool for such a job, I've used Microsoft Visio to help with this. We have Visio for Enterprise Architects, and I am not sure whether this feature is in other versions, but in this version you can create a new document, and in the "choose drawing type" dialog, under the "Web Diagram" folder, there is an option for a "Web Site Map" (either Metric or US units - it doesn't matter).
When you create this drawing type, Visio prompts you for the URL of your web site, and then goes out and crawls your web site for you.
This should help to identify which files are valid. It's not perfect, but the way we used it was to find the files in the file system that did not show up in the Visio drawing, and then pull up the entire solution in Visual Studio and do a search for that file name. If we could not find it in the entire solution, we moved it off into an "Obsolete" folder for a month, and deleted it if we didn't start getting complaints or 404 errors on the web site.
Another possible solution would be to use a log file parser, parse your logs for the last n months, and look for missing files that way, but that would essentially be a lot of coding just to come up with a list of "known good" files that's really no better than the Visio option.
Been there, done that many times. Why can't the content types clean up after themselves? Personally, I'd hit it something like this:
1) Get a copy of the site running in a QA environment.
2) Use Selenium (or some other browser-based testing tool) to create a suite of tests for the stuff that works (see the sketch after this list).
3) Start deleting stuff that should be deleted.
4) Run the tests from #2 after deleting stuff to ensure the site still works.
5) Repeat #s 3 & 4 until satisfied.
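A minimal sketch of what a step-2 test might look like with Selenium's Python bindings; the QA host name and page list are placeholders, not part of this answer:

```python
from selenium import webdriver

BASE = "http://qa.example.com"  # hypothetical QA host
PAGES_THAT_MUST_WORK = ["/", "/products.asp", "/contact.asp"]  # known-good pages

driver = webdriver.Firefox()
try:
    for page in PAGES_THAT_MUST_WORK:
        driver.get(BASE + page)
        # Crude smoke test: the page rendered a title and no server error text.
        assert driver.title, f"{page} rendered no title"
        assert "Server Error" not in driver.page_source, f"{page} errored"
        print("OK:", page)
finally:
    driver.quit()
```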