Quickest way to get a list of <title> values from all pages on a localhost website

Posted 2024-07-09 02:54:09

I essentially want to spider my local site and create a list of all the titles and URLs as in:

http://localhost/mySite/Default.aspx      My Home Page
http://localhost/mySite/Preferences.aspx  My Preferences
http://localhost/mySite/Messages.aspx     Messages

I'm running Windows. I'm open to anything that works: a C# console app, PowerShell, some existing tool, etc. We can assume that the <title> tag does exist in the document.

Note: I need to actually spider the files since the title may be set in code rather than markup.

差↓一点笑了 2024-07-16 02:54:09

A quick and dirty Cygwin Bash script which does the job:

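#!/bin/bash
# Print "<path><TAB><title>" for every .aspx file under $WWWROOT.
find "$WWWROOT" -iname '*.aspx' -print0 | while IFS= read -r -d '' file; do
  printf '%s\t' "$file"
  # Join lines so a title split across lines still matches; print only the captured text.
  # (sed -i edits files in place and cannot read from a pipe, so it is dropped here.)
  tr '\n' ' ' < "$file" | sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p'
  echo
done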

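Explanation: this finds every .aspx file under the root directory $WWWROOT, replaces all newlines with spaces so that no line break can fall between <title> and </title>, and then extracts the text between those tags. Note that it reads the .aspx source files on disk rather than the served pages, so it only sees titles written in the markup.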
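
Since the question mentions titles that are set in code, the same tr/sed extraction could be run on the served pages instead of the source files. The following is only a rough sketch under some assumptions: curl is available in the Cygwin install, the page URLs are passed as arguments (link discovery is not handled), and the names extract_title and titles.sh are made up for illustration.

```shell
#!/bin/bash
# Read HTML on stdin and print the text of its <title> element
# (the last one, if the page happens to contain several).
extract_title() {
  tr '\r\n' '  ' | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p'
}

# Hypothetical usage: pass the page URLs as arguments, e.g.
#   ./titles.sh http://localhost/mySite/Default.aspx
for url in "$@"; do
  # curl -s fetches the rendered page, so a title set in code-behind is included.
  printf '%s\t%s\n' "$url" "$(curl -s "$url" | extract_title)"
done
```

Each URL is printed with its title on one tab-separated line, which matches the layout asked for in the question. The match is case-sensitive, so a page that writes <TITLE> in upper case would need the pattern adjusted.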
