URL tree walker in Python?

Posted on 2024-07-16 17:10:05

For URLs that present file trees, such as PyPI package indexes, is there a small, solid module to walk the URL tree and list it like ls -lR?

I gather (correct me if I'm wrong) that there's no standard encoding of file attributes, link types, size, date ... in the HTML <a> attributes, so building a solid URL-tree module on such shifting sands is tough. But surely this wheel (Unix file tree -> HTML -> treewalk API -> ls -lR or find) has been invented already?

(There seem to be several spiders / web crawlers / scrapers out there, but so far they look ugly and ad hoc, despite using BeautifulSoup for parsing.)
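
For concreteness, a rough sketch of the kind of os.walk-style interface being asked for, assuming Apache/PyPI-style index pages where sub-directory links end in "/". The name walk_url and the link-filtering rules below are illustrative only, not an existing module:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class _LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags on an index page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # skip column-sort links ("?C=N;O=D"), anchors, absolute paths and the parent link
            if href and not href.startswith(("?", "#", "/")) and href != "../":
                self.links.append(href)

def walk_url(url):
    """Yield (url, dirs, files) tuples like os.walk, treating hrefs ending in '/' as directories."""
    collector = _LinkCollector()
    collector.feed(urlopen(url).read().decode("utf-8", "replace"))
    dirs = [h for h in collector.links if h.endswith("/")]
    files = [h for h in collector.links if not h.endswith("/")]
    yield url, dirs, files
    for d in dirs:
        yield from walk_url(urljoin(url, d))

This recovers only names, not sizes or dates; those are exactly the attributes with no standard encoding, and the answers below scrape them out of Apache's fancy-index markup.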

Comments (3)

牛↙奶布丁 2024-07-23 17:10:05

Apache servers are very common, and they have a relatively standard way of listing file directories.

Here's a simple enough script that does roughly that; you should be able to adapt it to do exactly what you want.

Usage: python list_apache_dir.py <url> [<url> ...]

import re
import sys
from urllib.request import urlopen

# look for a link + a timestamp + a size ('-' marks a directory)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        # index pages are small, so read and decode the whole listing at once
        html = urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:  # URLError / HTTPError are subclasses of OSError
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print(url + ' :')
    print('%4d file' % len(files) + 's' * (len(files) != 1))
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]
        print('%5s  %s  %s' % (size, date, name))

    # recurse into each sub-directory found in the listing
    for subdir in dirs:
        print()
        list_apache_dir(url + subdir)

for url in sys.argv[1:]:
    print()
    list_apache_dir(url)
写给空气的情书 2024-07-23 17:10:05

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

It has CSS selectors as well so this sort of thing is trivial.
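
For illustration, a minimal sketch of that route, assuming lxml (with the cssselect package) is installed and an Apache-style listing rendered as a table; the URL and the column positions are placeholder assumptions, and for <pre>-style listings you would select "a" elements instead:

from urllib.request import urlopen
import lxml.html

url = "https://example.com/packages/"            # hypothetical index page
doc = lxml.html.fromstring(urlopen(url).read())
doc.make_links_absolute(url)                     # resolve relative hrefs against the page URL

# Assumes an Apache fancy index rendered as a table (IndexOptions HTMLTable):
# icon | Name | Last modified | Size | Description
for row in doc.cssselect("table tr"):
    cells = [td.text_content().strip() for td in row.cssselect("td")]
    if len(cells) >= 4 and cells[1]:             # skip the <th> header row and <hr> separator rows
        name, modified, size = cells[1], cells[2], cells[3]
        print("%8s  %s  %s" % (size, modified, name))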

财迷小姐 2024-07-23 17:10:05

Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python --

from bs4 import BeautifulSoup   # bs4; the original answer used the old "from BeautifulSoup import BeautifulSoup"

def trow_cols(trow):
    """ soup.table( "tr" ) -> <td> strings like
        [None, 'Achoo-1.0-py2.5.egg', '11-Aug-2008 07:40  ', '8.9K']
    """
    return [td.next.string for td in trow("td")]

def trow_headers(trow):
    """ soup.table( "tr" ) -> <th> table header strings like
        [None, 'Name', 'Last modified', 'Size', 'Description']
    """
    return [th.next.string for th in trow("th")]

if __name__ == "__main__":
    ...  # fetch the directory-listing page into `html` here (elided in the original answer)
    soup = BeautifulSoup(html, "html.parser")
    if soup.table:
        trows = soup.table("tr")
        print("headers:", trow_headers(trows[0]))
        for row in trows[1:]:
            print(trow_cols(row))

Compared to sysrqb's one-line regexp in the first answer above, this is ... longer; who said

"You can parse some of the html all of the time, or all of the html some of the time, but not ..."
