URL tree walker in Python?

Posted on 2024-07-16 17:10:05

For URLs that present file trees, such as PyPI package indexes, is there a small, solid module to walk the URL tree and list it like ls -lR?

I gather (correct me if I'm wrong) that there's no standard encoding of file attributes, link types, size, date ... in the HTML <a> attributes, so building a solid URL-tree module on such shifting sands is tough. But surely this wheel (Unix file tree -> HTML -> treewalk API -> ls -lR or find) has been invented already?

(There seem to be several spiders / web crawlers / scrapers out there, but so far they look ugly and ad hoc, despite using BeautifulSoup for parsing.)
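
For concreteness, a rough sketch of the kind of os.walk-style interface being asked for, assuming Apache/PyPI-style index pages where sub-directory links end in "/". The name walk_url and the link-filtering rules below are illustrative only, not an existing module:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class _LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags on an index page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # skip column-sort links ("?C=N;O=D"), anchors, absolute paths and the parent link
            if href and not href.startswith(("?", "#", "/")) and href != "../":
                self.links.append(href)

def walk_url(url):
    """Yield (url, dirs, files) tuples like os.walk, treating hrefs ending in '/' as directories."""
    collector = _LinkCollector()
    collector.feed(urlopen(url).read().decode("utf-8", "replace"))
    dirs = [h for h in collector.links if h.endswith("/")]
    files = [h for h in collector.links if not h.endswith("/")]
    yield url, dirs, files
    for d in dirs:
        yield from walk_url(urljoin(url, d))

This recovers only names, not sizes or dates; those are exactly the attributes with no standard encoding, and the answers below scrape them out of Apache's fancy-index markup.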

Comments (3)

牛↙奶布丁 2024-07-23 17:10:05

Apache servers are very common, and they have a relatively standard way of listing file directories.

Here's a simple enough script that does roughly that; you should be able to adapt it to do exactly what you want.

Usage: python list_apache_dir.py <url> [<url> ...]

import re
import sys
from urllib.request import urlopen

# look for a link + a timestamp + a size ('-' marks a directory)
parse_re = re.compile(r'href="([^"]*)".*(..-...-.... ..:..).*?(\d+[^\s<]*|-)')

def list_apache_dir(url):
    try:
        # index pages are small, so read and decode the whole listing at once
        html = urlopen(url).read().decode('utf-8', 'replace')
    except OSError as e:  # URLError / HTTPError are subclasses of OSError
        print('error fetching %s: %s' % (url, e))
        return
    if not url.endswith('/'):
        url += '/'
    files = parse_re.findall(html)
    dirs = []
    print(url + ' :')
    print('%4d file' % len(files) + 's' * (len(files) != 1))
    for name, date, size in files:
        if size.strip() == '-':
            size = 'dir'
        if name.endswith('/'):
            dirs += [name]
        print('%5s  %s  %s' % (size, date, name))

    # recurse into each sub-directory found in the listing
    for subdir in dirs:
        print()
        list_apache_dir(url + subdir)

for url in sys.argv[1:]:
    print()
    list_apache_dir(url)
写给空气的情书 2024-07-23 17:10:05

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

It has CSS selectors as well so this sort of thing is trivial.
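
For illustration, a minimal sketch of that route, assuming lxml (with the cssselect package) is installed and an Apache-style listing rendered as a table; the URL and the column positions are placeholder assumptions, and for <pre>-style listings you would select "a" elements instead:

from urllib.request import urlopen
import lxml.html

url = "https://example.com/packages/"            # hypothetical index page
doc = lxml.html.fromstring(urlopen(url).read())
doc.make_links_absolute(url)                     # resolve relative hrefs against the page URL

# Assumes an Apache fancy index rendered as a table (IndexOptions HTMLTable):
# icon | Name | Last modified | Size | Description
for row in doc.cssselect("table tr"):
    cells = [td.text_content().strip() for td in row.cssselect("td")]
    if len(cells) >= 4 and cells[1]:             # skip the <th> header row and <hr> separator rows
        name, modified, size = cells[1], cells[2], cells[3]
        print("%8s  %s  %s" % (size, modified, name))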

财迷小姐 2024-07-23 17:10:05

Turns out that BeautifulSoup one-liners like these can turn <table> rows into Python --

from bs4 import BeautifulSoup   # bs4; the original answer used the old "from BeautifulSoup import BeautifulSoup"

def trow_cols(trow):
    """ soup.table( "tr" ) -> <td> strings like
        [None, 'Achoo-1.0-py2.5.egg', '11-Aug-2008 07:40  ', '8.9K']
    """
    return [td.next.string for td in trow("td")]

def trow_headers(trow):
    """ soup.table( "tr" ) -> <th> table header strings like
        [None, 'Name', 'Last modified', 'Size', 'Description']
    """
    return [th.next.string for th in trow("th")]

if __name__ == "__main__":
    ...  # fetch the directory-listing page into `html` here (elided in the original answer)
    soup = BeautifulSoup(html, "html.parser")
    if soup.table:
        trows = soup.table("tr")
        print("headers:", trow_headers(trows[0]))
        for row in trows[1:]:
            print(trow_cols(row))

Compared to sysrqb's one-line regexp in the first answer above, this is ... longer; who said

"You can parse some of the html all of the time, or all of the html some of the time, but not ..."
