Here's a short snippet using the SoupStrainer class in BeautifulSoup:
import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
The BeautifulSoup documentation is actually quite good, and covers a number of typical scenarios: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Edit: Note that I used the SoupStrainer class because it's a bit more efficient (memory- and speed-wise), if you know what you're parsing in advance.
For completeness' sake, the BeautifulSoup 4 version, making use of the encoding supplied by the server as well:
from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])
or the Python 2 version:
from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']
and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])
The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
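To illustrate the difference, a minimal sketch with made-up markup:

from bs4 import BeautifulSoup

# One real link plus a named anchor without an href.
html = '<a href="/about">About</a> <a name="top">Top</a>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a', href=True))  # only [<a href="/about">About</a>]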
BeautifulSoup 3 stopped development in March 2012; new projects really should use BeautifulSoup 4, always.
Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can inform BeautifulSoup of the character set found in the HTTP response headers to assist in decoding, but this can be wrong, or conflict with the <meta> header info found in the HTML itself, which is why the above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding() to make sure that such embedded encoding hints win over a misconfigured server.
With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* mimetype, even if no character set was returned. This is consistent with the HTTP RFCs, but painful when used with HTML parsing, so you should ignore that attribute when no charset is set in the Content-Type header.
Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (their claim to fame). It has a compatibility API for BeautifulSoup too, if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything that isn't purely Python isn't allowed.
lxml.html also supports CSS3 selectors, so this sort of thing is trivial.
An example with lxml and xpath would look like this:
import urllib
import lxml.html

connection = urllib.urlopen('http://www.nytimes.com')
dom = lxml.html.fromstring(connection.read())

for link in dom.xpath('//a/@href'):  # select the url in href for all a tags (links)
    print link
import urllib2
import BeautifulSoup

request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

for a in soup.findAll('a', href=True):  # href=True skips anchors without the attribute
    if 'national-park' in a['href']:
        print 'found a url with national-park in the link'
The following code retrieves all the links available on a webpage using urllib2 and BeautifulSoup4:
import urllib2
from bs4 import BeautifulSoup

url = urllib2.urlopen("http://www.espncricinfo.com/").read()
soup = BeautifulSoup(url, 'html.parser')  # name a parser explicitly to avoid a warning

for line in soup.find_all('a'):
    print(line.get('href'))
Links can be found within a variety of attributes, so you can pass a list of those attributes to select.
For example, with src and href attributes (here I am using the starts-with ^ operator to specify that either of these attribute values starts with http):
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://stackoverflow.com/')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="http"], [src^="http"]')]
print(links)
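These are attribute = value selectors; there are also the commonly used $ (ends with) and * (contains) operators. A quick sketch reusing the soup object from above (the selector values are hypothetical illustrations):

pdf_links = [item['href'] for item in soup.select('a[href$=".pdf"]')]             # href ends with .pdf
question_links = [item['href'] for item in soup.select('a[href*="questions"]')]   # href contains "questions"
print(pdf_links, question_links)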
Under the hood, BeautifulSoup now uses lxml (when it is available as a parser). Requests, lxml, and list comprehensions make a killer combo.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.nytimes.com').content)
[x for x in dom.xpath('//a/@href') if '//' in x and 'nytimes.com' not in x]
In the list comprehension, the "'//' in x and 'nytimes.com' not in x" condition is a simple way to scrub the URL list of the site's 'internal' navigation URLs, etc.
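A quick illustration of what that filter keeps, on made-up sample URLs:

urls = ['/section/world',                         # relative internal link: dropped (no '//')
        'https://www.nytimes.com/section/world',  # absolute internal link: dropped
        'https://twitter.com/nytimes']            # external link: kept
print([x for x in urls if '//' in x and 'nytimes.com' not in x])
# ['https://twitter.com/nytimes']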
This script does what you're looking for, but it also resolves the relative links to absolute links.
import urllib
import lxml.html
import urlparse

def get_dom(url):
    connection = urllib.urlopen(url)
    return lxml.html.fromstring(connection.read())

def get_links(url):
    return resolve_links((link for link in get_dom(url).xpath('//a/@href')))

def guess_root(links):
    # Use the first absolute link to infer the site root (scheme + host).
    for link in links:
        if link.startswith('http'):
            parsed_link = urlparse.urlparse(link)
            scheme = parsed_link.scheme + '://'
            netloc = parsed_link.netloc
            return scheme + netloc

def resolve_links(links):
    links = list(links)  # materialize, so guess_root doesn't consume the generator
    root = guess_root(links)
    for link in links:
        if not link.startswith('http'):
            link = urlparse.urljoin(root, link)
        yield link

for link in get_links('http://www.google.com'):
    print link
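In Python 3, the same resolution is provided by urllib.parse.urljoin; a minimal sketch of its behavior (the base URL is hypothetical):

from urllib.parse import urljoin

base = 'http://www.google.com/intl/en/about/'
print(urljoin(base, '/policies'))      # http://www.google.com/policies
print(urljoin(base, 'products.html'))  # http://www.google.com/intl/en/about/products.html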
To find all the links, we will in this example use the urllib2 module together with the re module.
One of the most powerful functions in the re module is re.findall(). While re.search() is used to find the first match for a pattern, re.findall() finds all the matches and returns them as a list, with each entry representing one match (here, since the pattern contains two groups, each entry is a tuple of the groups).
import urllib2
import re

# connect to a URL (url is assumed to be defined above)
website = urllib2.urlopen(url)

# read html code
html = website.read()

# use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links
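A quick demonstration of the difference on a made-up snippet of HTML; note that because the pattern contains two groups, re.findall returns tuples here, with the full URL as the first element:

import re

html = '<a href="http://example.com">x</a> <a href="https://example.org">y</a>'
print(re.search('"((http|ftp)s?://.*?)"', html).group(1))  # http://example.com
print(re.findall('"((http|ftp)s?://.*?)"', html))
# [('http://example.com', 'http'), ('https://example.org', 'http')]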
Here's an example using @ars' accepted answer and the BeautifulSoup4, requests, and wget modules to handle the downloads.
import requests
import wget
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg_full/'
file_type = '.tar.gz'

response = requests.get(url)

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = url + link['href']
            wget.download(full_path)
I found the answer by @Blairg23 to work, after the following correction (covering the scenario where it failed to work correctly):
for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        if file_type in link['href']:
            full_path = urlparse.urljoin(url, link['href'])  # the urlparse module needs to be imported
            wget.download(full_path)
For Python 3:
urllib.parse.urljoin has to be used instead in order to obtain the full URL.
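A minimal Python 3 sketch of the corrected loop (it assumes url, file_type, response, and wget are set up as in the example above):

from urllib.parse import urljoin
from bs4 import BeautifulSoup, SoupStrainer

for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and file_type in link['href']:
        wget.download(urljoin(url, link['href']))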
BeautifulSoup's own parser can be slow. It might be more feasible to use lxml, which is capable of parsing directly from a URL (with some limitations, mentioned below).
import lxml.html

doc = lxml.html.parse(url)
links = doc.xpath('//a[@href]')

for link in links:
    print link.attrib['href']
The code above will return the links as is, and in most cases they will be relative links or absolute links from the site root. Since my use case was to extract only a certain type of link, below is a version that converts the links to full URLs and optionally accepts a glob pattern like *.mp3. It won't handle single and double dots in relative paths, though, but so far I haven't had the need for it. If you need to parse URL fragments containing ../ or ./, then urlparse.urljoin might come in handy.
NOTE: Direct lxml URL parsing doesn't handle loading from https and doesn't follow redirects, so for this reason the version below uses urllib2 + lxml.
#!/usr/bin/env python
import sys
import urllib2
import urlparse
import lxml.html
import fnmatch

try:
    import urltools as urltools
except ImportError:
    sys.stderr.write('To normalize URLs run: `pip install urltools --user`')
    urltools = None

def get_host(url):
    p = urlparse.urlparse(url)
    return "{}://{}".format(p.scheme, p.netloc)

if __name__ == '__main__':
    url = sys.argv[1]
    host = get_host(url)
    glob_patt = len(sys.argv) > 2 and sys.argv[2] or '*'

    doc = lxml.html.parse(urllib2.urlopen(url))
    links = doc.xpath('//a[@href]')
    for link in links:
        href = link.attrib['href']
        if fnmatch.fnmatch(href, glob_patt):
            if not href.startswith(('http://', 'https://', 'ftp://')):
                if href.startswith('/'):
                    href = host + href
                else:
                    parent_url = url.rsplit('/', 1)[0]
                    href = urlparse.urljoin(parent_url, href)
            if urltools:
                href = urltools.normalize(href)
            print href
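The usage is as follows (assuming the script is saved as, say, get_links.py; the script name and URL are illustrative):

python get_links.py http://example.com/music '*.mp3'

The optional second argument is a glob pattern and defaults to * when omitted.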
There can be many duplicate links, together with both external and internal links. To differentiate between the two and get only the unique links, using sets:
# Python 3.
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "http://www.espncricinfo.com/"
resp = urllib.request.urlopen(url)

# Get server encoding per recommendation of Martijn Pieters.
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

external_links = set()
internal_links = set()

for line in soup.find_all('a'):
    link = line.get('href')
    if not link:
        continue
    if link.startswith('http'):
        external_links.add(link)
    else:
        internal_links.add(link)

# Depending on usage, full internal links may be preferred.
full_internal_links = {
    urllib.parse.urljoin(url, internal_link)
    for internal_link in internal_links
}

# Print all unique external and full internal links.
for link in external_links.union(full_internal_links):
    print(link)