Removing HTML tags when scraping Wikipedia with Python's urllib2 and BeautifulSoup


I am trying to crawl Wikipedia to get some data for text mining. I am using Python's urllib2 and BeautifulSoup. My question is: is there an easy way to get rid of the unnecessary tags (like links 'a' or 'span') from the text I read?

For this scenario:

import urllib2
from BeautifulSoup import *

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open("http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes")
pool = BeautifulSoup(infile.read())
res = pool.findAll('div', attrs={'class': 'mw-content-ltr'})  # to get to the content directly
paragraphs = res[0].findAll("p")  # get all paragraphs

I get the paragraphs with lots of reference tags like:

paragraphs[0] =

<p><b>Data mining</b> (the analysis step of the <b>knowledge discovery in databases</b> process,<sup id="cite_ref-Fayyad_0-0" class="reference"><a href="#cite_note-Fayyad-0"><span>[</span>1<span>]</span></a></sup> or KDD), a relatively young and interdisciplinary field of <a href="/wiki/Computer_science" title="Computer science">computer science</a><sup id="cite_ref-acm_1-0" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup><sup id="cite_ref-brittanica_2-0" class="reference"><a href="#cite_note-brittanica-2"><span>[</span>3<span>]</span></a></sup> is the process of discovering new patterns from large <a href="/wiki/Data_set" title="Data set">data sets</a> involving methods at the intersection of <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a>, <a href="/wiki/Statistics" title="Statistics">statistics</a> and <a href="/wiki/Database_system" title="Database system">database systems</a>.<sup id="cite_ref-acm_1-1" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> The goal of data mining is to extract knowledge from a data set in a human-understandable structure<sup id="cite_ref-acm_1-2" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup> and involves database and <a href="/wiki/Data_management" title="Data management">data management</a>, <a href="/wiki/Data_Pre-processing" title="Data Pre-processing">data preprocessing</a>, <a href="/wiki/Statistical_model" title="Statistical model">model</a> and <a href="/wiki/Statistical_inference" title="Statistical inference">inference</a> considerations, interestingness metrics, <a href="/wiki/Computational_complexity_theory" title="Computational complexity theory">complexity</a> considerations, post-processing of found structure, <a href="/wiki/Data_visualization" title="Data visualization">visualization</a> and <a href="/wiki/Online_algorithm" title="Online algorithm">online updating</a>.<sup id="cite_ref-acm_1-3" class="reference"><a href="#cite_note-acm-1"><span>[</span>2<span>]</span></a></sup></p>

Any ideas on how to remove them and get plain text?


眼睛会笑 2024-12-21 13:41:16

This is how you could do it with lxml (and the lovely requests):

import requests
import lxml.html as lh
from BeautifulSoup import UnicodeDammit

URL = "http://en.wikipedia.org/w/index.php?title=data_mining&printable=yes"
HEADERS = {'User-agent': 'Mozilla/5.0'}

def lhget(*args, **kwargs):
    # fetch the page and parse it into an lxml tree,
    # letting UnicodeDammit work out the right decoding
    r = requests.get(*args, **kwargs)
    html = UnicodeDammit(r.content).unicode
    tree = lh.fromstring(html)
    return tree

def remove(el):
    # detach an element from its parent (and so from the tree)
    el.getparent().remove(el)

tree = lhget(URL, headers=HEADERS)

# first <p> inside the article body
el = tree.xpath("//div[@class='mw-content-ltr']/p")[0]

# drop every citation superscript, e.g. <sup class="reference">[1]</sup>
# (the leading // makes this XPath search the whole document)
for ref in el.xpath("//sup[@class='reference']"):
    remove(ref)

print lh.tostring(el, pretty_print=True)

print el.text_content()
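
For comparison, roughly the same cleanup could be done directly on the BeautifulSoup objects from the question. A minimal sketch, assuming the paragraphs list from the question's code: strip the <sup class="reference"> footnote markers with extract() and join the remaining text nodes.

for p in paragraphs:
    # remove citation superscripts such as <sup class="reference">[1]</sup>
    for sup in p.findAll('sup', attrs={'class': 'reference'}):
        sup.extract()
    # join the remaining text nodes into plain text
    print ''.join(p.findAll(text=True))
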
喜你已久 2024-12-21 13:41:16
for p in paragraphs(text=True):
    print p

Additionally you could use api.php instead of index.php:

#!/usr/bin/env python
import sys
import time
import urllib, urllib2
import xml.etree.cElementTree as etree

# prepare request
maxattempts = 5 # how many times to try the request before giving up
maxlag = 5 # seconds http://www.mediawiki.org/wiki/Manual:Maxlag_parameter
params = dict(action="query", format="xml", maxlag=maxlag,
              prop="revisions", rvprop="content", rvsection=0,
              titles="data_mining")
request = urllib2.Request(
    "http://en.wikipedia.org/w/api.php?" + urllib.urlencode(params), 
    headers={"User-Agent": "WikiDownloader/1.2",
             "Referer": "http://stackoverflow.com/q/8044814"})
# make request
for _ in range(maxattempts):
    response = urllib2.urlopen(request)
    if response.headers.get('MediaWiki-API-Error') == 'maxlag':
        t = response.headers.get('Retry-After', 5)
        print "retrying in %s seconds" % (t,)
        time.sleep(float(t))
    else:
        break # ready to read
else: # exhausted all attempts
    sys.exit(1)

# download & parse xml 
tree = etree.parse(response)

# find rev data 
rev_data = tree.findtext('.//rev')
if not rev_data:
    print 'MediaWiki-API-Error:', response.headers.get('MediaWiki-API-Error')
    tree.write(sys.stdout)
    print
    sys.exit(1)

print(rev_data)

Output

{{Distinguish|analytics|information extraction|data analysis}}

'''Data mining''' (the analysis step of the '''knowledge discovery in databases..
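
Note that api.php returns raw wiki markup rather than HTML, so templates such as {{Distinguish|...}} and the quote markers still need stripping before text mining. A rough sketch of one way to do that with regular expressions (the strip_wiki_markup helper below is an illustration, not part of the answer above; a dedicated wikitext parser such as mwparserfromhell would be more robust):

import re

def strip_wiki_markup(text):
    # drop {{...}} templates (non-nested)
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # drop self-closing and paired <ref> footnotes
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"(?s)<ref[^>]*>.*?</ref>", "", text)
    # turn [[target|label]] / [[target]] links into their visible text
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # remove the '''bold''' and ''italic'' quote markers
    text = re.sub(r"'{2,}", "", text)
    return text.strip()

print strip_wiki_markup(rev_data)
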
梦忆晨望 2024-12-21 13:41:16

These seem to work on Beautiful Soup tag nodes. The parentNode gets modified so that the relevant tags are removed, and the found tags are also returned as lists to the caller.

# These methods use the BeautifulSoup 4 API and need: from bs4 import element

@staticmethod
def seperateCommentTags(parentNode):
    # collect comment nodes first, then extract them from the tree
    commentTags = []
    for descendant in parentNode.descendants:
        if isinstance(descendant, element.Comment):
            commentTags.append(descendant)
    for commentTag in commentTags:
        commentTag.extract()
    return commentTags

@staticmethod
def seperateScriptTags(parentNode):
    # pull out every <script> tag and return the extracted nodes
    scripttags = parentNode.find_all('script')
    scripts = []
    for scripttag in scripttags:
        script = scripttag.extract()
        if script is not None:
            scripts.append(script)
    return scripts
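
A possible usage sketch, assuming the two methods live on some helper class (called HtmlCleaner here purely for illustration; the class name and page.html file are not from the answer):

from bs4 import BeautifulSoup

html = open('page.html').read()                    # any HTML source
soup = BeautifulSoup(html)                         # the methods above expect bs4 nodes

comments = HtmlCleaner.seperateCommentTags(soup)   # soup is modified in place
scripts = HtmlCleaner.seperateScriptTags(soup)
print '%d comments, %d script blocks removed' % (len(comments), len(scripts))
print soup.get_text()                              # remaining plain text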