How do I remove a large part of a string from a scraped page?
I made a web scraper to get the informative text of a Wikipedia page. I get the text I want, but I want to cut off a large part of the text at the bottom. I have already tried some other solutions, but with those I don't get the headers and whitespace I need.
import requests
from bs4 import BeautifulSoup
import re
website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")
text = list()
text.extend(soup.findAll('mw-content-text'))
text_content = soup.text
text_content = re.sub(r'==.*?==+', '', text_content)
# text_content = text.replace('\n', '')
print(text_content)
Here, `soup.text` is all the text of the Wikipedia page with the class='mw-content-text', printed as a string. This prints the overall text I need, but I need to cut off the string where it starts showing the text of the sources. I already tried the `replace` method, but it didn't do anything.
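Given this page, I want to cut off what's under the red line in the big string of text I have scraped:
<img src="https://i.sstatic.net/vec7h.png" alt="Screenshot of the Wikipedia page with a red line">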
I tried something like this, which didn't work:
for content in soup('span', {'class': 'mw-content-text'}):
    print(content.text)
    text = content.findAll('p', 'a')
    for t in text:
        print(text.text)
I also tried this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests
website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website,'lxml')
text = ''
for content in soup.find_all('p'):
    text += content.text
text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')
# print(text)
But these approaches just gave me an unreadable mess of text. I still want the whitespace and headers that my base code gives me.
Comments (2)
I think it is still a bit abstract, but you could reach your goal by iterating over all children and using `break` as soon as a tag with the class `appendix` appears.
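A minimal sketch of that idea, not necessarily the code this answer had in mind. The HTML below is a made-up stand-in for the real page, and it assumes the sources sit in a direct child of the `mw-content-text` container with the class `appendix`; on the live page the nesting may differ, so the container you iterate over may need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the scraped page (not the real markup).
html = """
<div id="mw-content-text">
  <h2>Gedrag</h2>
  <p>Katten slapen veel.</p>
  <div class="appendix"><h2>Bronnen</h2><p>Een bron</p></div>
  <p>Footer text</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="mw-content-text")

parts = []
# Walk over the direct child tags only, in document order.
for child in container.find_all(True, recursive=False):
    # Stop as soon as the appendix (sources) block is reached.
    if "appendix" in (child.get("class") or []):
        break
    parts.append(child.get_text())

# One line per kept element, so headers and paragraphs stay separated.
text_content = "\n".join(parts)
print(text_content)
```

Because the loop stops at the first matching tag, everything after the appendix is dropped as well, while the headers and line breaks before it are kept.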
There is likely a more efficient solution, but here is a list comprehension that solves your issue:
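A minimal sketch of one list-comprehension approach, not necessarily the one this answer had in mind. It works on the already scraped string and assumes the sources section starts at a line that consists only of a known heading; the helper name `cut_at_heading` and the heading `"Bronnen"` are illustrative assumptions, so check the heading text on the actual page.

```python
def cut_at_heading(text, heading):
    """Return the text before the first line that equals `heading`."""
    lines = text.splitlines()
    # Index of the first matching line; keep everything if there is none.
    cutoff = next(
        (i for i, line in enumerate(lines) if line.strip() == heading),
        len(lines),
    )
    # The list comprehension keeps only the lines before the cutoff,
    # so blank lines and headers above it are preserved.
    return "\n".join([line for i, line in enumerate(lines) if i < cutoff])


# Made-up sample text standing in for the scraped page.
sample = (
    "Kat\n\nDe kat is een huisdier.\n\nGedrag\nKatten slapen veel.\n\n"
    "Bronnen\n1. Een bron\n2. Nog een bron"
)
result = cut_at_heading(sample, "Bronnen")
print(result)
```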
Here's an alternative version that might be easier to understand:
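A minimal sketch of a plain-loop approach, not necessarily the one this answer had in mind. It assumes the sources section starts at a line that consists only of a known heading; the helper name `cut_at_heading_loop` and the heading `"Bronnen"` are illustrative assumptions.

```python
def cut_at_heading_loop(text, heading):
    """Return the text before the first line that equals `heading`."""
    kept = []
    for line in text.splitlines():
        # Stop at the heading that opens the sources section.
        if line.strip() == heading:
            break
        kept.append(line)
    # Re-join with newlines so blank lines and headers are preserved.
    return "\n".join(kept)


# Made-up sample text standing in for the scraped page.
sample_text = (
    "Kat\n\nDe kat is een huisdier.\n\nGedrag\nKatten slapen veel.\n\n"
    "Bronnen\n1. Een bron\n2. Nog een bron"
)
trimmed = cut_at_heading_loop(sample_text, "Bronnen")
print(trimmed)
```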
Both scripts will do the same thing.