How do I remove a large part of a string from a scraped page?

Posted on 2025-02-10 12:48:04

I made a web scraper to get the informative text of a Wikipedia page. I get the text I want, but I want to cut off a big part at the bottom. I have already tried some other solutions, but with those I don't get the headers and whitespace I need.

import requests
from bs4 import BeautifulSoup
import re


website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")

text = list()

text.extend(soup.findAll('mw-content-text'))

text_content = soup.text
text_content = re.sub(r'==.*?==+', '', text_content)
# text_content = text.replace('\n', '')

print(text_content)

Here, soup.text is all the text of the Wikipedia page with class='mw-content-text', printed as a string. This prints the overall text I need, but I need to cut the string off where it starts showing the text of the sources. I already tried the replace method, but it didn't do anything.

Given this page, I want to cut off what's under the red line in the big string of text I have scraped:

<img src="https://i.sstatic.net/vec7h.png" alt="screenshot of Wikipedia page with red line">
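A minimal sketch of this kind of cut: slice the scraped string at the first occurrence of the references heading. The heading text "Bronnen, noten en/of referenties" is an assumption about the Dutch page, and the sample string below is invented for illustration; check the live article for the exact wording.

```python
# Cut a scraped string at a marker heading (the marker text is an
# assumption about the Dutch Wikipedia references section).
MARKER = "Bronnen, noten en/of referenties"

def cut_at_marker(text: str, marker: str = MARKER) -> str:
    """Return text up to the first occurrence of marker, or all of it
    if the marker is not found."""
    idx = text.find(marker)
    return text if idx == -1 else text[:idx]

sample = "De kat is een huisdier.\n\nBronnen, noten en/of referenties\n1. bron"
print(cut_at_marker(sample))
```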

I tried something like this, which didn't work:

  for content in soup('span', {'class': 'mw-content-text'}):
      print(content.text)
      text = content.findAll('p', 'a')
      for t in text:
          print(text.text)

I also tried this:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import requests

website = urlopen("https://nl.wikipedia.org/wiki/Kat_(dier)").read()
soup = BeautifulSoup(website, 'lxml')

text = ''

for content in soup.find_all('p'):
    text += content.text

text = re.sub(r'\[.*?\]+', '', text)
text = text.replace('\n', '')

# print(text) 

but these approaches just gave me an unreadable mess of text. I still want the whitespaces and headers that my base code gives me.
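For reference, one way to keep the headers and whitespace is to restrict extraction to the content container (note that mw-content-text is an id on a div, not a tag name or a class on a span) and use get_text with a newline separator. A minimal sketch on an inline HTML snippet; the snippet is invented for illustration:

```python
from bs4 import BeautifulSoup
import re

# Invented snippet mimicking the Wikipedia content container.
html = """
<div id="mw-content-text">
  <h2>Kenmerken</h2>
  <p>De kat is een huisdier.[1]</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# mw-content-text is an id, so select it with the CSS "#" selector.
content = soup.select_one("#mw-content-text")
# separator="\n" keeps headings and paragraphs on their own lines.
text = content.get_text(separator="\n", strip=True)
# Drop footnote markers like [1].
text = re.sub(r"\[\d+\]", "", text)
print(text)
```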

Comments (2)

凉月流沐 2025-02-17 12:48:04

It is still a bit abstract, but you could achieve your goal by iterating over all children of the content div and breaking when a tag with class appendix appears:

for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))

Example

import requests
from bs4 import BeautifulSoup    

website = "https://nl.wikipedia.org/wiki/Kat_(dier)"
request = requests.get(website)
soup = BeautifulSoup(request.text, "html.parser")

for c in soup.select_one('#mw-content-text > div').find_all(recursive=False):
    if c.get('class') and 'appendix' in c.get('class'):
        break
    print(c.get_text(strip=True))
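The loop above can be exercised against a minimal, self-contained HTML snippet that mimics the structure the answer relies on; the snippet is invented, and the appendix class is the answer's assumption about the page markup:

```python
from bs4 import BeautifulSoup

# Invented snippet: direct children of the content div, with the
# references block carrying class="appendix" (assumption from above).
html = """
<div id="mw-content-text"><div>
  <p>Eerste alinea.</p>
  <p>Tweede alinea.</p>
  <div class="appendix">Bronnen en noten</div>
  <p>Navigatie-rommel.</p>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

kept = []
for c in soup.select_one("#mw-content-text > div").find_all(recursive=False):
    if c.get("class") and "appendix" in c.get("class"):
        break  # everything from the appendix onward is dropped
    kept.append(c.get_text(strip=True))

print(kept)
```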
伤感在游骋 2025-02-17 12:48:04

There is likely a more efficient solution, but here is a list comprehension that solves your issue:

# the rest of your code
references = [line for line in text_content.split('\n') if line.startswith("↑")]

Here's an alternative version that might be easier to understand:

# the rest of your code

# Turn text_content into a list of lines
text_content = text_content.split('\n')

references = []

# Iterate through each line and only save the lines that start
# with the symbol used for each reference on Wikipedia: "↑"
# (or "^" for English Wikipedia pages)

for line in text_content:
    if line.startswith("↑"):
        references.append(line)

Both scripts will do the same thing.
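Since the original goal was to remove the reference lines rather than collect them, the same startswith test can be inverted. A sketch on an invented sample string; the "↑" marker is taken from the answer above:

```python
# Keep every line that does NOT start with the reference marker,
# then rejoin with newlines so headers and blank lines survive.
text_content = "Kenmerken\n\nDe kat is een huisdier.\n↑ Bron 1\n↑ Bron 2"

cleaned = "\n".join(
    line for line in text_content.split("\n") if not line.startswith("↑")
)
print(cleaned)
```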
