First time scraping a weather forecast website

Posted on 2025-02-06 19:52:37


I am learning web scraping as my first mini-project, currently working with Python. I want to extract weather data and use Python to show the weather where I live. I found the data I need by inspecting the tags, but my code keeps giving me all the numbers in the weather forecast table instead of the specific one I need. I tried selecting it by its index number, but that still did not work. This is the code I have written so far:

import requests
from bs4 import BeautifulSoup as bs

url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"
r = requests.get(url)

cast = bs(r.content, "lxml")

wthr = cast.find_all("div", {"class": "col-md-9"})
print(wthr)

Any help would be greatly appreciated. The data I want is the temperature data.

Also, can somebody explain the differences between using lxml and html.parser? I have seen both used widely and am curious how you would decide to use one over the other.


Answered by 以为你会在 on 2025-02-13 19:52:37


The legality of scraping should be considered before you start. You can read about it here: https://www.tutorialspoint.com/python_web_scraping/legality_of_python_web_scraping.htm
This site doesn't appear to have a robots.txt file, so it declares no crawling restrictions there.
Here is a very simplified way to get the table data published at the URL from your question. It uses html.parser to extract the data:

import requests
from bs4 import BeautifulSoup


def get_soup(my_url):
    """Download my_url and return the parsed document."""
    response = requests.get(my_url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors (4xx/5xx)
    return BeautifulSoup(response.text, "html.parser")


url = "http://kktcmeteor.org/tahminler/merkezler?m=ISKELE"

# get the whole html document
soup = get_soup(url)

# extract the header cells and the data cells of the first table
table = soup.find("table")
table_header = table.find_all("th")
table_data = table.find_all("td")

# both results are lists; pair them up position by position and print them
for header, data in zip(table_header, table_data):
    print(header.text + " --> " + data.text)

""" Result:
Tarih / Saat -->

Hava --> COK BULUTLU
Sıcaklık --> 27.5°C
İşba Sıcaklığı --> 17.9°C
Basınç --> 1003.5 hPa
Görüş --> 10 km
Rüzgar --> Batıdan (270) 5 kt.
12.06.2022 13:00 --> Genel Tablo Genel Harita
"""
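
Since the question asks only for the temperature, the header/value pairing above can be turned into a dictionary and the "Sıcaklık" entry looked up by name. The following is a minimal sketch, not a definitive implementation: `SAMPLE_HTML` is a hypothetical, simplified stand-in for the forecast table (the real page's markup may differ), and `table_to_dict` and `parse_celsius` are illustrative helper names, not part of any library.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the forecast table (an assumption);
# the markup of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Hava</th><th>Sıcaklık</th><th>Basınç</th></tr>
  <tr><td>COK BULUTLU</td><td>27.5°C</td><td>1003.5 hPa</td></tr>
</table>
"""


def table_to_dict(html):
    """Map each header cell's text to the data cell at the same position."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    values = [td.get_text(strip=True) for td in table.find_all("td")]
    return dict(zip(headers, values))


def parse_celsius(text):
    """Turn a string such as '27.5°C' into the float 27.5."""
    return float(text.replace("°C", "").strip())


readings = table_to_dict(SAMPLE_HTML)
temperature = parse_celsius(readings["Sıcaklık"])
print(temperature)  # 27.5
```

Looking the value up by its header text avoids depending on a fixed index, which would break if the site added or reordered columns. Against the live page, the string passed to `table_to_dict` would be the `response.text` from `requests.get`.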

This is just one way to do it, and it gets only the part shown in the transparent table on the site. Once more, follow any instructions stated in a site's robots.txt file. Regards...
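
On the lxml versus html.parser part of the question: html.parser ships with Python, so it needs no extra install, while lxml is a separate package that is generally faster. The two can also build slightly different trees from malformed HTML. One possible approach, sketched below, is to prefer lxml and fall back to html.parser when it is not installed; `make_soup` is a hypothetical helper name used only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # Beautiful Soup raises this when the requested parser is unavailable.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<table><tr><td>27.5°C</td></tr></table>")
print(soup.find("td").get_text())
```

For well-formed markup like this snippet, both parsers find the same cell, so the choice mostly comes down to speed and whether an extra dependency is acceptable.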
