Web scraping returns a foreign language even though everything is in English

Posted 2025-02-09 07:44:05

I am very new to web scraping in Python. My code runs without errors and the output looks correct, but the problem is the language of the output. I tried my hand at IMDb, the popular website. I inspected the HTML, and I want to extract the name of each movie, its rating, etc.
This is the IMDb page with 250 movies and their ratings:
https://www.imdb.com/chart/top/
My code to scrape the data is below. I use the BeautifulSoup and requests modules.

import requests
from bs4 import BeautifulSoup

# Use the requests module to fetch the IMDb Top 250 page
source = requests.get('https://www.imdb.com/chart/top/')
# Raise an exception if the server answered with an error status
source.raise_for_status()
# Parse the returned HTML
soup = BeautifulSoup(source.text, 'html.parser')
movies = soup.find('tbody', class_='lister-list').find_all('tr')
# print(len(movies))
# Iterate through each tr tag
for movie in movies:
    # Use break to check only the first element of the list
    # break
    title_cell = movie.find('td', class_='titleColumn')

    name = title_cell.a.text
    rank = title_cell.get_text(strip=True).split('.')[0]
    year = title_cell.span.text.strip('()')
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text

    print(name, rank, year, rating)

Everything on the website is in English, so how come my output is in a foreign language?

The output is the following:

刺激1995 1 1994 9.2
教父 2 1972 9.2
黑暗騎士 3 2008 9.0
教父第二集 4 1974 9.0
十二怒漢 5 1957 8.9
辛德勒的名單 6 1993 8.9
魔戒三部曲：王者再臨 7 2003 8.9
黑色追緝令 8 1994 8.9
魔戒首部曲：魔戒現身 9 2001 8.8
黃昏三鏢客 10 1966 8.8
阿甘正傳 11 1994 8.8
鬥陣俱樂部 12 1999 8.7
全面啟動 13 2010 8.7
魔戒二部曲：雙城奇謀 14 2002 8.7
星際大戰五部曲：帝國大反擊 15 1980 8.7
駭客任務 16 1999 8.7
四海好傢伙 17 1990 8.7
飛越杜鵑窩 18 1975 8.6
火線追緝令 19 1995 8.6
七武士 20 1954 8.6
風雲人物 21 1946 8.6
沉默的羔羊 22 1991 8.6

翻身的咸鱼 2025-02-16 07:44:05

Solution

You can add an Accept-Language header to your request.

headers = {'Accept-Language': 'en-US,en;q=0.5'}

source = requests.get('https://www.imdb.com/chart/top/', headers=headers)
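
For context, here is a minimal sketch of how the header could fit into the scraper from the question. It assumes the page still uses the `lister-list`, `titleColumn`, and `ratingColumn imdbRating` class names shown in the question, which may no longer hold for the live site. The helper names `fetch_html` and `parse_titles` and the `SAMPLE_HTML` string are made up for illustration; the sample markup is hand-written so the parsing step can be tried without a network call.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {'Accept-Language': 'en-US,en;q=0.5'}


def fetch_html(url):
    """Download a page while asking the server for English content."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def parse_titles(html):
    """Return (rank, name, year, rating) tuples from the chart markup."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.select('tbody.lister-list tr'):
        title_cell = tr.find('td', class_='titleColumn')
        rank = title_cell.get_text(strip=True).split('.')[0]
        name = title_cell.a.text
        year = title_cell.span.text.strip('()')
        rating = tr.find('td', class_='ratingColumn imdbRating').strong.text
        rows.append((rank, name, year, rating))
    return rows


# Hypothetical, hand-written markup that mimics the class names in the question.
SAMPLE_HTML = """
<table><tbody class="lister-list">
<tr>
  <td class="titleColumn">1. <a href="#">The Shawshank Redemption</a> <span>(1994)</span></td>
  <td class="ratingColumn imdbRating"><strong>9.2</strong></td>
</tr>
</tbody></table>
"""

print(parse_titles(SAMPLE_HTML))
```

Against the real site you would call `parse_titles(fetch_html('https://www.imdb.com/chart/top/'))` instead of passing the sample string.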

Explanation

Accept-Language is an HTTP request header that indicates the languages and locale the client prefers (see the MDN documentation for Accept-Language). In the value 'en-US,en;q=0.5', en-US has the default weight of 1 and generic en is a lower-priority fallback with weight 0.5. By adding this header, you tell the server that you want the response in English (US). If the server supports that language and honors the header, you will get the content you need.
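
To make the weights in the header value concrete, here is a small sketch that splits an Accept-Language string into languages and their `q` values. The function name `parse_accept_language` is made up for this example, and it is a simplification: a real server applies more rules (wildcards, malformed values) than this handles.

```python
def parse_accept_language(value):
    """Split an Accept-Language value into (language, weight) pairs,
    sorted from most to least preferred. A missing q defaults to 1.0."""
    prefs = []
    for item in value.split(','):
        parts = item.strip().split(';')
        lang = parts[0].strip()
        q = 1.0
        for param in parts[1:]:
            key, _, val = param.strip().partition('=')
            if key == 'q':
                q = float(val)
        prefs.append((lang, q))
    # sorted() is stable, so languages with equal weight keep their order
    return sorted(prefs, key=lambda p: p[1], reverse=True)


print(parse_accept_language('en-US,en;q=0.5'))
# prints [('en-US', 1.0), ('en', 0.5)]
```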
Because HTTP headers are key-value pairs, Requests lets you define them with a Python dict and pass it through the headers parameter. The parameter is optional; for details, see the Requests documentation: Python Requests - Custom Headers
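
As a rough illustration of the dict-based headers, the sketch below builds requests without sending them, so you can inspect what would go over the wire. It also shows setting the header once on a `requests.Session` so every request from that session carries it. The variable names are illustrative only.

```python
import requests

url = 'https://www.imdb.com/chart/top/'

# Per-request headers: a plain dict passed via the headers parameter.
request = requests.Request('GET', url, headers={'Accept-Language': 'en-US,en;q=0.5'})
prepared = request.prepare()
print(prepared.headers['Accept-Language'])

# Session-wide headers: set once, merged into every request from this session.
session = requests.Session()
session.headers.update({'Accept-Language': 'en-US,en;q=0.5'})
prepared_from_session = session.prepare_request(requests.Request('GET', url))
print(prepared_from_session.headers['Accept-Language'])
```

Nothing is sent here; to actually fetch the page you would call `session.get(url)` or `session.send(prepared_from_session)`.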
