Web scraping returns a foreign language even though everything is in English
I am very new to web scraping in Python. My code runs without errors and the output looks correct, except for the language it is printed in. I tried my hand at IMDb, the popular website: I inspected the HTML and I want to extract each movie's name, rating, etc.
This is the IMDb page with the top 250 movies and their ratings:
https://www.imdb.com/chart/top/
My code to scrape the data is as follows. I use the BeautifulSoup and requests modules:
import requests
from bs4 import BeautifulSoup

# Use the requests module to access the IMDb website
source = requests.get('https://www.imdb.com/chart/top/')
# Raise an error if the request failed (e.g. a problem with the address)
source.raise_for_status()
# Parse the returned HTML
soup = BeautifulSoup(source.text, 'html.parser')
movies = soup.find('tbody', class_='lister-list').find_all('tr')
# print(len(movies))
# Iterate through each tr tag
for movie in movies:
    # Use break to check only the first element of the list
    # break
    name = movie.find('td', class_='titleColumn').a.text
    rank = movie.find('td', class_='titleColumn').get_text(strip=True).split('.')[0]
    year = movie.find('td', class_='titleColumn').span.text.strip('()')
    rating = movie.find('td', class_='ratingColumn imdbRating').strong.text
    print(name, rank, year, rating)
Everything on the website is in English, so how come my output is in a foreign language?
The output is the following:
刺激1995 1 1994 9.2
教父 2 1972 9.2
黑暗騎士 3 2008 9.0
教父第二集 4 1974 9.0
十二怒漢 5 1957 8.9
辛德勒的名單 6 1993 8.9
魔戒三部曲:王者再臨 7 2003 8.9
黑色追緝令 8 1994 8.9
魔戒首部曲:魔戒現身 9 2001 8.8
黃昏三鏢客 10 1966 8.8
阿甘正傳 11 1994 8.8
鬥陣俱樂部 12 1999 8.7
全面啟動 13 2010 8.7
魔戒二部曲:雙城奇謀 14 2002 8.7
星際大戰五部曲:帝國大反擊 15 1980 8.7
駭客任務 16 1999 8.7
四海好傢伙 17 1990 8.7
飛越杜鵑窩 18 1975 8.6
火線追緝令 19 1995 8.6
七武士 20 1954 8.6
風雲人物 21 1946 8.6
沉默的羔羊 22 1991 8.6
Solution
You can add an Accept-Language header to your request before sending it.

Explanation
Accept-Language is an HTTP header that indicates the language and locale the client prefers (according to the Accept-Language MDN docs). By adding this header, you tell the server that you want the response in the English (US) language and locale. Therefore, if the server supports that language and also honors this header, you will get what you need.

headers is a key-value variable, and Python requests lets you define it with a Python dict. It is optional, and you can add it by following this documentation: Python Requests - Custom Headers

I assume that your IP is located in China? There is a chance that IMDb does geolocation and sets your language to Mandarin.
You have the same problem as this person, and I think the same answer applies. Add a header to your request and set the language to English.
Python change Accept-Language using requests
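
The header-based fix suggested in the answers could look roughly like the sketch below. The header value `en-US,en;q=0.9` and the names `URL`, `HEADERS`, `build_request` and `fetch` are assumptions chosen for illustration, not something taken from the question. Whether the server actually honors the header is also not guaranteed.

```python
import requests

URL = "https://www.imdb.com/chart/top/"

# Assumed preference: US English first, then any English.
# The q-value is a relative weight between 0 and 1.
HEADERS = {"Accept-Language": "en-US,en;q=0.9"}


def build_request(url=URL, headers=HEADERS):
    """Prepare the GET request without sending it, so the headers can be inspected."""
    return requests.Request("GET", url, headers=headers).prepare()


def fetch(url=URL, headers=HEADERS):
    """Send the GET request with the custom headers and return the response."""
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response
```

Calling `fetch()` and passing `fetch().text` to `BeautifulSoup` in place of `source.text` would keep the rest of the scraping loop unchanged. `build_request()` is only there to check, without any network traffic, that the header is attached to the outgoing request.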