Unable to read HTML data - Python
I am trying to parse HTML data from a website using BeautifulSoup for Python. However, neither urllib2 nor mechanize is able to read the complete HTML. The returned data is:
<html>
<head>
<title>
EC 4.1.2.13 - Fructose-bisphosphate aldolase </title>
<meta name="description" content="Information on EC 4.1.2.13 - Fructose-bisphosphate aldolase">
<meta name="keywords" content="EC,Number,Enzyme,Pathway,Reaction,Organism,Substrate,Cofactor,Inhibitor,Compound,KM Value,KI Value,IC50 Value,pi Value,Turnover Number,pH,Temperature,Optimum,Range,Source Tissue,BLAST,Subunits,Modification,Crystallization,Stability,Purification">
</head>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN" "http://www.w3.org/TR/html4/frameset.dtd">
<frameset cols="190,*" border="0">
<frame name="navigation" src="flat_navigation.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
<frameset rows="110,*" border="0">
<frame name="header" src="flat_head.php4?ecno=4.1.2.13" frameborder="no">
<frame name="flat" src="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475" frameborder="no">
</frameset>
</frameset>
<noframes>
<body>
<h1>EC 4.1.2.13 - Fructose-bisphosphate aldolase </h1>
<a href="flat_result.php4?ecno=4.1.2.13&organism_list=Mycobacterium tuberculosis&Suchword=&UniProtAcc=P67475">More detailed information on the enzyme EC 4.1.2.13 - Fructose-bisphosphate aldolase</a>
Sorry, but your browser doesn't support frames. Please use another browser!
</body>
</noframes>
</html>
When I open the website manually in Internet Explorer, the whole HTML can be read. Is there any way to work around this using urllib2, mechanize, or BeautifulSoup?
Comments (1)
That's because the content is inside the frames; the page you fetched is only the frameset. You can either parse the page and look for the src attribute of the main <frame> element (here, the frame named "flat"), or request the frame directly. In most browsers, you can right-click inside the frame and select "Frame Properties" (or a similarly named item) to get the frame's URL.
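The idea of reading a frame's src from the frameset page and then requesting that URL can be sketched roughly as follows. This is only a minimal sketch, not the original poster's code: it uses the Python 3 standard library (html.parser and urllib.request) in place of urllib2 and BeautifulSoup, and the names FrameSrcCollector, frame_urls and fetch_frame are hypothetical helpers made up for this illustration. The page URL is not given in the question, so it has to be supplied by the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class FrameSrcCollector(HTMLParser):
    """Collect the src of every <frame> tag, keyed by its name attribute."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.frames[attributes.get("name")] = attributes["src"]


def frame_urls(page_html, base_url):
    """Return {frame name: absolute URL} for the frames in a frameset page."""
    collector = FrameSrcCollector()
    collector.feed(page_html)
    return {
        # The src values in the question contain a literal space
        # ("Mycobacterium tuberculosis"), so encode spaces before joining.
        name: urljoin(base_url, src.replace(" ", "%20"))
        for name, src in collector.frames.items()
    }


def _read(url):
    """Download a URL and decode it using the declared charset, if any."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_frame(page_url, frame_name="flat"):
    """Fetch the frameset page, then fetch the HTML of one named frame.

    Assumption: the frame of interest is the one named "flat", as in the
    output shown in the question.
    """
    urls = frame_urls(_read(page_url), page_url)
    return _read(urls[frame_name])
```

Calling fetch_frame with the address of the frameset page should return the HTML of the "flat" frame, which can then be passed to BeautifulSoup as usual.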