Scrape values from an HTML head and save as a CSV file in Python
All,
I've just started using Python (v2.7.1), and one of my first programs tries to scrape information from a website containing power station data, using the standard library and BeautifulSoup to handle the HTML elements.
The data I'd like to access is available either in the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.
Using a couple of sources on this website, I've managed to cobble together the code below, which pulls the data out and saves it to a file; however, the output still contains the literal \n markers. Try as I might, I can't get a correct CSV file to save out.
I'm sure it's something simple, but I need a bit of help if possible!
from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os
from string import replace
bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'
data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))
data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()
file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'
file = open(file_name,"wb")
file.write(data)
file.close()
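For context on where those stray \n markers come from: the CSV lives inside a JavaScript string literal on the page, so the row breaks arrive as the two characters backslash + n rather than real newlines, and writing the text out verbatim gives one long line. A minimal illustration (the sample value is made up, shaped like the site's data):

```python
# What the replace() chain above leaves behind: the row separators are
# still the two-character escape sequence backslash + 'n', not newlines.
scraped = 'HDR\\nT_COTPS-4,2011-02-05\\nFTR'   # hypothetical sample

print(scraped.count('\n'))    # 0  -- no real newlines
print(scraped.count('\\n'))   # 2  -- two literal backslash-n sequences

# Decoding the escapes turns it into proper multi-line CSV text.
fixed = scraped.replace('\\n', '\n')
print(len(fixed.splitlines()))  # 3 rows
```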
Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!
Try starting like this:
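The code block that originally followed here appears to have been lost. Judging from the steps below, it parsed the page and took the text of the script element inside the head directly, which in the BeautifulSoup 3 API the question imports would be roughly soup.head.script.string. A self-contained stand-in for that starting point, using the stdlib HTML parser in place of BeautifulSoup and an abridged, hypothetical sample of the page's head:

```python
# Stand-in sketch: pull the <script> text out of <head> directly,
# instead of str()-ing the whole head and deleting tags one by one.
try:
    from html.parser import HTMLParser   # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser    # Python 2, as in the question's era

# Abridged, made-up stand-in for the page bm_url serves.
SAMPLE = '''<html><head>
<title>Historic Physical Balancing Mechanism Data</title>
<script language="JavaScript">
var gs_csv='HDR\\nT_COTPS-4,2011-02-05\\nFTR'
</script>
</head><body></body></html>'''

class ScriptGrabber(HTMLParser):
    """Collect the character data of every <script> element in <head>."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_head = False
        self.in_script = False
        self.scripts = []
    def handle_starttag(self, tag, attrs):
        if tag == 'head':
            self.in_head = True
        elif tag == 'script' and self.in_head:
            self.in_script = True
            self.scripts.append('')
    def handle_endtag(self, tag):
        if tag == 'head':
            self.in_head = False
        elif tag == 'script':
            self.in_script = False
    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data

parser = ScriptGrabber()
parser.feed(SAMPLE)
raw = parser.scripts[-1].strip()
print(raw)  # -> var gs_csv='HDR\nT_COTPS-4,2011-02-05\nFTR'
```

From here, raw is a single string holding the gs_csv assignment, ready for the string-method cleanup described next.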
Then you can use:

partition('=')[2] to cut off the "var gs_csv" bit,
strip(' \n"') to remove unwanted characters at each end (space, newline, "), and
replace("\\n", "\n") to sort out the new lines.

Incidentally, replace is a string method, so you don't have to import it separately; you can just call data.replace(...).
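Chained together, the three string methods turn the scraped script text into plain multi-line CSV text. A short sketch, with a hypothetical value shaped like the page's gs_csv assignment:

```python
# Hypothetical scraped script text, shaped like the page's
# `var gs_csv="..."` assignment (row breaks are literal \n escapes).
raw = 'var gs_csv="HDR\\nT_COTPS-4,2011-02-05,100\\nFTR"\n'

data = raw.partition('=')[2]       # drop the 'var gs_csv' prefix
data = data.strip(' \n"')          # trim quotes and whitespace at both ends
data = data.replace('\\n', '\n')   # decode the literal \n escapes
print(data)                        # three real lines of CSV text
```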
.Finally, you need to separate it as csv. You could save it and reopen it, then load it into a csv.reader. You could use the
StringIO
module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:解决方案
SOLUTION
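Both routes can be sketched as follows; the cleaned-up sample text is hypothetical, and on Python 3 the StringIO module the answer names lives in io:

```python
import csv
try:
    from StringIO import StringIO   # Python 2 module named in the answer
except ImportError:
    from io import StringIO         # Python 3 equivalent

# Hypothetical cleaned-up text after the partition/strip/replace steps.
data = 'HDR,one,two\nT_COTPS-4,2011-02-05,100\nFTR,end,'

# Route 1: wrap the string in StringIO and feed it straight to csv.reader,
# no intermediate file needed.
rows = list(csv.reader(StringIO(data)))

# Route 2: the data has no quoted fields or embedded commas, so a plain
# split gives the same result.
rows2 = [line.split(',') for line in data.splitlines()]

print(rows[1])  # ['T_COTPS-4', '2011-02-05', '100']
```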