Scraping values from an HTML head and saving them as a CSV file in Python

Posted 2024-10-16 03:31:57

All,

I've just started using Python (v 2.7.1) and one of my first programs is trying to scrape information from a website containing power station data using the Standard Library and BeautifulSoup to handle the HTML elements.

The data I'd like to access is available either in the 'Head' section of the HTML or as tables within the main body. The website will generate a CSV file from its data if the CSV link is clicked.

Using a couple of sources on this website, I've managed to cobble together the code below, which pulls the data out and saves it to a file, but the output still contains literal \n designators instead of line breaks. Try as I might, I can't get a correct CSV file to save out.

I am sure it's something simple but need a bit of help if possible!

from BeautifulSoup import BeautifulSoup

import urllib2,string,csv,sys,os
from string import replace

bm_url = 'http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1=T_COTPS-4&param2=&param3=&param4=&param5=2011-02-05&param6=*'

data = urllib2.urlopen(bm_url).read()
soup = BeautifulSoup(data)
data = str(soup.findAll('head',limit=1))

data = replace(data,'[<head>','')
data = replace(data,'<script language="JavaScript" src="/bwx_generic.js"></script>','')
data = replace(data,'<link rel="stylesheet" type="text/css" href="/bwx_style.css" />','')
data = replace(data,'<title>Historic Physical Balancing Mechanism Data</title>','')
data = replace(data,'<script language="JavaScript">','')
data = replace(data,' </script>','')
data = replace(data,'</head>]','')
data = replace(data,'var gs_csv=','')
data = replace(data,'"','')
data = replace(data,"'",'')
data = data.strip()

file_location = 'c:/temp/'
file_name = file_location + 'DataExtract.txt'

file = open(file_name,"wb")
file.write(data)
file.close()

从来不烧饼 2024-10-23 03:31:57

Don't turn it back into a string and then use replace. That completely defeats the point of using BeautifulSoup!

Try starting like this:

scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]

Then you can use:

  1. partition('=')[2] to cut off the "var gs_csv" bit.
  2. strip(' \n"') to remove unwanted characters at each end (space, newline, ")
  3. replace("\\n","\n") to sort out the new lines.
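
The three steps above can be sketched on a small sample. This is only an illustration: the `javascriptdata` string below is a hypothetical stand-in for the script body (the real page's content may differ), and the final `strip()` is there because a trailing escaped `\n` becomes a trailing line break after step 3.

```python
# Hypothetical script body; the real page's content may differ.
javascriptdata = '\n var gs_csv="HDR,Station,Date\\nT_COTPS-4,2011-02-05,100\\n" \n'

# 1. Keep only what follows the first '=' (drops "var gs_csv").
data = javascriptdata.partition('=')[2]
# 2. Trim spaces, newlines and double quotes from both ends.
data = data.strip(' \n"')
# 3. Turn each literal backslash-n pair into a real line break.
data = data.replace("\\n", "\n")
# A trailing escaped \n is now a trailing newline, so strip once more.
data = data.strip()

print(data)
```

Note that `partition('=')` splits at the first `=` only, so any later `=` inside the data itself is left alone.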

Incidentally, replace is a string method, so you don't have to import it separately; you can just call data.replace(...).

Finally, you need to split the text into CSV rows. You could save it and reopen it, then load it into a csv.reader. Alternatively, you could use the StringIO module to turn it into something you can feed directly to csv.reader (i.e. without saving a file first). But I think this data is simple enough that you can get away with doing:

for line in data.splitlines():
    row = line.split(",")
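
For comparison, the StringIO route mentioned above might look like the sketch below. It assumes Python 3, where the class lives in `io.StringIO` (in Python 2.7 it is `StringIO.StringIO`), and the sample strings are hypothetical.

```python
import csv
import io

# Hypothetical cleaned text, standing in for the scraped data.
data = "HDR,Station,Date\nT_COTPS-4,2011-02-05,100"

# io.StringIO wraps the string in a file-like object, so csv.reader
# can parse it without a temporary file on disk.
rows = list(csv.reader(io.StringIO(data)))

# Where the two approaches differ: a quoted field containing a comma.
tricky = 'a,"b,c",d'
via_reader = next(csv.reader(io.StringIO(tricky)))
via_split = tricky.split(",")

print(rows)
print(via_reader)
print(via_split)
```

If no field is ever quoted, the plain `split(",")` loop gives the same rows. `csv.reader` only starts to matter once a field can contain a comma or a quote.
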
聊慰 2024-10-23 03:31:57

SOLUTION

from BeautifulSoup import BeautifulSoup
import urllib2,string,csv,sys,os,time

bm_url_stem = "http://www.bmreports.com/servlet/com.logica.neta.bwp_PanBMDataServlet?param1="
bm_station = "T_COTPS-3"
bm_param = "&param2=&param3=&param4=&param5="
bm_date = "2011-02-04"
bm_param6 = "&param6=*"

bm_full_url = bm_url_stem + bm_station + bm_param + bm_date + bm_param6

data = urllib2.urlopen(bm_full_url).read()
soup = BeautifulSoup(data)
scripttag = soup.head.findAll("script")[1]
javascriptdata = scripttag.contents[0]
javascriptdata = javascriptdata.partition('=')[2]
javascriptdata = javascriptdata.strip(' \n"')
javascriptdata = javascriptdata.replace("\\n","\n")
javascriptdata = javascriptdata.strip()

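# Open in binary mode ("wb"), as the Python 2 csv module expects;
# the with block closes the file once all rows are written.
with open("c:/temp/" + bm_station + "_" + bm_date + ".csv", "wb") as outfile:
    csvwriter = csv.writer(outfile)
    for line in javascriptdata.splitlines():
        row = line.split(",")
        csvwriter.writerow(row)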
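
For readers on Python 3, where `urllib2` and the old `BeautifulSoup` import are not available, the same flow can be sketched with the standard library alone. This swaps BeautifulSoup for `html.parser.HTMLParser`, so it is not the method used above. `ScriptCollector`, `SAMPLE_HTML` and the output path are hypothetical names chosen for illustration, and the sample page only imitates the head described in the question.

```python
import csv
import os
import tempfile
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical page, shaped like the <head> described in the question.
SAMPLE_HTML = (
    '<html><head>'
    '<script language="JavaScript" src="/bwx_generic.js"></script>'
    '<title>Historic Physical Balancing Mechanism Data</title>'
    '<script language="JavaScript">\n'
    ' var gs_csv="HDR,Station,Date\\nT_COTPS-4,2011-02-05,100\\n"\n'
    ' </script>'
    '</head><body></body></html>'
)

parser = ScriptCollector()
parser.feed(SAMPLE_HTML)
parser.close()

# The second <script> holds the data, as in the solution above.
text = parser.scripts[1].partition("=")[2].strip(' \n"')
text = text.replace("\\n", "\n").strip()

out_path = os.path.join(tempfile.gettempdir(), "DataExtract.csv")
# In Python 3 the csv module wants text mode with newline="".
with open(out_path, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in text.splitlines():
        writer.writerow(line.split(","))
```

The only Python 3 specific change to the writing step is the file mode: text mode with `newline=""` replaces the `"wb"` used under Python 2.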