Scraping multiple html files to CSV

Posted 2024-07-23 02:02:53 · 1444 words · 6 views · 0 comments

I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.

This is my first attempt at code (Python), scraping, and I just installed Ubuntu 9.04 on my crappy pentium IV. Needless to say I am newb and have some roadblocks.

How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style url, or is there another way to point it to /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1230?

I just need rows that start with <tr class="evenColor"> and end with </tr>. Ideally I only want the rows that contain "SHOT", "MISS" or "GOAL", but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per htm file.

Also I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can id them in their own column in the final database. I don't even know where to begin for that. This is what I have so far:

#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
##but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)

html = page.read()
soup = BeautifulSoup(html)
##this confuses me and seems redundant
pl = open("input_file.html","r")
chances = open("chancesforsql.csv","w")

table = soup.find("table", border=0)
for row in table.findAll("tr", attrs={"class": "evenColor"}):
    print(row)
#should I do this instead of before?
outfile = open("shooting.csv", "w")

##how do I end it?

Should I be using IDLE or something like it, or just the Terminal in Ubuntu 9.04?

Comments (2)

云之铃。 2024-07-30 02:02:53

You won't need mechanize. Since I don't know exactly what the HTML looks like, I'd first check what matches, like this:

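import glob
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

# glob is case-sensitive on Linux, so match the upper-case .HTM names from the question
for filename in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/PL02*.HTM')):
    soup = BeautifulSoup(open(filename, "r").read()) # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={ "class" : "evenColor" }):
        print(a_tr)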

Then pick out the fields you want and print them to stdout separated by commas (redirecting the output to a file with >), or write the CSV directly from Python.
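For the "write the CSV from Python" route, here is a rough sketch rather than a definitive implementation. To stay dependency-free it swaps BeautifulSoup for the standard-library `html.parser` and is written for Python 3. The names `EvenColorRows`, `rows_from_file` and `scrape_to_csv` are made up for illustration. It assumes reasonably well-formed markup, that the event type (SHOT, MISS, GOAL) sits alone in its own cell, and that the files decode as latin-1. Bold text needs no special handling because only the text inside each cell is kept.

```python
import csv
import glob
import os
from html.parser import HTMLParser

EVENTS = {"SHOT", "MISS", "GOAL"}  # assumed to appear as the full text of one cell


class EvenColorRows(HTMLParser):
    """Collect the cell text of every <tr class="evenColor"> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None
        self._depth = 0    # tables nested inside the current row

    def _close_cell(self):
        if self._cell is not None:
            # join the pieces and collapse runs of whitespace
            self._row.append(" ".join(" ".join(self._cell).split()))
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if self._row is None:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "tr" and "evenColor" in classes:
                self._row, self._depth = [], 0
        elif tag == "table":
            self._depth += 1  # text of a nested table stays in the outer cell
        elif tag == "td" and self._depth == 0:
            self._close_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._row is None:
            return
        if tag == "table" and self._depth > 0:
            self._depth -= 1
        elif self._depth == 0:
            if tag == "td":
                self._close_cell()
            elif tag in ("tr", "table"):
                self._close_cell()
                self.rows.append(self._row)
                self._row = None


def rows_from_file(path):
    """Return the evenColor rows of one file that contain a wanted event."""
    parser = EvenColorRows()
    with open(path, encoding="latin-1") as handle:  # encoding is an assumption
        parser.feed(handle.read())
    parser.close()
    return [row for row in parser.rows if EVENTS.intersection(row)]


def scrape_to_csv(pattern, out_path):
    """Write matching rows from every file to one CSV, file name first."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):  # sorted keeps game order
            name = os.path.basename(path)
            for row in rows_from_file(path):
                writer.writerow([name] + row)
```

Calling scrape_to_csv("/home/phi/Data/NHL/pl07-08/PL02*.HTM", "shooting.csv") would process all the files in one pass, so there should be no need to work in batches of 100 or 250. The first column of each output row is the source file name, which can become its own column in the database.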

-柠檬树下少年和吉他 2024-07-30 02:02:53

MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:

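import glob

# match the upper-case .HTM names from the question (glob is case-sensitive on Linux)
for file_name in sorted(glob.glob('/home/phi/Data/NHL/pl07-08/PL02*.HTM')):
    # read the file and then parse with BeautifulSoup
    html = open(file_name).read()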

I've found both the os and glob imports to be really useful for running through files in a directory.

Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
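As a small illustration of reusing file_name for the output, something along these lines should work. The helper name `output_name` and the output directory in the example are made up, and the sketch only relies on `os.path` from the standard library.

```python
import os


def output_name(input_path, out_dir, suffix=".csv"):
    """Build an output path that keeps the input file's base name.

    The directory and the original extension are dropped, then the
    new suffix is appended and the result is placed in out_dir.
    """
    stem, _ext = os.path.splitext(os.path.basename(input_path))
    return os.path.join(out_dir, stem + suffix)
```

With this, an input such as /home/phi/Data/NHL/pl07-08/PL020001.HTM and a hypothetical output directory /home/phi/Data/csv would map to /home/phi/Data/csv/PL020001.csv, so each output file can be traced back to its source.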
