Using urllib to import a column-formatted text file in which one row overflows its columns
I'm trying to use urllib to parse a text file from a website and pull in its data. I have handled other files like it, which are text formatted in columns, but this one is throwing me because the line for Southern Illinois-Edwardsville pushes the second score and the location out of their columns.
import urllib

# Python 2 API; in Python 3 the equivalent call is urllib.request.urlopen.
file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')
for line in file:
    # Each field is read from a hard-coded column range of the line.
    game_month = line[0:1].rstrip()
    game_day = line[2:4].rstrip()
    game_year = line[5:9].rstrip()
    team1 = line[11:37].rstrip()
    team1_scr = line[38:40].rstrip()
    team2 = line[42:68].rstrip()
    team2_scr = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()
The Southern Illinois-Edwardsville line imports 'il' as team2_scr and imports ' 4 @Central Arkansas' as the extra_info.
Want to see the best solution? http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch will give you a nice CSV file, no dark magic needed.
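A minimal sketch of reading that CSV output, assuming Python 3. The helper names read_csv_rows and fetch_scores are hypothetical, the UTF-8 decoding is an assumption about the server's encoding, and the column layout of the CSV is not documented here, so rows are returned as plain lists.

```python
import csv
import io
from urllib.request import urlopen

# URL taken from the answer above (format=CSV instead of format=Text).
CSV_URL = ("http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all"
           "&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch")


def read_csv_rows(text):
    """Parse CSV text into a list of rows, each row a list of strings."""
    return list(csv.reader(io.StringIO(text, newline="")))


def fetch_scores(url=CSV_URL):
    """Download the CSV and parse it (hypothetical helper; UTF-8 is assumed)."""
    with urlopen(url) as response:
        text = response.read().decode("utf-8", errors="replace")
    return read_csv_rows(text)
```

Calling fetch_scores() performs the download; read_csv_rows() can be tried on any CSV string without network access.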
Do you want something like this:
Output:
Clearly you just need to split on multiple spaces. Unfortunately the csv module only allows a single-character delimiter, but re.sub can help. I would recommend something like this:
This produces results like this:
Or if you prefer, just use a csv.reader and get lists rather than dicts:
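The snippets for this answer are not shown, so here is a rough sketch of the idea it describes, assuming Python 3: re.sub collapses each run of two or more spaces into a single tab, which the csv module then accepts as its delimiter. The sample rows are hypothetical (only the Southern Illinois-Edwardsville fields come from the question), and the field name date is an assumption.

```python
import csv
import re

# Field names follow the question's variables; "date" is an assumed name.
FIELDS = ["date", "team1", "team1_scr", "team2", "team2_scr", "extra_info"]

# Hypothetical rows whose columns are separated by at least two spaces.
raw_lines = [
    "02/18/2011  Team A  5  Southern Illinois-Edwardsville  4  @Central Arkansas",
    "02/19/2011  Team B  3  Team C  10  @Team C",
]


def to_tabbed(line):
    """Replace every run of two or more spaces with one tab."""
    return re.sub(r" {2,}", "\t", line.strip())


# DictReader yields one dict per row, keyed by FIELDS.
dict_rows = list(csv.DictReader(map(to_tabbed, raw_lines),
                                fieldnames=FIELDS, delimiter="\t"))

# csv.reader yields one list per row instead.
list_rows = list(csv.reader(map(to_tabbed, raw_lines), delimiter="\t"))
```

This only works when every column boundary is at least two spaces wide; a row where a long team name is followed by a single space would still be split incorrectly.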
Say that s contains one row of your table. Then you could use the split() method of the re (regular expressions) library:
...and cols is now a list of strings, each one a column in your table row. This assumes that table columns are separated by at least two spaces, and nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.
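A small sketch of what this describes, assuming Python 3. The pattern is one plausible choice for "at least two spaces", the name COLUMN_GAP is hypothetical, and the row in s is made up for illustration rather than copied from the real file.

```python
import re

# Two or more whitespace characters mark a column boundary (assumption).
COLUMN_GAP = re.compile(r"\s{2,}")

# Hypothetical row; only the last three fields come from the question.
s = "02/18/2011  Team A  5  Southern Illinois-Edwardsville  4  @Central Arkansas"

# strip() drops the trailing newline before splitting.
cols = COLUMN_GAP.split(s.strip())
print(cols)
```

Single spaces inside a team name, as in "Southern Illinois-Edwardsville", do not trigger a split, which is why the pattern requires at least two.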
Recall that Python considers a file a sequence of lines, separated by newline characters. Therefore, all you have to do is to for-loop over your file, applying .split() to each line.
For an even nicer solution, check out the built-in map() function and try using that instead of a for-loop.
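To make the for-loop and map() variants concrete, here is a hedged sketch assuming Python 3. io.StringIO stands in for the downloaded file, the rows are hypothetical, and split_row is an illustrative helper name rather than anything from the original answer.

```python
import io
import re

COLUMN_GAP = re.compile(r"\s{2,}")


def split_row(line):
    """Split one text row into columns on runs of 2+ whitespace characters."""
    return COLUMN_GAP.split(line.strip())


# Stand-in for the file object; iterating over it yields one line at a time.
sample = io.StringIO(
    "02/18/2011  Team A  5  Southern Illinois-Edwardsville  4  @Central Arkansas\n"
    "02/19/2011  Team B  3  Team C  10  @Team C\n"
)

# For-loop version.
rows_loop = []
for line in sample:
    rows_loop.append(split_row(line))

# map() version; rewind first, and wrap in list() because map() is lazy in Python 3.
sample.seek(0)
rows_map = list(map(split_row, sample))
```

Both versions produce the same list of rows; map() simply avoids the explicit accumulator.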