Using urllib to import a formatted text file with a row that overflows its columns

Posted 2024-11-16 01:49:17

I'm trying to use urllib to fetch a text file from a website and pull the data out of it. I have handled other files like this one, where the text is formatted in fixed-width columns, but this one is throwing me: the line for Southern Illinois-Edwardsville pushes the second score and the location out of their columns.

file = urllib.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

for line in file:
    game_month = line[0:1].rstrip()
    game_day   = line[2:4].rstrip()
    game_year  = line[5:9].rstrip()
    team1      = line[11:37].rstrip()
    team1_scr  = line[38:40].rstrip()
    team2      = line[42:68].rstrip()
    team2_scor = line[68:70].rstrip()
    extra_info = line[72:100].rstrip()

The Southern Illinois-Edwardsville line imports 'il' as team2_scor and ' 4 @Central Arkansas' as extra_info.

独行侠 2024-11-23 01:49:17

Want to see the best solution? http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=CSV&submit=Fetch will give you a nice CSV file, no black magic required.
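
For reference, here is a minimal Python 3 sketch of reading such CSV output with the standard csv module. The sample rows and their field order are assumptions made up for illustration (they mirror the text format), and `read_games` is a hypothetical helper name; check the real download before relying on the column positions.

```python
import csv
import io

# Made-up sample standing in for the downloaded CSV text. The real column
# order is an assumption and should be checked against the actual file.
SAMPLE_CSV = (
    "2/18/2011,Central Arkansas,5,Southern Illinois-Edwardsville,4,@Central Arkansas\n"
    "2/18/2011,Central Florida,11,Siena,1,@Central Florida\n"
)


def read_games(csv_text):
    """Return one list of fields per non-empty CSV line."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if row]


games = read_games(SAMPLE_CSV)
for game in games:
    print(game)
```

Because the fields are comma-separated, a long team name can no longer push later fields out of position.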

暮年 2024-11-23 01:49:17

Do you want something like this?

def get_row(row):
    # Split on whitespace; the two tokens that parse as integers are the scores.
    row = row.split()
    num_pos = []
    for i in range(len(row)):
        try:
            int(row[i])
            num_pos.append(i)
        except ValueError:
            pass
    # Exactly two scores are expected on every line.
    assert len(num_pos) == 2
    ans = []
    ans.append(row[0])                                      # date
    # Rejoin with a space so multi-word team names keep their spacing.
    ans.append(" ".join(row[1:num_pos[0]]))                 # team 1
    ans.append(int(row[num_pos[0]]))                        # team 1 score
    ans.append(" ".join(row[num_pos[0] + 1:num_pos[1]]))    # team 2
    ans.append(int(row[num_pos[1]]))                        # team 2 score
    ans.append(" ".join(row[num_pos[1] + 1:]))              # location
    return ans


row1 = "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas"
row2 = "2/18/2011  Central Florida           11  Siena                      1  @Central Florida"

print(get_row(row1))
print(get_row(row2))

Output:

['2/18/2011', 'Central Arkansas', 5, 'Southern Illinois-Edwardsville', 4, '@Central Arkansas']
['2/18/2011', 'Central Florida', 11, 'Siena', 1, '@Central Florida']

哭泣的笑容 2024-11-23 01:49:17

Clearly you just need to split on multiple spaces. Unfortunately the csv module only allows a single-character delimiter, but re.sub can help. I would recommend something like this:

import urllib2
import csv
import re

u = urllib2.urlopen('http://www.boydsworld.com/cgi/scores.pl?team1=all&team2=all&firstyear=2011&lastyear=2011&format=Text&submit=Fetch')

reader = csv.DictReader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t', fieldnames=('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info'))

for i, row in enumerate(reader):
    if i == 5: break  # Only do five (otherwise you don't need ``enumerate()``)
    print row

This produces results like this:

{'team1': 'Air Force', 'team2': 'Missouri State', 'date': '2/18/2011', 'team2_score': '2', 'team1_score': '7', 'extra_info': '@neutral'}
{'team1': 'Akron', 'team2': 'Lamar', 'date': '2/18/2011', 'team2_score': '1', 'team1_score': '2', 'extra_info': '@neutral'}
{'team1': 'Alabama', 'team2': 'Alcorn State', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '11', 'extra_info': '@Alabama'}
{'team1': 'Alabama State', 'team2': 'Tuskegee', 'date': '2/18/2011', 'team2_score': '5', 'team1_score': '9', 'extra_info': '@Alabama State'}
{'team1': 'Appalachian State', 'team2': 'Maryland-Eastern Shore', 'date': '2/18/2011', 'team2_score': '0', 'team1_score': '4', 'extra_info': '@Appalachian State'}

Or, if you prefer, just use csv.reader and get lists rather than dicts:

reader = csv.reader((re.sub(' {2,}', '\t', line) for line in u), delimiter='\t')

print reader.next()
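
The code above is written for Python 2 (urllib2, the print statement, reader.next()). As a rough Python 3 sketch of the same idea, the version below applies the re.sub-then-csv approach to any iterable of text lines. The helper name `parse_scores` and the two sample lines are assumptions added for illustration; lines fetched with urllib.request.urlopen arrive as bytes and would need decoding first.

```python
import csv
import re

FIELDNAMES = ('date', 'team1', 'team1_score', 'team2', 'team2_score', 'extra_info')


def parse_scores(lines):
    """Parse space-aligned score lines into one dict per game."""
    # Turn every run of two or more spaces into a single tab so csv can
    # split on it; single spaces inside team names are left untouched.
    tabbed = (re.sub(' {2,}', '\t', line.rstrip('\r\n')) for line in lines)
    reader = csv.DictReader(tabbed, delimiter='\t', fieldnames=FIELDNAMES)
    return [dict(row) for row in reader]


# Sample lines standing in for the downloaded text file.
sample = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]

games = parse_scores(sample)
for game in games:
    print(game)
```

Note that this still relies on every column being separated by at least two spaces; a line where a long name leaves only a single space before the score would not split correctly.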

懵少女 2024-11-23 01:49:17

Say that s contains one row of your table. Then you could use the split() method of the re (regular expressions) library:

import re
rexp = re.compile('  +')  # Match two or more spaces
cols = rexp.split(s)

...and cols is now a list of strings, each a column in your table row. This assumes that table columns are separated by at least two spaces, and nothing else. If that is not the case, the argument to re.compile() can be edited to allow for other configurations.

Recall that Python treats a file as a sequence of lines separated by newline characters. All you have to do, then, is for-loop over your file and apply rexp.split() to each line (strip the trailing newline first, or it stays attached to the last column).

For an even nicer solution, check out the built-in map() function and try using that instead of a for-loop.
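
Both variants might look roughly like the Python 3 sketch below. The helper name `split_columns` and the sample lines are assumptions added for illustration; any iterable of lines, such as an open file, could be used in place of the list.

```python
import re

rexp = re.compile('  +')  # Match two or more spaces


def split_columns(line):
    """Split one table row into columns on runs of two or more spaces."""
    # strip() removes the trailing newline and any leading padding first.
    return rexp.split(line.strip())


# Sample lines standing in for the lines of the downloaded file.
lines = [
    "2/18/2011  Central Arkansas           5  Southern Illinois-Edwardsville  4  @Central Arkansas\n",
    "2/18/2011  Central Florida           11  Siena                      1  @Central Florida\n",
]

# Variant 1: an explicit for-loop.
rows_loop = []
for line in lines:
    rows_loop.append(split_columns(line))

# Variant 2: the built-in map(); list() materializes the lazy result.
rows_map = list(map(split_columns, lines))

print(rows_map[0])
```

The two variants produce the same rows; map() just expresses "apply this function to every line" without the bookkeeping.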
