Python读取csv问题：额外的空格

发布于 2024-10-25 03:23:34 字数 1920 浏览 1 评论 0原文

当我尝试解析由 MS SQL 2005 Express 版本的查询导出的 csv 时，python 给出的字符串完全出乎我的意料。例如，如果 csv 文件中的行是：" aaa,bbb,ccc,dddd”，然后当 python 将其解析为字符串时，它变成：“aaaa , bbb , ccc, ddd d”之类的东西......会发生什么？？？
我尝试删除代码中的空格但不起作用。

import os
import random


f1 = open('a.txt', 'r')
f2 = open('dec_sql.txt', 'w')
text = 'abc'

while(text  != ''):

    text = f1.readline()
    if(text==''):
        break

    splited = text.split(',')
    for i in range (0, 32):
        splited[i] = splited[i].replace(' ', '')

    sql = 'insert into dbo.INBOUND_RATED_DEC2010 values ('
    sql += '\'' + splited[0] + '\', '
    sql += '\'' + splited[1] + '\', '
    sql += '\'' + splited[2] + '\', '
    sql += '\'' + splited[3] + '\', '
    sql += '\'' + splited[4] + '\', '
    sql += '\'' + splited[5] + '\', '
    sql += '\'' + splited[6] + '\', '
    sql += '\'' + splited[7] + '\', '
    sql += '\'' + splited[8] + '\', '
    sql += '\'' + splited[9] + '\', '
    sql += '\'' + splited[10] + '\', '
    sql += '\'' + splited[11] + '\', '
    sql += '\'' + splited[12] + '\', '
    sql += '\'' + splited[13] + '\', '
    sql += '\'' + splited[14] + '\', '
    sql += '\'' + splited[15] + '\', '
    sql += '\'' + splited[16] + '\', '
    sql += '\'' + splited[17] + '\', '
    sql += '\'' + splited[18] + '\', '
    sql += '\'' + splited[19] + '\', '
    sql += '\'' + splited[20] + '\', '
    sql += '\'' + splited[21] + '\', '
    sql += '\'' + splited[22] + '\', '
    sql += '\'' + splited[23] + '\', '
    sql += '\'' + splited[24] + '\', '
    sql += '\'' + splited[25] + '\', '
    sql += '\'' + splited[26] + '\', '
    sql += '\'' + splited[27] + '\', '
    sql += '\'' + splited[28] + '\', '
    sql += '\'' + splited[29] + '\', '
    sql += '\'' + splited[30] + '\', '
    sql += '\'' + splited[31] + '\', '
    sql += '\'' + splited[32] + '\' '

    sql += ')'

    print sql
    f2.write(sql+'\n')


f2.close()
f1.close()

原文

When I tried to parse a csv which was exported by MS SQL 2005 express edition's query, the string python gives me is totally unexpected. For example if the line in the csv file is :"
aaa,bbb,ccc,dddd", then when python parsed it as string, it becomes :" a a a a , b b b , c c c, d d d d" something like that.....What happens???
I tried to remove the space in the code but don't work.

import os
import random


f1 = open('a.txt', 'r')
f2 = open('dec_sql.txt', 'w')
text = 'abc'

while(text  != ''):

    text = f1.readline()
    if(text==''):
        break

    splited = text.split(',')
    for i in range (0, 32):
        splited[i] = splited[i].replace(' ', '')

    sql = 'insert into dbo.INBOUND_RATED_DEC2010 values ('
    sql += '\'' + splited[0] + '\', '
    sql += '\'' + splited[1] + '\', '
    sql += '\'' + splited[2] + '\', '
    sql += '\'' + splited[3] + '\', '
    sql += '\'' + splited[4] + '\', '
    sql += '\'' + splited[5] + '\', '
    sql += '\'' + splited[6] + '\', '
    sql += '\'' + splited[7] + '\', '
    sql += '\'' + splited[8] + '\', '
    sql += '\'' + splited[9] + '\', '
    sql += '\'' + splited[10] + '\', '
    sql += '\'' + splited[11] + '\', '
    sql += '\'' + splited[12] + '\', '
    sql += '\'' + splited[13] + '\', '
    sql += '\'' + splited[14] + '\', '
    sql += '\'' + splited[15] + '\', '
    sql += '\'' + splited[16] + '\', '
    sql += '\'' + splited[17] + '\', '
    sql += '\'' + splited[18] + '\', '
    sql += '\'' + splited[19] + '\', '
    sql += '\'' + splited[20] + '\', '
    sql += '\'' + splited[21] + '\', '
    sql += '\'' + splited[22] + '\', '
    sql += '\'' + splited[23] + '\', '
    sql += '\'' + splited[24] + '\', '
    sql += '\'' + splited[25] + '\', '
    sql += '\'' + splited[26] + '\', '
    sql += '\'' + splited[27] + '\', '
    sql += '\'' + splited[28] + '\', '
    sql += '\'' + splited[29] + '\', '
    sql += '\'' + splited[30] + '\', '
    sql += '\'' + splited[31] + '\', '
    sql += '\'' + splited[32] + '\' '

    sql += ')'

    print sql
    f2.write(sql+'\n')


f2.close()
f1.close()

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

恰似旧人归 2024-11-01 03:23:34

在我看来，MS SQL 2005 查询的输出是一个 unicode 文件。 python csv 模块无法处理unicode文件，但是有一些csv 模块文档中的示例代码描述了如何解决该问题。

或者，某些文本编辑器允许您使用不同的编码保存文件。例如，我在 Notepad++ 中打开 MS SQL 2005 查询的结果，它告诉我该文件是 UCS-2 编码的，我可以从“编码”菜单将其转换为 UTF-8。

回复收藏 0 原文

无需解释 2024-11-01 03:23:34

尝试用记事本打开该文件，并使用替换全部功能将 ' ' 替换为 ''

回复收藏 0 原文

春夜浅 2024-11-01 03:23:34

您的文件很可能是用 2 字节字符编码进行编码的 - 最有可能是 utf-16 （但它可能是其他编码。

为了让 CSV 正确读取它，您可以使用编解码器打开它，以便在读取时对其进行解码- 这样做你的 python 程序中有 Unicode 对象（不是字符串对象）

所以，不要使用 Use 打开文件

my_file = open ("data.dat", "rt")

：
导入编解码器

my_file = codecs.open("data.dat", "rt", "utf-16")

然后将其提供给 CSV 模块，使用：

import csv
读者 = csv.reader(my_file)
第一行=假
对于读卡器中的行：
if first_line: #跳过标题行
第一行=真
继续
#assemble sql 查询并发出它

另一件事是，在编程时将“查询”构造成 32 行重复代码是一件好事。即使在缺乏丰富字符串处理功能的语言中，也有更好的方法来做到这一点，但在 Python 中，您可以简单地执行以下操作：

sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % ", ".join("'%s'" % value for value in splited )

而不是用那 33 行来组装查询。（我告诉它在里面插入一个字符串
第一个字符串上的括号。在 % 运算符之后，字符串“,”与“join”方法一起使用，以便将作为参数传递给 join 的序列上的所有元素粘贴在一起。该序列由一个字符串组成，其中包含一个用单引号括起来的值，用于分割数组中的每个值。

Your file is most likely encoded with a 2byte character encoding - most likely utf-16 (but it culd be some other encoding.

To get the CSV proper reading it, you'd open it with a codec so that it is decoded as its read - doing that you have Unicode objects (not string objects) inside your python program.

So, instead of opening the file with

my_file = open ("data.dat", "rt")

Use:
import codecs

my_file = codecs.open("data.dat", "rt", "utf-16")

And then feed this to the CSV module, with:

import csv
reader = csv.reader(my_file)
first_line = False
for line in reader:
if first_line: #skips header line
first_line = True
continue
#assemble sql query and issue it

Another thing is that your "query" being constructed into 32 lines of repetitive code is a nice thing to do when programing. Even in languages that lack rich string processing facilities, there are better ways to do it, but in Python, you can simply do:

sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % ", ".join("'%s'" % value for value in splited )

Instead of those 33 lines assembling your query. (I am telling it to insert a string inside
the parentheses on the first string. After the %operator, the string ", " is used with the "join" method so that it is used to paste together all elements on the sequence passed as a parameter to join. This sequence is made of a string, containing a value enclosed inside single quotes for each value in your splited array.

回复收藏 0 原文

剩一世无双 2024-11-01 03:23:34

使用 Python 内置的 CSV 阅读器可能会有所帮助。看起来像是 unicode 的问题，这个问题让我很沮丧。

import tkFileDialog
import csv

ENCODING_REGEX_REPLACEMENT_LIST = [(re.compile('\xe2\x80\x99'), "'"),
                                   (re.compile('\xe2\x80\x94'), "--"),
                                   (re.compile('\xe2\x80\x9c'), '"'),
                                   (re.compile('\xe2\x80\x9d'), '"'),
                                   (re.compile('\xe2\x80\xa6'), '...')]

def correct_encoding(csv_row):
    for key in csv_row.keys():
        # if there is a value for the current key
        if csv_row[key]:
            try:
                csv_row[key] = unicode(csv_row[key], errors='strict')
            except ValueError:
                # we have a bad encoding, try iterating through all the known
                # bad encodings in the ENCODING_REGEX_REPLACEMENT and replace
                # everything and then try again
                for (regex, replacement) in ENCODING_REGEX_REPLACEMENT_LIST:
                    csv_row[key] = regex.sub(replacement,csv_row[key])
                print(csv_row)
                csv_row[key] = unicode(csv_row[key])
        # if there is NOT a value for the current key
        else:
            csv_row[key] = unicode('')
    return csv_row

filename = tkFileDialog.askopenfilename()
csv_reader = csv.DictReader(open(filename, "rb"), dialect='excel') # assuming similar dialect
for csv_row in csv_reader:
    csv_row = correct_encoding(csv_row)
    # your application logic here

It may help to use Python's built in CSV reader. Looks like an issue with unicode, a problem that frustrated me a lot.

import tkFileDialog
import csv

ENCODING_REGEX_REPLACEMENT_LIST = [(re.compile('\xe2\x80\x99'), "'"),
                                   (re.compile('\xe2\x80\x94'), "--"),
                                   (re.compile('\xe2\x80\x9c'), '"'),
                                   (re.compile('\xe2\x80\x9d'), '"'),
                                   (re.compile('\xe2\x80\xa6'), '...')]

def correct_encoding(csv_row):
    for key in csv_row.keys():
        # if there is a value for the current key
        if csv_row[key]:
            try:
                csv_row[key] = unicode(csv_row[key], errors='strict')
            except ValueError:
                # we have a bad encoding, try iterating through all the known
                # bad encodings in the ENCODING_REGEX_REPLACEMENT and replace
                # everything and then try again
                for (regex, replacement) in ENCODING_REGEX_REPLACEMENT_LIST:
                    csv_row[key] = regex.sub(replacement,csv_row[key])
                print(csv_row)
                csv_row[key] = unicode(csv_row[key])
        # if there is NOT a value for the current key
        else:
            csv_row[key] = unicode('')
    return csv_row

filename = tkFileDialog.askopenfilename()
csv_reader = csv.DictReader(open(filename, "rb"), dialect='excel') # assuming similar dialect
for csv_row in csv_reader:
    csv_row = correct_encoding(csv_row)
    # your application logic here

回复收藏 0 原文

~没有更多了~