Parsing a CSV file using Python (to make a decision tree later)

Posted on 2024-08-30 12:05:47

First off, full disclosure: this is for a uni assignment, so I don't want to receive code. :) I'm looking more for approaches; I'm very new to Python, having read a book but not yet written any code.

The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).

The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:

# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14

At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:

Read in each line, character by character
If the character is not a comma or a space
    Append character to temporary string
If the character is a comma
    Append the temporary string to a list
    Empty string
Once a line has been read
    Create a dictionary using the header row as the key (somehow!)
    Append that dictionary to a list

However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.

Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?

七度光 2024-09-06 12:05:47

Python has some pretty powerful language constructs built in. You can read lines from a file like this:

with open(name_of_file, "r") as file:
    for line in file:
        pass  # process the line

You can use the str.split method to separate the line at commas, and str.strip to remove the surrounding whitespace. Python also has very powerful lists and dictionaries.
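As a minimal sketch of the split-and-strip approach on the sample format from the question (the helper name `parse_line` is hypothetical, and the sketch assumes no field contains a comma or quotes):

```python
def parse_line(line):
    """Split one comma-separated line and trim whitespace around each field."""
    return [field.strip() for field in line.split(",")]

header_line = "# Column1, Column2, Column3, Column4\n"
data_line = "Value01, Value02, Value03, Value04\n"

# Drop the leading '#' marker before splitting the header row.
header = parse_line(header_line.lstrip("#"))
values = parse_line(data_line)

# zip pairs each column name with the value in the same position,
# which gives the key-to-value mapping the question asks about.
record = dict(zip(header, values))
print(record)
```

Because the column names come from the file itself, nothing here is hard-coded to a particular CSV.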

To create a list, you simply use empty brackets like [], while to create an empty dictionary you use {}:

mylist = []  # Creates an empty list
mydict = {}  # Creates an empty dictionary

You can insert into the list using the .append() function, while you can use indexing subscripts to insert into the dictionary. For example, you can use mylist.append(5) to add 5 to the list, while you can use mydict[key]=value to associate the key key with the value value. To test whether a key is present in the dictionary, you can use the in keyword. For example:

if key in mydict:
    print("Present")
else:
    print("Absent")

To iterate over the contents of a list or dictionary, you can simply use a for-loop as in:

for val in mylist:
    pass  # do something with val

for key in mydict:
    pass  # do something with key or with mydict[key]

Since it is often necessary to have both the value and the index when iterating over a list, there is also a built-in function called enumerate that saves you the trouble of counting indices yourself:

for idx, val in enumerate(mylist):
    pass  # do something with val or with idx. Note that val == mylist[idx]

The code above is identical in function to:

idx=0
for val in mylist:
   # process val, idx
   idx += 1

You could also iterate over the indices if you so chose:

for idx in range(len(mylist)):
    pass  # do something with idx and possibly mylist[idx]

Also, you can get the number of elements in a list or the number of keys in a dictionary using len.

It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:

>>> list1 = list(range(10))
>>> list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list2 = [2*x for x in list1]
>>> list2
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
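For the "everyone return their values for Column1 and Column4 so I can count who has what" part of the question, a small sketch along the same lines might look like this. The records and their values are made up for illustration, and `collections.Counter` is one convenient choice rather than the only one:

```python
from collections import Counter

# Hypothetical records, one dictionary per CSV row.
records = [
    {"Column1": "yes", "Column4": "a"},
    {"Column1": "no", "Column4": "a"},
    {"Column1": "yes", "Column4": "b"},
]

# Pull one column out of every record with a list comprehension.
column1_values = [record["Column1"] for record in records]

# Counter tallies how often each value occurs.
counts = Counter(column1_values)
print(counts["yes"], counts["no"])

# Count combinations of two columns by using a tuple as the key.
pair_counts = Counter((r["Column1"], r["Column4"]) for r in records)
print(pair_counts[("yes", "a")])
```

Tallies like these are the raw material for the entropy calculations in ID3, since the column names are passed in as plain strings and never hard-coded.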

When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.

定格我的天空 2024-09-06 12:05:47

Example using the csv module from docs.python.org:

import csv
with open("some.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Instead of printing the rows, you could save each row into a list (called database here) and process it with ID3 later.

database.append(row)
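
Put together for the file format in the question, a sketch might look like the following. It uses `io.StringIO` as a stand-in for a real file so it is self-contained, and the way the `#` marker is removed is an assumption about how the header should be cleaned:

```python
import csv
import io

# Stand-in for an open file; the contents mirror the question's sample.
sample = io.StringIO(
    "# Column1, Column2, Column3, Column4\n"
    "Value01, Value02, Value03, Value04\n"
    "Value11, Value12, Value13, Value14\n"
)

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(sample, skipinitialspace=True)

# The first row is the header; strip the '#' marker from the first name.
header = next(reader)
header[0] = header[0].lstrip("# ")

# Every remaining row goes into the database list.
database = [row for row in reader]
print(header)
print(database)
```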

感受沵的脚步 2024-09-06 12:05:47

Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module or (2) reading the csv module's source (it's written in C). Just USE it!

送君千里 2024-09-06 12:05:47

Look at csv.DictReader.

Example:

import csv
with open('my_file.csv', newline='') as f:
    reader = csv.DictReader(f)
    for d in reader:
        print(d)  # each d is a dictionary keyed by the names in the file's first row
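
With the question's format, the first key would come out as "# Column1" unless the header is cleaned. One possible way to handle that, sketched here with `io.StringIO` standing in for the file, is to rewrite `fieldnames` before iterating; treating `#` as a marker to strip is an assumption about the assignment's files:

```python
import csv
import io

# Stand-in for an open file; the contents mirror the question's sample.
sample = io.StringIO(
    "# Column1, Column2, Column3, Column4\n"
    "Value01, Value02, Value03, Value04\n"
    "Value11, Value12, Value13, Value14\n"
)

# skipinitialspace=True drops the space that follows each comma.
reader = csv.DictReader(sample, skipinitialspace=True)

# Accessing fieldnames reads the header row; strip the '#' marker from it.
reader.fieldnames = [name.lstrip("# ") for name in reader.fieldnames]

# Each remaining row becomes a dictionary keyed by the cleaned column names.
records = list(reader)
print(records[0]["Column1"])
```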

五里雾 2024-09-06 12:05:47

Take a look at the built-in CSV module. Though you probably can't just use it, you can take a sneak peek at the code...

If that's a no-no, your (pseudo)code looks perfectly fine, though you should read the file line by line and use the str.split() function rather than going character by character.

醉生梦死 2024-09-06 12:05:47

Parse the CSV correctly

I'd avoid using str.split() to parse the fields because str.split() will not recognize quoted values. And many real-world CSV files use quotes.
http://en.wikipedia.org/wiki/Comma-separated_values

Example record using quoted values:

1997,Ford,E350,"Super, luxurious truck"

If you use str.split(), you'll get a record like this with 5 fields:

('1997', 'Ford', 'E350', '"Super', ' luxurious truck"')

But what you really want are records like this with 4 fields:

('1997', 'Ford', 'E350', 'Super, luxurious truck')
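
The difference is easy to check with the example record above. This sketch compares a plain `str.split(",")` with the standard library's `csv.reader`, which accepts any iterable of lines:

```python
import csv

line = '1997,Ford,E350,"Super, luxurious truck"'

# Naive splitting treats the comma inside the quotes as a separator.
naive = line.split(",")

# csv.reader understands the quotes and keeps the field together.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```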

Also, besides commas being in the data, you may have to deal with newlines "\r\n" or just "\n" in the data. For example:

1997,Ford,E350,"Super
luxurious truck"
1997,Ford,E250,"Ok? Truck"

So be careful using:

file = open('filename.csv', 'r')
for line in file:
    pass  # problem here: "line" may contain only part of a record
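
To illustrate the mismatch between physical lines and records, here is a small sketch using the example above, with `io.StringIO` standing in for the file. The standard `csv.reader` is used only to show what the correct grouping looks like:

```python
import csv
import io

data = io.StringIO(
    '1997,Ford,E350,"Super\nluxurious truck"\n'
    '1997,Ford,E250,"Ok? Truck"\n'
)

# Counting physical lines gives three, because one field spans two lines.
physical_lines = data.getvalue().splitlines()

# A quote-aware reader joins them back into two records.
records = list(csv.reader(data))

print(len(physical_lines), len(records))
print(repr(records[0][3]))
```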

Also, as John mentioned, the CSV convention is that inside a quoted field, two consecutive double quotes stand for one literal quote character:

1997,Ford,E350,"Super ""luxurious"" truck"

('1997', 'Ford', 'E350', 'Super "luxurious" truck')

So I'd suggest modifying your finite state machine like this:

  • Parse one character at a time.
  • If the character is a quote, set the state to "in quote".
  • If "in quote", store all the characters in the current field until there's another quote.
  • If "in quote" and the quote is immediately followed by another quote, store a single quote character in the field data. (A doubled quote is not the end of the field; an empty field is normally written `data,,data` rather than `data,"",data`.)
  • If not "in quote", store the characters until you find a comma or newline.
  • If comma, save field and start a new field.
  • If newline, save field, save record, start a new record and a new field.
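
The steps above can be sketched roughly as follows. The function name `parse_csv` is hypothetical, and this is a simplified illustration rather than a full CSV implementation: it drops carriage returns outside quotes and does no error checking for unbalanced quotes.

```python
def parse_csv(text):
    """Parse CSV text one character at a time, honouring double quotes."""
    records, record, field = [], [], []
    in_quote = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quote:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')   # "" inside quotes is a literal quote
                    i += 1              # skip the second quote of the pair
                else:
                    in_quote = False    # closing quote
            else:
                field.append(ch)        # commas and newlines are plain data here
        elif ch == '"':
            in_quote = True
        elif ch == ",":
            record.append("".join(field))   # save field, start a new one
            field = []
        elif ch == "\n":
            record.append("".join(field))   # save field and record
            records.append(record)
            record, field = [], []
        elif ch != "\r":
            field.append(ch)
        i += 1
    # Flush a final record that has no trailing newline.
    if field or record:
        record.append("".join(field))
        records.append(record)
    return records


print(parse_csv('1997,Ford,E350,"Super ""luxurious"" truck"\n'))
```

Unlike the pseudocode in the question, this keeps spaces inside fields; stripping them afterwards with `str.strip` would handle the `Value01, Value02` style of the sample file.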

On a side note, interestingly, I've never seen a header commented out using # in a CSV. So to me, that would imply that you may have to look for commented lines in the data too. Using # to comment out a line in a CSV file is not standard.

Adding found fields into a record dictionary using header keys

Depending on memory requirements, if the CSV is small enough (maybe 10k to 100k records), using a dictionary is fine. Just store a list of all the column names so you can access the column name by index (or number). Then in the finite state machine, increment the column index when you find a comma, and reset to 0 when you find a newline.

So if your header is header = ['Column1', 'Column2'], then when you find a data character, add it like this:

record[header[column_index]] += character
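
Note that `+=` on a key that does not exist yet raises a KeyError, so each record needs its keys created first. A minimal sketch of the column-index bookkeeping, with quoting left out for brevity and made-up sample text:

```python
header = ["Column1", "Column2"]
text = "ab,cd\nef,gh\n"

records = []
record = {name: "" for name in header}   # every key starts as an empty string
column_index = 0

for character in text:
    if character == ",":
        column_index += 1                # move on to the next column
    elif character == "\n":
        records.append(record)           # record finished
        record = {name: "" for name in header}
        column_index = 0                 # back to the first column
    else:
        record[header[column_index]] += character

print(records)
```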

浮生未歇 2024-09-06 12:05:47

I don't know too much about the built-in csv module that @Kaloyan Todorov talks about, but if you're reading comma-separated lines, you can easily do this:

for line in file:
    columns = line.split(',')
    for column in columns:
        print(column.strip())

This will print all the entries of each line without leading and trailing whitespace.
