Parsing a CSV file with Python (to build a decision tree later)
First off, full disclosure: this is going towards a uni assignment, so I don't want to receive code. :) I'm more looking for approaches; I'm very new to Python, having read a book but not yet written any code.
The entire task is to import the contents of a CSV file, create a decision tree from the contents of the CSV file (using the ID3 algorithm), and then parse a second CSV file to run against the tree. There's a big (understandable) preference to have it capable of dealing with different CSV files (I asked if we were allowed to hard code the column names, mostly to eliminate it as a possibility, and the answer was no).
The CSV files are in a fairly standard format; the header row is marked with a # then the column names are displayed, and every row after that is a simple series of values. Example:
# Column1, Column2, Column3, Column4
Value01, Value02, Value03, Value04
Value11, Value12, Value13, Value14
At the moment, I'm trying to work out the first part: parsing the CSV. To make the decisions for the decision tree, a dictionary structure seems like it's going to be the most logical; so I was thinking of doing something along these lines:
Read in each line, character by character
    If the character is not a comma or a space
        Append character to temporary string
    If the character is a comma
        Append the temporary string to a list
        Empty string
    Once a line has been read
        Create a dictionary using the header row as the key (somehow!)
        Append that dictionary to a list
However, if I do things that way, I'm not sure how to make a mapping between the keys and the values. I'm also wondering whether there is some way to perform an action on every dictionary in a list, since I'll need to be doing things to the effect of "Everyone return their values for columns Column1 and Column4, so I can count up who has what!" - I assume that there is some mechanism, but I don't think I know how to do it.
Is a dictionary the best way to do it? Would I be better off doing things using some other data structure? If so, what?
Comments (7)
Python has some pretty powerful language constructs builtin. You can read lines from a file like:
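A minimal sketch of reading a file line by line. The temporary file and its contents are assumptions made so the snippet runs on its own; in the assignment you would open your own CSV path instead.

```python
import os
import tempfile

# Write a small sample file so the sketch is self-contained (hypothetical data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# Column1, Column2\nValue01, Value02\n")
    path = tmp.name

lines = []
with open(path) as csv_file:
    # Iterating over a file object yields one line at a time,
    # each still carrying its trailing newline.
    for line in csv_file:
        lines.append(line.rstrip("\n"))

os.remove(path)  # clean up the sample file
print(lines)
```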
You can use the `str.split` method to separate the line along commas, and `str.strip` to remove the whitespace around each piece. Python has very powerful lists and dictionaries.
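As a rough illustration, splitting one data line from the question's sample and stripping each piece might look like this (the variable names are made up for the example):

```python
line = "Value01, Value02, Value03, Value04\n"

# split(",") cuts the line at every comma; strip() removes the
# leading/trailing spaces and the newline from each piece.
fields = [field.strip() for field in line.split(",")]
print(fields)
```

Note that `strip()` only trims the ends of a string; whitespace in the middle of a value is left alone.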
To create a list, you simply use empty brackets like `[]`, while to create an empty dictionary you use `{}`.
You can insert into the list using the `.append()` function, while you can use indexing subscripts to insert into the dictionary. For example, you can use `mylist.append(5)` to add 5 to the list, while you can use `mydict[key] = value` to associate the key `key` with the value `value`. To test whether a key is present in the dictionary, you can use the `in` keyword. To iterate over the contents of a list or dictionary, you can simply use a for-loop.
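A small sketch tying these operations together; the keys and values are placeholders borrowed from the question's sample, not anything prescribed:

```python
mylist = []   # empty list
mydict = {}   # empty dictionary

mylist.append(5)                 # add 5 to the end of the list
mydict["Column1"] = "Value01"    # associate a key with a value

has_column1 = "Column1" in mydict   # True: the key exists
has_column9 = "Column9" in mydict   # False: no such key

# A for-loop over a list visits its items...
items_seen = []
for item in mylist:
    items_seen.append(item)

# ...while a for-loop over a dictionary visits its keys.
keys_seen = []
for key in mydict:
    keys_seen.append(key)
```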
Since, in many cases, it is necessary to have both the value and index when iterating over a list, there is also a builtin function called enumerate that saves you the trouble of counting indices yourself:
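For instance, with a hypothetical list of column names, `enumerate` hands back the index and the value together:

```python
names = ["Column1", "Column2", "Column3"]

pairs = []
for index, name in enumerate(names):
    # index counts up from 0 alongside each value
    pairs.append((index, name))
print(pairs)
```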
The code above is identical in function to:
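That is, an equivalent loop that keeps its own counter, sketched with the same hypothetical list of names:

```python
names = ["Column1", "Column2", "Column3"]

manual_pairs = []
index = 0
for name in names:
    manual_pairs.append((index, name))
    index += 1  # advance the counter by hand
```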
You could also iterate over the indices if you so chose:
Also, you can get the number of elements in a list or the number of keys in a dictionary using len.
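A brief sketch of index-based iteration using `len` and `range` (the data is made up for illustration):

```python
names = ["Column1", "Column2", "Column3"]
record = {"Column1": "Value01", "Column2": "Value02"}

by_index = []
for i in range(len(names)):   # i takes the values 0, 1, 2
    by_index.append(names[i])

number_of_names = len(names)   # elements in the list
number_of_keys = len(record)   # keys in the dictionary
```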
It is possible to perform an operation on each element of a dictionary or list via the use of list comprehension; however, I would recommend that you simply use for-loops to accomplish that task. But, as an example:
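One way this could look for the "everyone return your value for Column1" case from the question, assuming the rows are already stored as a list of dictionaries (the records here are invented):

```python
records = [
    {"Column1": "yes", "Column4": "a"},
    {"Column1": "no", "Column4": "b"},
    {"Column1": "yes", "Column4": "a"},
]

# List comprehension: pull one column out of every record.
column1_values = [record["Column1"] for record in records]

# The plain for-loop equivalent, counting how often each value occurs.
counts = {}
for record in records:
    value = record["Column1"]
    counts[value] = counts.get(value, 0) + 1
```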
When you have the time, I suggest you read the Python tutorial to get some more in-depth knowledge.
Example using the csv module from docs.python.org:
Instead of printing the rows, you could just save each row into a list, and then process it in the ID3 later.
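A minimal sketch of that idea with `csv.reader`, written for Python 3 and collecting rows into a list rather than printing them. The `io.StringIO` object stands in for an opened file, and the handling of the `#` header is an assumption based on the sample in the question:

```python
import csv
import io

# Stand-in for open("some.csv", newline=""); the data mirrors the question's sample.
sample = io.StringIO(
    "# Column1, Column2, Column3, Column4\n"
    "Value01, Value02, Value03, Value04\n"
    "Value11, Value12, Value13, Value14\n"
)

# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(sample, skipinitialspace=True)
rows = [row for row in reader]

# The first row is the header; remove the leading "# " marker from its names.
header = [name.lstrip("# ") for name in rows[0]]
data = rows[1:]
```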
Short answer: don't waste time and mental energy (1) reimplementing the built-in csv module (2) reading the csv module's source (it's written in C) -- just USE it!
Look at csv.DictReader.
Example:
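A hedged sketch of `csv.DictReader` applied to the question's format. Reading the `#` header line by hand and passing it as `fieldnames` is one possible way to cope with the marker, not something the module requires; the sample data is copied from the question:

```python
import csv
import io

sample = io.StringIO(
    "# Column1, Column2, Column3, Column4\n"
    "Value01, Value02, Value03, Value04\n"
    "Value11, Value12, Value13, Value14\n"
)

# Consume the header line ourselves so the "#" can be removed.
header_line = sample.readline()
fieldnames = [name.strip() for name in header_line.lstrip("#").split(",")]

# DictReader maps each remaining row onto the given field names.
reader = csv.DictReader(sample, fieldnames=fieldnames, skipinitialspace=True)
records = list(reader)

column4_values = [record["Column4"] for record in records]
```

Each element of `records` behaves like a dictionary keyed by column name, so nothing about the column names is hard-coded.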
Take a look at the built-in csv module. Though you probably can't just use it, you can take a sneak peek at the code...
If that's a no-no, your (pseudo)code looks perfectly fine, though you should make use of the str.split() function, reading the file line-by-line.
Parse the CSV correctly
I'd avoid using str.split() to parse the fields because str.split() will not recognize quoted values. And many real-world CSV files use quotes.
http://en.wikipedia.org/wiki/Comma-separated_values
Example record using quoted values:
If you use str.split(), you'll get a record like this with 5 fields:
But what you really want are records like this with 4 fields:
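A minimal sketch of the difference, using a made-up record (the values are assumptions, not the original example):

```python
import csv

record = 'Value01,"Value02, with comma",Value03,Value04'

# Naive splitting cuts inside the quoted value and keeps the quote characters.
naive_fields = record.split(",")

# csv.reader accepts any iterable of lines and honours the quotes.
proper_fields = next(csv.reader([record]))

print(len(naive_fields), naive_fields)
print(len(proper_fields), proper_fields)
```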
Also, besides commas being in the data, you may have to deal with newlines "\r\n" or just "\n" in the data. For example:
So be careful using:
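Assuming the warning is about line-based reading (iterating over the file, or splitting the text on newlines), here is a small sketch of the pitfall with invented data:

```python
import csv
import io

# One logical record whose middle field contains a newline inside quotes.
text = 'a,"line one\nline two",c\n'

# Line-based reading sees two physical lines and breaks the record apart.
naive_lines = text.splitlines()

# csv.reader keeps reading until the quoted field is closed.
rows = list(csv.reader(io.StringIO(text)))
```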
Also, as John mentioned, the CSV convention is that inside a quoted field, two double quotes in a row stand for one literal double quote.
So I'd suggest to modify your finite state machine like this:
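One possible shape for such a state machine, as a sketch only: it tracks a single piece of state (inside or outside quotes) and handles quoted commas, quoted newlines, and doubled quotes. The function name `parse_csv` is hypothetical, and the sketch deliberately ignores `#` comment lines and whitespace trimming.

```python
def parse_csv(text):
    """Parse CSV text into a list of records (each a list of field strings)."""
    records, record, field = [], [], []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(text) and text[i + 1] == '"':
                    field.append('"')   # doubled quote -> one literal quote
                    i += 1              # skip the second quote
                else:
                    in_quotes = False   # closing quote
            else:
                field.append(ch)        # commas and newlines are data here
        elif ch == '"':
            in_quotes = True            # opening quote
        elif ch == ",":
            record.append("".join(field))   # end of field
            field = []
        elif ch == "\n":
            record.append("".join(field))   # end of field and of record
            records.append(record)
            record, field = [], []
        elif ch != "\r":
            field.append(ch)            # ordinary character ("\r" is dropped)
        i += 1
    if field or record:                 # last record without a final newline
        record.append("".join(field))
        records.append(record)
    return records


parsed = parse_csv('x,"say ""hi""",z\r\n1,"two\nlines",3')
```

If the standard library is allowed, the `csv` module already implements these rules; a hand-written machine like this is mainly useful when the assignment forbids it.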
On a side note, interestingly, I've never seen a header commented out using # in a CSV. So to me, that would imply that you may have to look for commented lines in the data too. Using # to comment out a line in a CSV file is not standard.
Adding found fields into a record dictionary using header keys
Depending on memory requirements, if the CSV is small enough (maybe 10k to 100k records), using a dictionary is fine. Just store a list of all the column names so you can access the column name by index (or number). Then in the finite state machine, increment the column index when you find a comma, and reset to 0 when you find a newline.
So if your header is
header = ['Column1', 'Column2']
Then when you find a data character, add it like this:
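A rough sketch of that bookkeeping, leaving out quote handling to keep it short. The names `column_index`, `record` and `records` are illustrative assumptions:

```python
header = ["Column1", "Column2"]
text = "Value01,Value02\nValue11,Value12\n"

records = []
record = {name: "" for name in header}   # start each record with empty fields
column_index = 0

for ch in text:
    if ch == ",":
        column_index += 1                # move on to the next column
    elif ch == "\n":
        records.append(record)           # record finished
        record = {name: "" for name in header}
        column_index = 0                 # back to the first column
    else:
        # Data character: append it to the field for the current column.
        record[header[column_index]] += ch
```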
I don't know too much about the builtin csv module that @Kaloyan Todorov talks about, but, if you're reading comma separated lines, then you can easily do this:
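Something along these lines, as a sketch: `io.StringIO` stands in for an opened file, and the sample data is taken from the question.

```python
import io

# Stand-in for open("data.csv"); "data.csv" would be a hypothetical file name.
sample = io.StringIO("# Column1, Column2\nValue01, Value02\n")

all_entries = []
for line in sample:
    # Split on commas, then trim whitespace (and the newline) from each entry.
    entries = [entry.strip() for entry in line.split(",")]
    all_entries.append(entries)
    print(entries)
```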
This will print all the entries of each line without the leading and trailing whitespace.