Efficient way to analyze large amounts of data?
I need to analyze tens of thousands of lines of data. The data is imported from a text file, and each line has eight variables. Currently I use a class to define the data structure; as I read through the text file, I store each line object in a generic List.
I am wondering whether I should switch to a relational database (SQL), since I need to analyze the data in each line and relate it to definition terms, which I also currently store in generic lists (List).
The goal is to translate a large amount of data using definitions. I want the defined data to be filterable, searchable, and so on. The more I think about it, the more sense a database makes, but I would like to confirm with more experienced developers before I change things yet again (I was using structs and ArrayLists at first).
The only drawback I can think of is that the data does not need to be retained after the user has translated and viewed it. There is no need for permanent storage, so a database might be overkill.
Comments (7)
It is not absolutely necessary to go to a database. It depends on the actual size of the data and the processing you need to do. If you are loading the data into a List with a custom class, why not use LINQ to do your querying and filtering?
The real question is whether the data is so large that it cannot be loaded into memory comfortably. If that is the case, then yes, a database would be much simpler.
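As a rough illustration of the in-memory approach, here is a minimal sketch in Python rather than C# (in C# the same filtering would typically be written with LINQ's `Where` and `Select`). The records stay in a plain list and the definitions go into a dictionary keyed by term. The field names (`code`, `value`), the comma-separated line format, and the sample definitions are all hypothetical, since the question does not describe the eight variables.

```python
from dataclasses import dataclass


@dataclass
class Record:
    code: str     # hypothetical field: the term to look up in the definitions
    value: float  # hypothetical field: some measurement on the line


# Hypothetical definitions: term -> human-readable meaning.
DEFINITIONS = {"A1": "Temperature", "B2": "Pressure"}


def parse_line(line: str) -> Record:
    """Parse one 'code,value' line (an assumed format) into a Record."""
    code, value = line.strip().split(",")
    return Record(code, float(value))


def translate(records, definitions):
    """Pair each record's value with its definition; unknown codes pass through."""
    return [(definitions.get(r.code, r.code), r.value) for r in records]


def filter_by_meaning(records, definitions, meaning):
    """Keep only the records whose code maps to the given meaning."""
    return [r for r in records if definitions.get(r.code) == meaning]


lines = ["A1,20.5", "B2,101.3", "A1,21.0", "C3,7.0"]
records = [parse_line(line) for line in lines]
translated = translate(records, DEFINITIONS)
temperatures = filter_by_meaning(records, DEFINITIONS, "Temperature")
```

With tens of thousands of records, a dictionary lookup per record keeps the translation step linear in the number of lines, which is usually fast enough without any database.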
This is not a large amount of data. I don't see any reason to involve a database in your analysis.
There IS a query language built into C# -- LINQ. The original poster currently uses a list of objects, so there is really nothing left to do. It seems to me that a database in this situation would add far more heat than light.
It sounds like what you want is a database. SQLite supports in-memory databases (use ":memory:" as the filename). I suspect other databases have an in-memory mode as well.
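A minimal sketch of the in-memory idea, written in Python with the standard `sqlite3` module (from C# you would go through a SQLite provider instead). The table names, column names, and sample rows are hypothetical; only the ":memory:" filename comes from the answer above.

```python
import sqlite3

# ":memory:" creates a database that lives only in RAM and vanishes on close.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table for the imported lines, one for the definitions.
conn.execute("CREATE TABLE records (code TEXT, value REAL)")
conn.execute("CREATE TABLE definitions (code TEXT PRIMARY KEY, meaning TEXT)")

conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("A1", 20.5), ("B2", 101.3), ("A1", 21.0)],
)
conn.executemany(
    "INSERT INTO definitions VALUES (?, ?)",
    [("A1", "Temperature"), ("B2", "Pressure")],
)

# Relate each record to its definition, then filter on the translated term.
rows = conn.execute(
    "SELECT d.meaning, r.value "
    "FROM records r JOIN definitions d ON d.code = r.code "
    "WHERE d.meaning = ? ORDER BY r.value",
    ("Temperature",),
).fetchall()

conn.close()
```

Because nothing is written to disk, this fits the requirement that the data need not be retained once the user has viewed it.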
I faced the same problem at my previous company. I was looking for a solid solution for a large number of bar-code-generated files. Each one was a text file with thousands of records, and manipulating and presenting that data was difficult at first. What I ended up writing was a class that reads the file, loads the data into a data table, and can save it to a database. The database I used was SQL Server 2005. After that I could manage the saved data easily and present it any way I liked. The main point is to read the data from the file and save it to the database. If you do so, you will have a lot of options for manipulating and presenting it the way you like.
If you do not mind using Access, here is what you can do:
1. Attach a blank Access database as a resource.
2. When needed, write the database out to a file.
3. Run a CREATE TABLE statement that defines the columns of your data.
4. Import the data into the new table.
5. Use SQL to run your calculations.
6. OnClose, delete that Access database.
You can use a program like Resourcer to load the database into a resx file.
Then pull the resource out of the project in code: take the byte array and save it to a temp location with a temp filename.
In that code, "MyProject.blank_db" is the location and name of the resource file,
and "access.blank" is the tab given to the resource to save.
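As a rough illustration of the "save the byte array to a temp file, delete it on close" steps only, here is a Python sketch. It does not reproduce the resource-extraction code the answer refers to: `blank_db_bytes` is a hypothetical placeholder for the embedded blank database, and the `.mdb` suffix and helper names are assumptions.

```python
import atexit
import os
import tempfile


def write_temp_db(blob: bytes, suffix: str = ".mdb") -> str:
    """Write the bytes to a uniquely named temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(blob)
    return path


def remove_quietly(path: str) -> None:
    """Delete the file if it still exists; do nothing if it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


# Hypothetical stand-in for the byte array pulled out of the project resources.
blank_db_bytes = b"\x00\x01placeholder"

db_path = write_temp_db(blank_db_bytes)
# Rough equivalent of the OnClose step: clean up when the program exits.
atexit.register(remove_quietly, db_path)
```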
If the only thing you need to do is search and replace, you may consider using sed and awk, and you can do searches using grep. That assumes a Unix platform, of course.
From your description, I think Linux command-line tools can handle your data very well. Using a database may unnecessarily complicate your work. If you are using Windows, these tools are also available in several ways; I would recommend Cygwin. The following tools may cover your task: sort, grep, cut, awk, sed, join, paste.
These Unix/Linux command-line tools may look scary to a Windows person, but people love them for good reasons. The following are my reasons for loving them: