What data validation method is best for large datasets?
I have a large database and want to implement a feature that lets a user do a bulk update of information. The user downloads an Excel file, makes changes, and uploads it back to the system.
- The user uses a web interface (ASP.NET) to download the data from database to Excel.
- The user modifies the Excel file. Only certain data may be modified, because the rest maps back into the DB.
- Once the user is happy with their changes they upload the changed Excel file through the ASP.NET interface.
- Now it's the server's job to extract the data from the Excel file (using Gembox) and validate it against the database (this is where I'm having trouble).
- Validation results are shown on another ASP.NET page after validation is complete. Validation is soft, so hard fails only occur when, say, an index mapping into the DB is missing. (Missing data causes the row to be ignored, etc.)
- The user decides whether the proposed actions (Add, Modify, or Ignore) are appropriate. Once they accept, the system applies the changes.
Before applying the changes and/or additions the user has made, the data must be validated to catch user mistakes (for example, dates they deleted by accident).
It's not far-fetched for the number of rows that need updating to exceed 65k.
The question is: What is the best way to parse the data to do validation and to build up the change and addition sets?
If I load all the data that the Excel data must be validated against into memory, I might needlessly burden an already memory-hungry application. If I do a database hit for every tuple in the Excel file, I am looking at over 65k database hits.
Help?
Comments (4)
The approach I've seen used in the past is:
This works well for validating missing columns, valid key values etc. It's not so good for checking the format of individual fields (don't make SQL pull strings apart).
As we know, some folk feel uncomfortable putting business logic in the database, but this approach does limit the number of database hits your application makes, and avoids holding all the data in memory at once.
Your problem is very common in data warehouse systems, where bulk uploads and data cleansing are a core part of the regular work. I suggest you search for ETL (Extract, Transform, Load) and staging tables; you'll find a wealth of good material.
Broadly, if you do 'load the data into memory' for checking, you're effectively re-implementing part of the DB engine in your own code. That could be a good thing if it's faster and smarter to do so. For instance, you may only have a small range of valid dates for your Excel extract, so you don't need to join to a table to check that dates are in range. However, for other data such as foreign keys, let the DB do what it's good at.
Using a staging table/database/server is a common solution as the data volumes get large. BTW, allowing users to clean data in Excel is a really good idea, but allowing them to 'accidentally' remove crucial data is a really bad idea. Can you lock cells/columns to prevent this, and/or put some basic validation into Excel? If a field should be filled and should be a date, you can check that in a few lines of Excel. Your users will be happy, as they'll find problems before they upload.
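To make the split concrete, here is a minimal sketch of that idea, not a definitive implementation. It uses Python's built-in `sqlite3` as a stand-in for the real database, and the `customer` and `staging` tables, their columns, and the `validate_upload` helper are all hypothetical names made up for illustration. The key check runs as one set-based query against a staging table, while the date-range check stays in application code.

```python
import sqlite3
from datetime import date


def parse_date(text):
    """Return a date for an ISO 'YYYY-MM-DD' string, or None if blank or invalid."""
    try:
        return date.fromisoformat(text)
    except (TypeError, ValueError):
        return None


def validate_upload(conn, rows, earliest, latest):
    """Validate spreadsheet rows given as a list of (customer_id, due_date_text).

    Returns (missing_keys, bad_dates) as lists of 1-based row numbers.
    Assumes a hypothetical `customer` table with an `id` column already exists.
    """
    # Stage the upload so the database can check keys in one pass.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(row_no INTEGER, customer_id INTEGER, due_date TEXT)"
    )
    conn.execute("DELETE FROM staging")
    conn.executemany(
        "INSERT INTO staging VALUES (?, ?, ?)",
        [(n, cid, text) for n, (cid, text) in enumerate(rows, start=1)],
    )

    # Key check: a single query instead of one lookup per spreadsheet row.
    missing_keys = [
        row_no
        for (row_no,) in conn.execute(
            "SELECT s.row_no FROM staging s "
            "LEFT JOIN customer c ON c.id = s.customer_id "
            "WHERE c.id IS NULL ORDER BY s.row_no"
        )
    ]

    # Date check: the valid range is small and known, so no join is needed.
    bad_dates = []
    for n, (_, text) in enumerate(rows, start=1):
        parsed = parse_date(text)
        if parsed is None or not (earliest <= parsed <= latest):
            bad_dates.append(n)

    return missing_keys, bad_dates
```

With SQL Server you would more likely fill the staging table through a bulk-load API than through row inserts, but the shape of the validation query would be the same.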
To answer this properly the following information would be useful
First, bulk-upload the text file data into a temp table. Then retrieve it and validate it using the interface you built. After validation, store it in the main table or DB.
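A rough sketch of the last two steps, again using Python's `sqlite3` as a stand-in for the real database. The `orders` and `staging` tables, their columns, and both function names are hypothetical. It also assumes that a staged row with no matching `id` is an addition, that a blank `due_date` means "ignore", and that `orders.due_date` is never NULL; your own rules may differ.

```python
import sqlite3


def build_change_sets(conn):
    """Classify staged rows with set-based queries, not one lookup per row.

    Returns (adds, modifies) as lists of (id, due_date) tuples.
    Rows with a blank due_date, or identical to the main table, are ignored.
    """
    adds = conn.execute(
        "SELECT s.id, s.due_date FROM staging s "
        "LEFT JOIN orders o ON o.id = s.id "
        "WHERE o.id IS NULL AND s.due_date <> '' ORDER BY s.id"
    ).fetchall()
    modifies = conn.execute(
        "SELECT s.id, s.due_date FROM staging s "
        "JOIN orders o ON o.id = s.id "
        "WHERE s.due_date <> '' AND s.due_date <> o.due_date ORDER BY s.id"
    ).fetchall()
    return adds, modifies


def apply_change_sets(conn, adds, modifies):
    """Apply the accepted changes as a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.executemany("INSERT INTO orders (id, due_date) VALUES (?, ?)", adds)
        conn.executemany(
            "UPDATE orders SET due_date = ? WHERE id = ?",
            [(due, row_id) for row_id, due in modifies],
        )
```

The two lists returned by `build_change_sets` are what the review page would show the user; `apply_change_sets` would run only after they accept.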