How would you go about this data processing task?

Published 2024-07-07 08:42:57


Comments (10)

腻橙味 2024-07-14 08:42:57

Step 1: get the data into a DBMS that can handle the volume of data. Index appropriately.

Step 2: use SQL queries to determine the values of interest.

You'll still need to write a new query for each separate question you want answered. However, I think that is unavoidable, and it should save you from replicating the rest of the work.

Edited:

Note that although you probably can do a simple upload into a single table, you might well get better performance out of the queries if you normalize the data after loading it into the single table. This isn't completely trivial, but will likely reduce the volume of data. Making sure you have a good procedure (which will probably not be a stored procedure) for normalizing the data will help.
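
As a rough illustration of that normalization step (the weblog_staging table and user_agent column here are hypothetical, purely to show the shape of the process):

```sql
-- Pull the distinct repeated values out into a lookup (catalog) table.
CREATE TABLE user_agent (
    user_agent_id   INT IDENTITY(1,1) PRIMARY KEY,
    user_agent_text VARCHAR(500) NOT NULL UNIQUE
);

INSERT INTO user_agent (user_agent_text)
SELECT DISTINCT user_agent
FROM weblog_staging;

-- Swap the wide, repeated text column for a small foreign key.
ALTER TABLE weblog_staging ADD user_agent_id INT;
GO  -- new column must exist before the next statement compiles

UPDATE s
SET    s.user_agent_id = ua.user_agent_id
FROM   weblog_staging AS s
JOIN   user_agent     AS ua ON ua.user_agent_text = s.user_agent;

ALTER TABLE weblog_staging DROP COLUMN user_agent;
```

Repeating that for each highly repetitive column (URLs, referrers, and so on) is what actually shrinks the data volume.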

数理化全能战士 2024-07-14 08:42:57

Load the data into a table in a SQL Server (or any other mainstream db) database, and then write queries to generate the statistics you need. You would not need any tools other than the database itself and whatever UI is used to interact with the data (e.g. SQL Server Management Studio for SQL Server, TOAD or SqlDeveloper for Oracle, etc.).
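
For instance, once the rows are in a table, each statistic is just an aggregate query; the weblog table and column names below are made up for illustration:

```sql
-- Hits and distinct visitors per day for one month (hypothetical schema).
SELECT CAST(request_time AS DATE)  AS request_date,
       COUNT(*)                    AS hits,
       COUNT(DISTINCT client_ip)   AS distinct_visitors
FROM   weblog
WHERE  request_time >= '2009-06-01'
  AND  request_time <  '2009-07-01'
GROUP BY CAST(request_time AS DATE)
ORDER BY request_date;
```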

哑剧 2024-07-14 08:42:57

If you happen to use Windows, take a look at Log Parser. It can be found as a standalone download and is also included as part of the IIS Resource Kit.

Log Parser can read your logs and upload them to the Database.
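
Log Parser queries are written in a SQL-like dialect, so the extraction itself looks like ordinary SQL. A hypothetical example run against IIS W3C log files (the exact command-line switches for writing the output into a database table are in the Log Parser documentation):

```sql
-- Top 20 requested URLs straight out of the raw IIS logs.
SELECT TOP 20 cs-uri-stem AS url,
       COUNT(*) AS hits
FROM ex*.log
GROUP BY cs-uri-stem
ORDER BY hits DESC
```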


Database Considerations:

For your database server you will want something fast (Microsoft SQL Server, IBM's DB2, PostgreSQL or Oracle). MySQL might be useful too, but I have no experience using it with large databases.

You will want all the memory you can afford. If you will be using the database regularly, I'd say at least 4 GB. It can be done with less, but you WILL notice a big difference in performance.

Also, go for a multicore/multi-CPU server if you can afford it and, again, if you will be using this database regularly.

Another recommendation is to analyze the kind of queries you will be running and plan the indexes accordingly. Remember: every index you create will require additional storage space.

Of course, turn off indexing or even drop the indexes before massive data load operations. That will make the load much faster. Re-index or re-create the indexes after the data load operation.
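
On SQL Server, for example, that disable/rebuild cycle looks roughly like this (the table and index names are placeholders):

```sql
-- Disable a nonclustered index so the bulk load doesn't maintain it row by row.
ALTER INDEX ix_weblog_client_ip ON dbo.weblog DISABLE;

-- ... run the bulk load here ...

-- Rebuilding re-enables the index and builds it in a single pass over the loaded data.
ALTER INDEX ix_weblog_client_ip ON dbo.weblog REBUILD;
```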

Now, if this database will be an ongoing operation (i.e., not just something to investigate/analyze and then discard), you may want to design a database schema with catalog and detail tables. This is called database normalization, and the exact amount of normalization you will want depends on the usage pattern (data load operations versus query operations). An experienced DBA is a must if this database will be used on an ongoing basis and has performance requirements.


P.S.

I'll take the risk of including something obvious here, but...

I think you may be interested in a log analyzer. These are programs that generate statistics from web server log files (some can also analyze ftp, sftp and mail server log files).

Web log analyzers generate reports with the statistics. Usually the reports are generated as HTML files and include graphics. There is a fair variety in depth of analysis and options. Some are very customizable and some are not. You will find both commercial and open-source products.

For the amount of data you will be managing, double-check each candidate product and take a closer look at its speed and ability to handle that volume.

One thing to keep in mind when you're importing the data is to try to create indexes that will allow you to do the kinds of queries you want to do. Think about what sort of fields you will be querying on and what those queries might look like. That should help you decide what indexing you will need.
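
As a sketch of that thinking (names invented for illustration): if most of your questions filter on a date range and group by URL, shape the index the same way.

```sql
-- A typical question: hits per URL over a date range.
SELECT request_url, COUNT(*) AS hits
FROM   dbo.weblog
WHERE  request_time >= '2009-06-01'
  AND  request_time <  '2009-07-01'
GROUP BY request_url;

-- An index leading on the filtered column, with the grouped column included,
-- lets that query run without scanning the whole table.
CREATE INDEX ix_weblog_request_time
    ON dbo.weblog (request_time)
    INCLUDE (request_url);
```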

羞稚 2024-07-14 08:42:57

A 25 GB flat file. I don't think writing your own component to read this file is a good idea.

I would suggest that you go with a SQL import and get all the data into SQL Server. I agree that it may take ages to get this data into SQL Server, but once it is there you can do anything you want with it.

Hopefully, once you have this data in the database, from then on you will receive deltas of information rather than another 25 GB flat file.
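
One way to do that import on SQL Server is BULK INSERT; this is only a sketch, and the staging table, file path, and delimiters are assumptions about how the flat file is laid out:

```sql
-- Staging table shaped like the flat file's columns (illustrative).
CREATE TABLE dbo.weblog_staging (
    client_ip    VARCHAR(45),
    request_time DATETIME2,
    request_url  VARCHAR(2000),
    status_code  SMALLINT,
    bytes_sent   BIGINT,
    user_agent   VARCHAR(500)
);

-- Load the 25 GB file in one pass; BATCHSIZE commits in chunks so a failure
-- near the end does not roll back the entire load.
BULK INSERT dbo.weblog_staging
FROM 'C:\logs\access_log.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', BATCHSIZE = 100000, TABLOCK);
```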

长梦不多时 2024-07-14 08:42:57

You haven't said how the data in your flat file is organised. The RDBMS suggestions are sensible, but presume that your flat file is formatted in some delimited way and a db import is a relatively simple task. If that is not the case then you first have the daunting task of decompiling the data cleanly into a set of fields on which you can do your analysis.

I'm going to presume that your data is not a nice CSV or TXT file, since you haven't said either way and nobody else has answered this part of the possible problem.

If the data have a regular structure, even without nice clean field delimiters you may be able to turn an ETL tool onto the job, such as Informatica. Since you are a techy and this is a one-off job, you should definitely consider writing some code of your own which does some regex comparisons for extraction of the parts that you want and spits out a file which you can then load into a database. Either way you are going to have to invest some significant effort in parsing and cleansing your data, so don't think of this as an easy task.

If you do write your own code then I would suggest you choose a compiled language and make sure you process the data a single row at a time (or in a way that buffers the reads into manageable chunks).

Either way, you are going to have a pretty big job making sure that the results of any process you apply to the data have been consistently executed; you don't want IP addresses turning up as decimal numbers in your calculations. On data of that scale it can be hard to detect a fault like that.

Once you have parsed it then I think that an RDBMS is the right choice to store and analyse your data.

っ〆星空下的拥抱 2024-07-14 08:42:57

Is this a one-time thing or will you be processing things on a daily or weekly basis? Either way, check out vmarquez's answer; I've heard great things about logparser. Also check out http://awstats.sourceforge.net/, which is a full-fledged web stats application.

屋顶上的小猫咪 2024-07-14 08:42:57

SQL Server Analysis Services is designed for doing exactly that type of data analysis. The learning curve is a bit steep, but once you set up your schema you will be able to do any kind of cross-cutting queries that you want very quickly.

逆夏时光 2024-07-14 08:42:57

If you have more than one computer at your disposal, this is a perfect job for MapReduce.

滥情哥ㄟ 2024-07-14 08:42:57

Sounds like a job for Perl to me. Just keep counts of the stats you want. Use regexes to parse each line. It would probably take less than 10 minutes to parse a file of that size. My computer reads through a 2 GB file (13 million lines) in about 45 seconds with Perl.
