Proper queries and indexes for a logging table in SQL

Assume a table named 'log' that contains a huge number of records.

The application usually retrieves data by simple SQL:

SELECT * 
FROM log 
WHERE logLevel=2 AND (creationData BETWEEN ? AND ?)

logLevel and creationData have indexes, but the sheer number of records makes retrieving the data take longer.

How do we fix this?

8 Answers

贪恋 2024-09-21 09:09:30

Look at your execution plan / "EXPLAIN PLAN" result. If you are retrieving large amounts of data, there is very little you can do to improve performance. You could try changing your SELECT statement to include only the columns you are interested in; however, that won't change the number of logical reads you are doing, so I suspect it will have only a negligible effect on performance.
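For illustration - the exact commands differ by engine - an Oracle sketch might look like this (the date literals are placeholders); on SQL Server you would instead look at the actual execution plan in SSMS, or run SET STATISTICS IO ON before the query:

-- Oracle: capture the plan for the problem query, then display it
EXPLAIN PLAN FOR
  SELECT *
  FROM log
  WHERE logLevel = 2
    AND creationData BETWEEN DATE '2024-09-01' AND DATE '2024-09-02';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);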

If you are only retrieving small numbers of records, then an index on LogLevel and an index on CreationDate should do the trick.

UPDATE: SQL Server is mostly geared around querying small subsets of massive databases (e.g. returning a single customer record out of a database of millions). It's not really geared up for returning truly large data sets. If the amount of data you are returning is genuinely large, then there is only so much you can do, and so I'd have to ask:

What is it that you are actually trying to achieve?

  • If you are displaying log messages to a user, then they are only going to be interested in a small subset at a time, so you might also want to look into efficient methods of paging SQL data - even if you return only 500 or so records at a time, it should still be very fast (see the paging sketch after this list).

  • If you are trying to do some sort of statistical analysis, then you might want to replicate your data into a data store more suited to statistical analysis. (Not sure what, however; that isn't my area of expertise.)
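A minimal paging sketch, assuming SQL Server 2012+ OFFSET/FETCH and a hypothetical message column (the question doesn't name the table's columns):

DECLARE @from datetime2 = '2024-09-01', @to datetime2 = '2024-09-02';

-- page 1: the first 500 matching rows, ordered by the date column
SELECT creationData, logLevel, message
FROM log
WHERE logLevel = 2
  AND creationData BETWEEN @from AND @to
ORDER BY creationData
OFFSET 0 ROWS FETCH NEXT 500 ROWS ONLY;  -- increase OFFSET by 500 per page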

稚然 2024-09-21 09:09:30

1: Never use SELECT *
2: Make sure your indexes are correct, and your statistics are up to date
3: (Optional) If you find you're not looking at log data past a certain time (in my experience, if it happened more than a week ago, I'm probably not going to need the log for it), set up a job to archive it to some back-up and then remove the unused records. That will keep the table size down, reducing the amount of time it takes to search the table.
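A hedged sketch of such an archive job, assuming SQL Server syntax and a hypothetical log_archive table with the same structure as log:

-- compute the cutoff once so the archive and the delete agree exactly
DECLARE @cutoff datetime2 = DATEADD(day, -7, SYSDATETIME());

-- copy the old rows to the back-up table, then remove them from the live table
INSERT INTO log_archive
SELECT * FROM log WHERE creationData < @cutoff;

DELETE FROM log WHERE creationData < @cutoff;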

长伴 2024-09-21 09:09:30

Depending on what kind of SQL database you're using, you might look into horizontal partitioning. Oftentimes this can be done entirely on the database side of things, so you won't need to change your code.
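What that looks like varies by engine. A MySQL sketch, assuming log has no primary key (or creationData is part of it) and with placeholder partition boundaries:

-- split the table into monthly range partitions on the date column
ALTER TABLE log
PARTITION BY RANGE (TO_DAYS(creationData)) (
  PARTITION p2024_08 VALUES LESS THAN (TO_DAYS('2024-09-01')),
  PARTITION p2024_09 VALUES LESS THAN (TO_DAYS('2024-10-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);

Queries constrained on creationData then only touch the relevant partitions, and old months can be dropped cheaply with ALTER TABLE log DROP PARTITION.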

花想c 2024-09-21 09:09:30

Do you need all columns? The first step should be to select only those you actually need to retrieve.

Another aspect is what you do with the data after it arrives in your application (populate a data set / read it sequentially / ...?).

There may also be some potential for improvement on the side of the processing application.

You should ask yourself these questions:

Do you need to hold all the returned data in memory at once? How much memory do you allocate per row on the retrieving side? How much memory do you need at once? Can you reuse some memory?

我要还你自由 2024-09-21 09:09:30

A couple of things:

Do you need all the columns? People usually do SELECT * because they are too lazy to list the 5 columns they need out of the 15 the table has.

Get more RAM. The more RAM you have, the more data can live in cache, which is 1000 times faster than reading from disk.

蹲在坟头点根烟 2024-09-21 09:09:30

For me, there are two things you can do:

  1. Partition the table horizontally based on the date column

  2. Use the concept of pre-aggregation.

Pre-aggregation:
In pre-aggregation you would have a "logs" table, a "logs_temp" table, a "logs_summary" table and a "logs_archive" table. The structure of the logs and logs_temp tables is identical. The flow of the application would be this way: all logs are logged in the logs table, then every hour a cron job runs that does the following things (a hedged SQL sketch of the whole job follows the steps):

a. Copy the data from the logs table to "logs_temp" table and empty the logs table. This can be done using the Shadow Table trick.

b. Aggregate the logs for that particular hour from the logs_temp table

c. Save the aggregated results in the summary table

d. Copy the records from the logs_temp table to the logs_archive table and then empty the logs_temp table.
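A minimal sketch of steps a-d, assuming MySQL and an invented logs_summary schema (log_hour, logLevel, entry_count) - none of these summary column names come from the answer:

-- a. move the current rows aside and empty the live table
--    (the Shadow Table trick would instead swap in an empty copy atomically
--     via RENAME TABLE, avoiding the window between these two statements)
INSERT INTO logs_temp SELECT * FROM logs;
DELETE FROM logs;

-- b + c. aggregate the hour and save the result in the summary table
INSERT INTO logs_summary (log_hour, logLevel, entry_count)
SELECT DATE_FORMAT(creationData, '%Y-%m-%d %H:00:00') AS log_hour,
       logLevel,
       COUNT(*)
FROM logs_temp
GROUP BY log_hour, logLevel;

-- d. archive the raw rows and empty the temp table
INSERT INTO logs_archive SELECT * FROM logs_temp;
TRUNCATE TABLE logs_temp;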

This way results are pre-aggregated in the summary table.

Whenever you wish to select the result, you would select it from the summary table.

This way the selects are very fast, because the number of records is far smaller once the data has been pre-aggregated per hour. You could even increase the threshold from an hour to a day. It all depends on your needs.

Now the inserts will be fast too, because the logs table only holds the last hour of data, so index maintenance on inserts takes far less time than it would on a very large data set, which keeps the inserts fast.

You can read more about the Shadow Table trick here.

I employed the pre-aggregation method in a news website built on WordPress. I had to develop a plugin for the news website that would show recently popular (popular during the last 3 days) news items; there are around 100K hits per day, and this pre-aggregation thing has really helped us a lot. The query time came down from more than 2 secs to under a second. I intend to make the plugin publicly available soon.

洛阳烟雨空心柳 2024-09-21 09:09:30

As per other answers, do not use 'select *' unless you really need all the fields.

logLevel and creationData have indexes

You need a single index containing both values. The order you put them in will affect performance, but assuming you have a small number of possible logLevel values (and the data is not skewed), you'll get better performance putting creationData first.
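A hedged sketch of the two orderings (the index names are made up); this answer's reasoning favours the first:

-- date first: also usable by queries that filter on creationData alone
CREATE INDEX ix_log_date_level ON log (creationData, logLevel);

-- level first: pays off only if logLevel is selective, which it rarely is
-- CREATE INDEX ix_log_level_date ON log (logLevel, creationData);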

Note that, optimally, an index reduces the cost of a query to O(log N), i.e. it will still get slower as the number of records increases.

C.

念三年u 2024-09-21 09:09:30

I really hope that by creationData you mean creationDate.

First of all, it is not enough to have indexes on logLevel and creationData. If you have 2 separate indexes, Oracle will only be able to use 1.
What you need is a single index on both fields:

CREATE INDEX i_log_1 ON log (creationData, logLevel);

Note that I put creationData first. This way, if you only put that field in the WHERE clause, it will still be able to use the index. (Filtering on just the date seems a more likely scenario than filtering on just the log level.)

Then, make sure the table is populated with data (as much data as you will use in production) and refresh the statistics on the table.

If the table is large (at least a few hundred thousand rows), use the following code to refresh the statistics:

DECLARE
  l_ownname          VARCHAR2(255) := 'owner'; -- Owner (schema) of the table to analyze
  l_tabname          VARCHAR2(255) := 'log';   -- Table to analyze
  l_estimate_percent NUMBER(3)     := 5;       -- Percentage of rows to estimate (NULL means compute)
BEGIN
  dbms_stats.gather_table_stats(
    ownname          => l_ownname,
    tabname          => l_tabname,
    estimate_percent => l_estimate_percent,
    method_opt       => 'FOR ALL INDEXED COLUMNS',
    cascade          => TRUE
  );
END;

Otherwise, if the table is small, use

ANALYZE TABLE log COMPUTE STATISTICS FOR ALL INDEXED COLUMNS;

Additionally, if the table grows large, you should consider partitioning it by range on the creationDate column. See the Oracle partitioning documentation for the details.
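A hedged sketch of that, assuming Oracle 11g+ interval partitioning; the column list and boundary date are illustrative only (in practice you'd build the partitioned table alongside the existing one and migrate the rows):

-- monthly range partitions, created automatically as data arrives
CREATE TABLE log (
  creationDate DATE           NOT NULL,
  logLevel     NUMBER(2),
  message      VARCHAR2(4000)
)
PARTITION BY RANGE (creationDate)
INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(
  PARTITION p_first VALUES LESS THAN (DATE '2024-01-01')
);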
