Searching data spanning many tables - design question

Posted 2024-09-11 13:12:19


I have a structure of about 10 tables. This structure works great for data entry. However, I also need to perform complex and fast searches on that dataset. There are three approaches to this I can think of:

  1. Join all those tables in the select. This is pretty slow and likely not a good approach. If it's relevant, the database is Informix; I've looked into creating views, hoping they'd be better optimized, but preliminary testing shows that selects on views are even slower than the joins themselves, though maybe I'm missing some Informix options. Maybe there is some way to make Informix pre-join tables and create indexes on those, but from what I've seen it's not likely. Both joins and views are slower than approach #2:

  2. Single synthetic table which is updated periodically. This seems like the right approach, especially since searches don't need to run on real-time data - actually, I can probably get away with updating the synthetic table daily. Data size would be about 500k-1000k rows.

  3. Memcached and similar in-memory solutions. At the moment there is no such infrastructure in place, and this probably doesn't warrant implementing it; however, it is something I'll look at once the synthetic table becomes too slow. Also, there are lots of search parameters and even the first query has to be fast, so this approach would have to eagerly cache all data. Of course, I'll probably cache anything I can even with approaches 1 and 2.

I'd like your thoughts on this. Is there a magic bullet I'm missing? What have you used in similar situations?


Comments (7)

め七分饶幸 2024-09-18 13:12:19


Option 1.

Depending on the volume of data in your tables, it should be possible to join 10 tables in a reasonable amount of time. How slow is too slow for you?

Here are the biggest two things you can do to make sure your queries are running smoothly.

First make sure your logical table design is really logical. Bad table design and bad column design are responsible for a large amount of unnecessary slowdowns in database apps. The fact that data entry is working well is a pretty strong indication that your table design is pretty good. Is your design normalized? Or somewhat normalized?

Second, create the right indexes. The right indexes can make a query run a hundred times faster, depending on the circumstances. In order to build the right indexes you need to know a little bit about how indexes work, about the query you are giving, about the volume of data, and about the strategy that the DBMS chooses when executing the query.
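As a minimal illustration of the point about indexes (using SQLite here, not Informix; the `orders` table and its columns are hypothetical), a composite index that matches both the filter and the sort column changes the plan from a full scan to an index search:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for one of the 10 entry tables.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, placed_on TEXT)"
)

query = "SELECT * FROM orders WHERE customer_id = 7 ORDER BY placed_on"

# Without a matching index the optimizer must scan the whole table.
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before[0][3])  # e.g. "SCAN orders"

# A composite index covering the equality filter and the sort column.
conn.execute("CREATE INDEX idx_orders_cust_date ON orders (customer_id, placed_on)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after[0][3])   # e.g. "SEARCH orders USING INDEX idx_orders_cust_date ..."
```

The same habit applies on Informix: read the optimizer's plan for the real query, then build the index the plan actually wants.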

Option 2.

This may well be your best bet. Learn a little about data marts or data warehouses. That's how database people deal with design issues involving one schema for data entry, a different schema for queries, and a process to keep the two schemas in sync.

There are a number of design issues here, and rather than try to enumerate them, I'm just going to suggest that you bone up on data marts.
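The data-mart idea in miniature might look like the sketch below (SQLite standing in for Informix; all table and column names are invented): keep the normalized tables for entry, and rebuild one flat, indexed query table from a scheduled batch job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two of the (hypothetical) normalized entry tables.
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE booking  (id INTEGER PRIMARY KEY, customer_id INTEGER, destination TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO booking  VALUES (10, 1, 'Oslo'), (11, 2, 'Rome');
""")

def refresh_search_table(conn):
    """Rebuild the flat query table; run this from a daily batch job."""
    conn.executescript("""
        DROP TABLE IF EXISTS booking_search;
        CREATE TABLE booking_search AS
            SELECT b.id AS booking_id, c.name AS customer_name, b.destination
            FROM booking b JOIN customer c ON c.id = b.customer_id;
        CREATE INDEX idx_bs_dest ON booking_search (destination);
    """)

refresh_search_table(conn)
rows = conn.execute(
    "SELECT customer_name FROM booking_search WHERE destination = 'Oslo'"
).fetchall()
print(rows)  # [('Ann',)]
```

Searches then hit only `booking_search`, which can be indexed purely for the search patterns, and stale data is bounded by the refresh interval.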

淡水深流 2024-09-18 13:12:19


In-memory databases take millisecond database access time and turn them into microsecond access time. Is this an automated trading system or 911 dispatch or aviation traffic control system? If not you'd be hard pressed to show a requirement for microsecond access times.

Walter has it correct when he said "how slow is too slow?" Define your requirements clearly: is this an internal or external SLA? Do you have requirements, or does this just 'feel' too slow?

Learn to read an execution plan and examine the plan for your slow query. Is there a cardinality estimate that's way off? Does it anticipate 1 row when you know there are 100k rows? Is it doing a full table scan on a table you expect 1 row from?

If the query looks as efficient as it can be, trace it... see if you can identify if there are any time sinks that you're not expecting. Is it fine when done solo, but poor performing under load? Honestly, 10 tables with not a lot of data to begin with really shouldn't be super slow.

I think Gilbert is overestimating your problem. With no more than 1M records, a full dimensional model seems overkill. From the tone of your question it sounds like you're really just trying to speed up a query or three - not creating the start of an entire BI platform. If that's the case, look back at the explain plan, see if you can identify major amounts of work that could be reduced via the pre-calculation of some joins (denormalization), build that new materialized view... try the query; if there's no improvement, drop it and try something else... do not keep building on unsuccessful attempts.

Now I see the travel industry comment

So you have 2 classes of room, 30 doubles and 20 singles, and you have 80 seats on the plane. But the doubles can add an extra bed so you might run out of seats before you run out of rooms.

Rooms Remaining
---------------
5 Single Remain
10 Doubles Remain

Seats Remaining
---------------
8 Plane seats

Since there's one plane and 2 room types, you'll just Cartesian those together.

Package Type       Rooms      Seats      Packages Available
------------       ------     -----      ------------------
 Single              5           8             5
 Double              10          8             8

Notice that Packages Available is a simple calculation: LOWEST(Rooms, Seats)

In your comment you said

even if rooms are available, the package is officially sold out.

Package Type       Rooms      Seats      Packages Available
------------       ------     -----      ------------------
 Single              5           0             0
 Double              0           0             0

So here's that case... you've filled the double rooms and 5 of them are triples... so the plane is full and there are 5 extra single rooms. But our LOWEST calculation does the work to show there are no Single Packages available.

Am I close?
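The LOWEST(Rooms, Seats) rule from the tables above is trivial to state in code; a minimal sketch (plain Python, with the data structures invented for illustration):

```python
def packages_available(rooms_left, seats_left):
    """Packages that can still be sold: limited by whichever runs out first,
    rooms of that type or seats on the plane."""
    return {room_type: min(rooms, seats_left)
            for room_type, rooms in rooms_left.items()}

# The answer's first table: 5 singles, 10 doubles, 8 plane seats left.
open_case = packages_available({"Single": 5, "Double": 10}, 8)
print(open_case)      # {'Single': 5, 'Double': 8}

# The sold-out case: the plane is full, so every package shows zero
# even though 5 single rooms remain.
sold_out = packages_available({"Single": 5, "Double": 0}, 0)
print(sold_out)       # {'Single': 0, 'Double': 0}
```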

筱果果 2024-09-18 13:12:19


You're right on track.

There's no magic bullet for this, because your tables are really spread out. What I've done in the past is do something like your Option 2.

Let's say I have a table of Accounts with an AccountID as the PK. I'd create another table called AccountSearch which would be related in a many-to-one relationship with Accounts. AccountSearch would contain a set of strings and their associated IDs.

If you wanted fuzzier searching, you could also manipulate the strings using NYSIIS or Soundex (yuck), or simply remove whitespace. You could also implement full-text searching, but that is often overkill.

Account
-------
AccountID (PK)
Name
OwnerName

AccountSearch
-------------
SearchString (PK)
AccountID (PK)
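A rough sketch of the Account/AccountSearch idea (SQLite for illustration; the normalization step here is just lower-casing and whitespace removal, where a NYSIIS or Soundex transform could be substituted):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Account (AccountID INTEGER PRIMARY KEY, Name TEXT, OwnerName TEXT);
    CREATE TABLE AccountSearch (SearchString TEXT, AccountID INTEGER,
                                PRIMARY KEY (SearchString, AccountID));
""")

def normalize(s):
    # Cheap fuzziness: lower-case and strip all whitespace.
    return "".join(s.lower().split())

def add_account(conn, account_id, name, owner):
    """Insert the account, then one AccountSearch row per distinct search key."""
    conn.execute("INSERT INTO Account VALUES (?, ?, ?)", (account_id, name, owner))
    for s in {normalize(name), normalize(owner)}:
        conn.execute("INSERT INTO AccountSearch VALUES (?, ?)", (s, account_id))

add_account(conn, 1, "Acme Corp", "Jane Doe")
hits = conn.execute("SELECT AccountID FROM AccountSearch WHERE SearchString = ?",
                    (normalize("ACME corp"),)).fetchall()
print(hits)  # [(1,)]
```

Because the query is a single equality lookup on AccountSearch's primary key, it stays fast regardless of how many source tables fed the search strings.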
睫毛上残留的泪 2024-09-18 13:12:19


Option 2 is called a data mart or a data warehouse. For the cost of additional storage, you can have an operational database and a query database.

Since you say there are lots of search parameters, you can create your query tables using a star schema, based on the search parameters.
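A toy star schema along those lines (SQLite for illustration; every table and column name here is hypothetical): one fact table whose keys are the search dimensions, with each dimension table holding one search parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fact table keyed by the search dimensions.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE fact_booking (
        customer_key INTEGER REFERENCES dim_customer,
        date_key     INTEGER REFERENCES dim_date,
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Ann');
    INSERT INTO dim_date     VALUES (1, '2010-09-11');
    INSERT INTO fact_booking VALUES (1, 1, 199.0);
""")

# Any search parameter becomes a filter on one small dimension table.
total = conn.execute("""
    SELECT SUM(f.amount) FROM fact_booking f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    WHERE c.name = 'Ann'
""").fetchone()[0]
print(total)  # 199.0
```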

等风来 2024-09-18 13:12:19


How often do you need search criteria on all the tables?

One way of working that may improve performance is to make sure the main query adapts to the search criteria, only joining the necessary tables, and retrieving just the primary key values from the main table. This data might be saved into a temporary table, or in a scroll cursor, or fetched back to the client.

Then, when you need to collect the information for display, you use a (prepared) SELECT that collects exactly the data you need for the row(s) you need.

The advantage of this is that (for many queries) you seldom specify conditions on all 10 tables, so you don't need to do the 10-way join while discovering the relevant records. And the single-row operation is all joins on keys so the lookups are indexed with no scanning.

Clearly, you can juggle the criteria; you might select all the data from the primary table, and all the relevant values from one of the secondary tables (which is always needed) but decide not to select any values from the other 8 tables (because they don't always appear in the search criteria), or other variants along these lines.

This assumes you can build dynamic SQL, but that is very seldom an issue.
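The dynamic-SQL construction described above might be sketched like this (Python; the table aliases, join clauses, and field names are all invented for illustration): only the tables actually referenced by the search criteria get joined, and only primary keys are fetched.

```python
# Hypothetical mapping from each search field to the join needed to reach it.
JOINS = {
    "customer_name": "JOIN customer c ON c.id = b.customer_id",
    "hotel_name":    "JOIN hotel h ON h.id = b.hotel_id",
}
FIELD_SQL = {
    "customer_name": "c.name = ?",
    "hotel_name":    "h.name = ?",
    "destination":   "b.destination = ?",  # lives on the main table, no join
}

def build_search_sql(criteria):
    """Return SQL fetching only primary keys, joining only the needed tables."""
    joins = [JOINS[f] for f in criteria if f in JOINS]
    where = " AND ".join(FIELD_SQL[f] for f in criteria)
    return f"SELECT b.id FROM booking b {' '.join(joins)} WHERE {where}"

sql = build_search_sql(["destination", "customer_name"])
print(sql)  # joins customer but never touches hotel
```

The key fetched here would then drive the second, fully-indexed single-row SELECT that collects the display columns.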

尹雨沫 2024-09-18 13:12:19


In the past I have used an implementation similar to #2. You could try creating a view which would basically consist of the searchable fields for each table, e.g.

SELECT Name From Person
UNION SELECT Name FROM Company

Then feed that view into a full-text indexing product such as Sphinx, which can optimize your searching and provide flexible options for weights, terms, etc., as well as scheduling how often your indexes are updated.
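The UNION view from the answer in runnable form (SQLite for illustration; an external indexer such as Sphinx would read its rows the same way any client does):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person  (id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Company (id INTEGER PRIMARY KEY, Name TEXT);
    INSERT INTO Person  VALUES (1, 'Alice Smith');
    INSERT INTO Company VALUES (1, 'Smith Travel');

    -- The view from the answer: one column of searchable strings.
    CREATE VIEW searchable AS
        SELECT Name FROM Person
        UNION SELECT Name FROM Company;
""")

names = [r[0] for r in conn.execute("SELECT Name FROM searchable ORDER BY Name")]
print(names)  # ['Alice Smith', 'Smith Travel']
```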

夏见 2024-09-18 13:12:19

Consolidate your 10 tables into one temporary table.
See: Should I denormalize Loans, Purchases and Sales tables into one table?
