How to create indexes for huge amounts of data (MySQL)



I am currently evaluating a strategy for storing supplier catalogs.
A catalog can contain anywhere from 100 to 250,000 items, and each item may have multiple errors. The application should support browsing of catalog items:

  • Group by type of error, category, manufacturer, supplier, etc.
  • Browse the items in any group, and be able to sort and search on multiple columns (part ID, name, price, etc.).

The question is: when I have to provide "multiple SEARCH, SORT and GROUP" functionality, how should I create the indexes?

According to the MySQL documentation and blog posts about indexing, it seems an index on an individual column will not be used by every query.

Creating one multi-column index is also not specific enough for my case.

There may be 20-30 combinations of grouping, searching and sorting.

How do I scale this, and how can I make search fast?

I expect to handle 50 million records of data.

I am currently evaluating with 15 million.

Suggestions are welcome.

CREATE TABLE CATALOG_ITEM
(
    AUTO_ID BIGINT PRIMARY KEY AUTO_INCREMENT,
    TENANT_ID VARCHAR(40) NOT NULL,
    CATALOG_ID VARCHAR(40) NOT NULL,
    CATALOG_VERSION INT NOT NULL,
    ITEM_ID VARCHAR(40) NOT NULL,
    VERSION INT NOT NULL,
    NAME VARCHAR(250) NOT NULL,
    DESCRIPTION VARCHAR(2000) NOT NULL,
    CURRENCY VARCHAR(5) NOT NULL,
    PRICE DOUBLE NOT NULL,
    UOM VARCHAR(10) NOT NULL,
    LEAD_TIME INT DEFAULT 0,
    SUPPLIER_ID VARCHAR(40) NOT NULL,
    SUPPLIER_NAME VARCHAR(100) NOT NULL,
    SUPPLIER_PART_ID VARCHAR(40) NOT NULL,
    MANUFACTURER_PART_ID VARCHAR(40),
    MANUFACTURER_NAME VARCHAR(100),
    CATEGORY_CODE VARCHAR(40) NOT NULL,
    CATEGORY_NAME VARCHAR(100) NOT NULL,
    SOURCE_TYPE INT DEFAULT 0,
    ACTIVE BOOLEAN,
    SUPPLIER_PRODUCT_URL VARCHAR(250),
    MANUFACTURER_PRODUCT_URL VARCHAR(250),
    IMAGE_URL VARCHAR(250),
    THUMBNAIL_URL VARCHAR(250),
    UNIQUE(TENANT_ID,ITEM_ID,VERSION),
    UNIQUE(TENANT_ID,CATALOG_ID,ITEM_ID)
);

CREATE TABLE CATALOG_ITEM_ERROR
(
    ITEM_REF BIGINT,
    FIELD VARCHAR(40) NOT NULL,
    ERROR_TYPE INT NOT NULL,
    ERROR_VALUE VARCHAR(2000)
);


肩上的翅膀 2024-12-16 05:42:36


If you are determined to do this solely in MySQL, then you should be creating indexes that will work for all your queries. It's OK to have 20 or 30 indexes if there are 20-30 different queries doing your sorting, but you can probably do it with far fewer indexes than that.
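For instance, because MySQL can use any leftmost prefix of a multi-column index, a handful of composite indexes can often cover many of those combinations. The statements below are only a sketch based on the schema in the question; the index names and column choices are assumptions, and the right set depends on which group/search/sort combinations are actually common:

-- Hypothetical composite indexes; each leftmost prefix also serves queries
-- that filter on fewer columns (e.g. (TENANT_ID, CATALOG_ID) alone).
CREATE INDEX IX_ITEM_CATEGORY_NAME
    ON CATALOG_ITEM (TENANT_ID, CATALOG_ID, CATEGORY_CODE, NAME);

CREATE INDEX IX_ITEM_SUPPLIER_PRICE
    ON CATALOG_ITEM (TENANT_ID, CATALOG_ID, SUPPLIER_ID, PRICE);

CREATE INDEX IX_ITEM_MANUFACTURER_PART
    ON CATALOG_ITEM (TENANT_ID, CATALOG_ID, MANUFACTURER_NAME, SUPPLIER_PART_ID);

-- The error table currently has no index on ITEM_REF at all, so grouping
-- items by error type would need something like this.
CREATE INDEX IX_ERROR_ITEM_TYPE
    ON CATALOG_ITEM_ERROR (ITEM_REF, ERROR_TYPE);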

You also need to plan how these indexes will be maintained. I'm assuming that because this is for supplier catalogs, the data is not going to change much. In that case, simply creating all the indexes you need should do the job nicely. If rows are going to be edited or inserted frequently in real time, then you have to factor that into your indexing, and having 20 or 30 indexes might not be such a good idea (since MySQL would constantly have to update them all). You also have to consider which MySQL storage engine to use. If your data never changes, MyISAM (the default engine in older MySQL versions, basically fast flat files) is a reasonable choice. If it changes a lot, you should use InnoDB so you get row-level locking. InnoDB also lets you define a clustered index (in InnoDB the primary key is the clustered index), which controls the physical order in which rows are stored on disk. So if you have one particular query that runs 99% of the time, you could cluster the table for it; the data would already be in the right order on disk and would come back very fast. However, inserts and updates then have to maintain that physical order, which is not fast for large volumes of data. You would not choose this if the data changed at all frequently, and you might have to batch-load data updates (such as a new version of a supplier's million rows). Again, it comes down to whether you will update the data never, now and then, or constantly in real time.
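As a rough sketch of that clustered-index idea (purely illustrative, assuming the dominant access path is tenant + catalog + item): since the InnoDB primary key is the clustered key, clustering the table that way means replacing the surrogate primary key and keeping AUTO_ID as a unique key so CATALOG_ITEM_ERROR.ITEM_REF can still reference it.

-- Sketch only: cluster CATALOG_ITEM by an assumed dominant lookup order.
ALTER TABLE CATALOG_ITEM
    ENGINE = InnoDB,
    DROP PRIMARY KEY,
    ADD UNIQUE KEY UK_CATALOG_ITEM_AUTO_ID (AUTO_ID),
    ADD PRIMARY KEY (TENANT_ID, CATALOG_ID, ITEM_ID, VERSION);

Note that every InnoDB secondary index stores a copy of the primary-key columns, so a wide clustered key like this makes all other indexes on the table larger.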

Finally, you should consider alternatives to doing this in MySQL. There are a lot of really good search products out there now, such as Apache Solr or Sphinx (mentioned in a comment above), which could make your life a lot easier when coding the search interfaces themselves. You could index the catalogs in one of these and then use it to provide some really powerful search features such as full-text and/or faceted search. "It's like having a private Google search engine indexing your stuff" is a good way to describe how they work. It takes time to write the code that talks to the search server, but you will most likely save that time by not having to wrap your head around the indexing problem and the other issues I mentioned above.

If you do just go with creating all the indexes, though, learn how to use the EXPLAIN command in MySQL. It lets you see what MySQL's plan for executing a query will be. You can create indexes, then re-run EXPLAIN on your queries and see how MySQL is going to use them. This way you can make sure that each of your query patterns has an index supporting it and is not falling back to scanning the entire table to find things. With as many rows as you're talking about, every query MUST be able to use an index to find its data. If you get that right, it'll perform fine.
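A minimal example of that workflow, using a hypothetical browse query against the schema above (the index name refers to the sketch earlier in this answer):

-- Browse one supplier's items in a catalog, sorted by price.
EXPLAIN
SELECT ITEM_ID, NAME, PRICE
FROM CATALOG_ITEM
WHERE TENANT_ID = 't1'
  AND CATALOG_ID = 'c1'
  AND SUPPLIER_ID = 's1'
ORDER BY PRICE;

-- With an index like IX_ITEM_SUPPLIER_PRICE (TENANT_ID, CATALOG_ID, SUPPLIER_ID, PRICE),
-- the plan should show type=ref on that key with no "Using filesort" in Extra;
-- type=ALL would mean a full table scan, i.e. a missing or unusable index.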
