当前位置：文江博客话题详情

Sphinx 搜索的数据库定义

发布于 2024-12-11 06:46:15 字数 2167 浏览 4 评论 0原文

背景

我正在创建一个 MySQL 数据库来存储课程等项目，其中单个课程可能有许多属性。例如：

单个课程可能具有以下任何或全部属性：

标题 (varchar)
次要标题 (varchar)
说明（文本）
日期
时间
特定位置 (varchar;例如，White Hall Room 7）
一般位置（varchar；例如，拉斯维加斯，内华达州）
位置坐标（浮点数；例如纬度、经度）
等。

数据库设置如下：

存储特定课程信息的表：

课程表：

Course_ID（每个课程的主键唯一 ID）
Creator_ID（创建者的唯一 ID）
Creation_Date（课程创建的日期时间）
Modified_Date（这是修改课程的最新时间戳）

存储的表每个课程的多个属性设置如下：

course_attributes 表：

Attribute_ID（每个属性的唯一ID）
Course_ID（具体课程属性的引用）
Attribute（varchar 定义属性；例如“标题”）
值（包含指定属性值的文本；例如“我的课程标题”）

我想

使用 sphinx 搜索来搜索该数据库。通过此搜索，我有不同的字段权重不同，例如：“标题”比“描述”更重要。

我希望拥有的具体搜索字段是：

标题
日期
位置（字符串）
位置（地理 - 纬度/经度）

问题

我应该在 Mysql 中定义一个视图来根据 ' 组织属性标题'、'描述'等，或者有没有办法定义我的 sphinx.conf 文件以了解特定属性？

我愿意接受解决此问题的所有建议问题，是否是重新排列数据库/表或我的搜索方式。

如果您需要任何其他详细信息来帮助我找到解决方案，请告诉我。

提前感谢您的帮助

!--更新--!

好的，在阅读了一些答案后，我觉得我应该提供一些额外的信息。

纬度/经度

纬度/经度属性是我在收到大致位置字符串后在内部创建的。我可以按照我希望的任何方式生成值，这意味着我可以将它们一起存储在单个纬度/经度属性中，作为“浮点纬度、浮点长”值或任何其他所需的格式。仅当从初始位置字符串生成并验证它们后，才会执行此操作。这是为了防止 @X-Zero 和 @Cody 所建议的格式错误的数据。

请记住，纬度和经度只是说明需要使该字段可搜索，而不是其他任何内容。它只是另一个属性；许多之一。

对搜索结果进行加权

我知道如何为 Sphinx 搜索查询中的结果添加权重：

$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );

这会导致 title 列的权重高于 description 列（如果结构如 @X-Zero 建议的那样）。我的问题更直接地涉及如何将上述逻辑应用于当前的表定义。

数据库结构、视图和效率

利用我对视图的介绍性知识，我想我可以创建一些东西，为每个课程显示一行，其中每个属性都是自己的柱子。我不知道如何实现这一点，或者是否可能。

我对数据库结构不是最有信心，但我按照描述设置表格的原因是因为在很多情况下，并非每门课程的所有字段都会填写，而我尝试要有效率[是的，看来我失败了]。

我认为使用当前的结构，每个属性都将包含一个值，因此不会导致表中浪费空间。或者，如果我有一个包含大量潜在属性的表，我会认为会浪费空间。如果我错了，我很高兴知道为什么我的理解是错误的。

原文

Background

I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:

A single course may have any or all of the following attributes:

Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; eg. White Hall Room 7)
General Location (varchar; eg. Las Vegas, NV)
Location Coords (floats; eg. lat, long)
etc.

The database is set up as follows:

A table storing specific course info:

courses table:

Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (where this is the most recent timestamp the course was modified)

The table storing each courses multiple attributes is set up as follows:

course_attributes table:

Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course attribute is for)
Attribute (varchar definining the attribute; eg. 'title')
Value (text containing value of specified attribute; eg. 'Title Of My Course')

Desire

I would like to search this database using sphinx search. With this search, I have different fields weighing different amounts, for example: 'title' would be more important than 'description'.

Specific search fields that I wish to have are:

Title
Date
Location (string)
Location (geo - lat/long)

The Question

Should I define a View in Mysql to organize the attributes according to 'title', 'description', etc., or is there a way to define my sphinx.conf file to understand specific attributes?

I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.

Let me know if you need any additional details to help me find a solution.

Thanks in advance for the help

!--Update--!

OK, so after reading some of the answers, I feel that I should provide some additional information.

Latitude / Longitude

The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data as @X-Zero and @Cody have suggested.

Keep in mind that the latitude and longitude was merely illustrating the need to have that field be searchable as opposed to anything more than that. It is simply another attribute; one of many.

Weighting Search Results

I know how to add weights to results in a Sphinx search query:

$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );

This causes the title column to have a higher weight than the description column if the structure was as @X-Zero suggested. My question was more directed to how one would apply the above logic with the current table definition.

Database Structure, Views, and Efficiency

Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.

I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].

I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

兮子 2024-12-18 06:46:15

首先我要说的是，我从未听说过 Sphinx，也（显然）使用过它。然而，从数据库的角度来看......

像这样做多域列是一个可怕（我会追捕并杀死你）的想法。一方面，不可能进行有意义的索引或排序。您还必须祈祷您不会获得带有文本数据的 latitude 属性（并且因为这只能通过编程方式得到加强，我将保证这种情况会发生） - 这样做会导致所有基于距离的公式来崩溃。说到位置，如果有人存储没有经度的纬度（请注意，无论您是否存储单个GeoLocation，这都是可能的） > 属性，或成对）？

最好的办法是执行以下操作：

找出始终需要哪些属性。这些属于course 表（...大部分）。
对于每组相关可选属性，创建一个表。例如，位置（尽管这可能是必需的...），其中包含纬度/经度、城市、州、地址、房间等。允许列为 null< /code>able（在集合中 - 添加约束，以便用户不能仅添加经度而不添加纬度）。
为每组常见查询添加一个视图。即使（也许特别是）如果您坚持使用当前的设计，也可以使用视图。这促进了数据库的逻辑和物理实现之间的分离。（这假设通过 SQL 进行搜索）然后您将能够通过指定 view_column 为 null 或 view_column = input_parameter 或任何一个来进行搜索。
对于加权搜索（假设动态加权），您的查询将需要使用左连接（也在视图内 - 请记录这一点），并使用准备好的语句主机参数（只是省去麻烦）试图自己逃避事情）。检查每组参数（例如，纬度和经度），并将输入权重分配给新列（每个属性），该新列可以汇总为“总计”列（必须超过某个阈值）。

编辑：

使用视图：
对于您的结构，您通常要做的就是多次左连接到attributes表（每个需要的属性一个），关闭attribute code> （实际上应该是表的 int FK；您不希望其中同时包含“title”和“Title”）并加入 course_id - 值将是包含在选择中。使用这种技术，可以很简单地获取列列表，然后您可以在 Sphinx 中对其进行明显的权重。
这样做的问题是，如果您需要进行任何数据转换 - 您打赌如果类型发生变化，您将能够找到所有转换。当使用强类型列时，这介于琐碎（很可能最终得到一个唯一命名的列）和不必要（视图通常从查询中的字段中获取其数据类型定义）之间；根据您的架构，您可能最终会发现太多误报。

数据库效率：
你是对的，未填充的列是浪费空间。通常，当某些东西是可选的（ish）时，这意味着您可能需要一个额外的表。这就是为什么我建议将位置拆分到它自己的表中：这可以防止不需要位置的事件（...什么？）“浪费”空间，但会强制任何需要位置的事件定义一个位置来指定所有必需的信息。以这种方式拆分还有一个额外的好处：如果多个事件都使用同一位置（...我们希望不是同时发生），交叉引用表将为您节省很多的空间。远远超出您的 attributes 表的能力（毕竟您仍然需要存储每个事件的完整位置）。如果您仍然有很多“可选”属性，我听说 NoSQL 就是为这类事情而设计的（但我还没有真正研究过）。然而，除此之外，额外一张桌子的成本是微不足道的；内部数据的成本可能不会，但所需的空间会根据所存储数据的感知价值进行权衡。请记住，磁盘空间相对便宜 - 开发人员/维护人员时间才是昂贵的。

地址旁注：
您可能想要创建一个address 表。这将完全脱离事件信息，并且将包括（除其他外）预先计算的纬度/经度（在推荐的数据类型中 - 我不知道 >它是什么，但它肯定不是逗号分隔的字符串）。然后，您将拥有一个 event_address 表，该表将是事件及其发生地点之间的交叉引用 - 如果有其他信息（例如 room），那么应该保存在引用的location表中（而不是直接引用address）。一旦计算出纬度/经度值，您就永远不需要更改它。

关于纬度/经度后续更新的想法：
虽然自己指定纬度/经度值更好，但您需要将它们作为地址表的必需部分（或者纯粹的纬度/经度表的一部分/补充）。坦率地说，任何类型的多值列（分隔列表）都只会带来麻烦 - 每次搜索它们时都必须解析它们（以及其他相关问题）。一旦你把它们分开排，其中的一个最终会被丢弃——如果有必要，墨菲本人将亲自干预。此外，在不同时间从地址更新它们将导致地址具有不匹配的纬度/经度对；您最好的选择是在插入时计算此信息（有许多网络服务可以为您查找此信息）。

多域表：
对于多域表，您基本上可以赌域键（属性）永远不会与值（错误，值）不同步。我不在乎你有多好，在某个地方，不知何故，它会发生：在我的公司，我们的遗留应用程序中有其中一个（它存储了 FK 链接以及 FK 引用的文件）到，以及一个属性）。在某一时刻，应用程序被安装在生产环境中，该应用程序立即开始存储正确的文件链接，但对于给定的属性类别，FK 链接到不同的文件。值得庆幸的是，另一个文件中有审计记录，可以逆转这一情况（……据他们所知）。

总结：
重新访问您的必需/可选数据。不要害怕创建额外的表，每个表对应一个实体，每列对应一个域；您还需要关系表。您可能还希望将审核数据 (last_updated_time) 放在一组单独的表中（单域表在这方面将有很大帮助）。

Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...

Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be reinforced programatically, I'm going to garuantee this will happen) - doing so will cause all distance based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute, or the pair)?

Your best bet is to do the following:

Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table. For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable (in sets - add constraints so users can't add just longitude and not latitude).
For every set of common queries add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes seperation between the logical and physical implementations of the database. (This assumes searching by SQL) You will then be able to search by specifying view_column is null or view_column = input_parameter or whichever.
For weighted searching (assuming dynamic weighting) your query will need to use left joins (inside the view as well - please document this), and use prepared-statement host-parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold).

EDIT:

Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. When using strongly typed columns, this is between trivial (the likelyhood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.

Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table. Which is Why I suggested splitting off location into it's own table: this prevents events which don't need a location (... what?) from 'wasting' the space, but then forces any event that defines a location to specify all required information. There's an additional benefit about splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space. Way more than your attributes table ever could (you're still having to store the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). However, other than that, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.

Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.

Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at different times from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of webservices to find this information for you).

Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are, somewhere, somehow, it's going to happen: at my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able tell).

In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).

回复收藏 0 原文