How to omit "THE" when searching with PHP and MySQL

Posted on 2024-08-24 17:27:14

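I am building an "alphabetical search" module for a project.

That is, it looks like this:

A B C D E F ... Z

When I click "A", the results should list the titles that start with "A". The same goes for every other letter.

Now my problem is as follows:

For example, there is a movie named "The Mummy".
At the moment, the movie is listed when the letter "T" is clicked.
But my client's requirement is that "The Mummy" must be listed when the user clicks "M", not "T",
because "a", "an" and "the" are articles and carry no meaning of their own.

I hope everybody can now understand what my problem is...

Any help would be appreciated.

Thanks in advance.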

一片旧的回忆 2024-08-31 17:27:14

Assuming that you do not wish to modify the content of the table (and hence get slightly less efficient queries), the following should do the trick.
(If you do have the leisure of modifying the table, see the suggestions at the end of this answer)

SELECT Title
FROM myTable
WHERE (Title LIKE 'x%' OR Title LIKE 'THE x%')
  -- AND Title NOT REGEXP '^THE [^x]'
ORDER BY Title

Notes:
- x designates the desired letter (example: LIKE 'A%' etc.)
- The extra "AND Title NOT REGEXP" condition is only needed when x is the letter "T": without it, LIKE 'T%' also returns every title that starts with "THE ", such as "The Mummy". For any other letter it is functionally redundant and doesn't change the result.
- MySQL's LIKE only supports the % and _ wildcards, not bracket classes such as [^xyz] (i.e. NOT characters x, y or z), which is why the exclusion is written with REGEXP. The negated class can also be replaced by its positive equivalent, say [A-SU-Z0-9] when x is T.

There are a few other stop words to consider ("A", "AN", "OF"...) but for book or film titles, it is a common practice to only consider "THE". If you must deal with the other articles, the logic can be extended as in:

SELECT Title
FROM myTable
WHERE (Title LIKE 'x%'
    OR Title LIKE 'THE x%'
    OR Title LIKE 'A x%'
    OR Title LIKE 'AN x%')
 -- the following is only needed when "x" is either the letter T or A.
 -- It drops titles that start with an article followed by a letter other than x
 -- (REGEXP is used because MySQL's LIKE has no [^...] bracket classes).
 -- AND Title NOT REGEXP '^(THE|A|AN) [^x]'
ORDER BY Title
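
As a rough illustration of how the queries above could be driven from application code, here is a minimal sketch. It is written in Python against an in-memory SQLite database purely as a runnable stand-in for PHP and MySQL, which is an assumption of this example. SQLite has no built-in REGEXP, so a small user-defined function is registered for it; the function name `titles_for_letter` and the sample rows are hypothetical.

```python
import re
import sqlite3


def _regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    if value is None:
        return 0
    return 1 if re.search(pattern, value, re.IGNORECASE) else 0


def titles_for_letter(conn, letter, articles=("THE", "A", "AN")):
    """Return titles filed under `letter`, ignoring one leading article."""
    letter = letter.upper()
    if len(letter) != 1 or not ("A" <= letter <= "Z"):
        raise ValueError("expected a single letter A-Z")
    # One LIKE pattern for plain titles, plus one per article.
    patterns = [letter + "%"] + [f"{article} {letter}%" for article in articles]
    likes = " OR ".join("Title LIKE ?" for _ in patterns)
    # Drop titles that start with an article followed by a different letter.
    exclude = "^(" + "|".join(articles) + ") [^" + letter + "]"
    sql = (
        f"SELECT Title FROM myTable WHERE ({likes}) "
        "AND Title NOT REGEXP ? ORDER BY Title"
    )
    return [row[0] for row in conn.execute(sql, patterns + [exclude])]


# Sample data (hypothetical) in a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, _regexp)
conn.execute("CREATE TABLE myTable (Title TEXT)")
conn.executemany(
    "INSERT INTO myTable (Title) VALUES (?)",
    [("The Mummy",), ("Titanic",), ("The Thing",), ("Memento",),
     ("A Bug's Life",), ("An American in Paris",), ("Amadeus",),
     ("A Time to Kill",)],
)

print(titles_for_letter(conn, "M"))
print(titles_for_letter(conn, "T"))
```

The letter and the LIKE patterns are passed as bound parameters rather than concatenated into the SQL string; the list of articles is treated as a trusted constant. Passing `articles=("THE",)` reproduces the simpler first query.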

There are better solutions if you can modify the table's contents. Some of them involve pre-computing one or several extra columns (and maintaining them when records are added or changed).

- See for example Cletus' answer in this post for the "sort_column" approach, where the extra column contains the title stripped of any undesirable leading noise word. Besides serving as the filtering field for the OP's initial-letter search, this column can also be used to sort, in a friendlier and more sensible way, lists of titles produced by a filter unrelated to the initial letter or the beginning of the title (say, a search by year).
- A variation on the above is to store only the "effective" initial letter (the first one past the undesired noise word), which makes for a smaller column but a less versatile one.
- The title column itself can be updated to store a modified form of the title in which the leading noise word is moved to the end of the string, in parentheses, for example "Mummy (The)". This practice is quite common in bibliographic-type catalogs.
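
The three variants differ only in what gets stored, so they can be sketched as three small functions over the same article-stripping rule. This is an illustrative Python sketch, not code from the answer: the function names and the list of articles (the, an, a) are assumptions.

```python
import re

# Leading articles to ignore; an assumed list, extend it as needed.
_LEADING_ARTICLE = re.compile(r"^(the|an|a)\s+", re.IGNORECASE)


def sort_title(title):
    """Value for a "sort column": lowercase title without a leading article."""
    return _LEADING_ARTICLE.sub("", title.strip(), count=1).lower()


def effective_initial(title):
    """Value for a one-letter column: first character past the article."""
    return sort_title(title)[:1].upper()


def catalog_form(title):
    """Bibliographic-style form with the article moved to the end."""
    title = title.strip()
    match = _LEADING_ARTICLE.match(title)
    if match is None:
        return title
    return f"{title[match.end():]} ({match.group(1)})"


print(sort_title("The Mummy"))
print(effective_initial("An Education"))
print(catalog_form("The Mummy"))
```

Because the pattern requires whitespace after the article, titles such as "Amadeus" or a title consisting of the single word "A" are left untouched.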

橘虞初梦 2024-08-31 17:27:14

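What you are really asking here is how to remove "stop words" ("the" is just one example; you will also need to remove "of", "a", etc.). Trying to hard-code the set of stop words is a real pain, and you will have to change the code as your corpus changes.

Instead, you should try an algorithm that infers the stop words from your corpus. Algorithms that do this kind of thing are well known and are used by search engines. One that works quite well is called TF/IDF (term frequency / inverse document frequency).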
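
To give an idea of what inferring stop words from the corpus could look like, here is a minimal sketch that uses only the inverse-document-frequency half of TF/IDF and treats each title as one document. Words that occur in a large share of titles get a low IDF and become stop-word candidates. The function names, the sample titles and the `max_idf` threshold are all assumptions made for illustration; a real corpus would need tuning.

```python
import math
from collections import Counter


def idf_scores(titles):
    """Map each word to log(N / number of titles containing it)."""
    doc_freq = Counter()
    for title in titles:
        # Count each word at most once per title.
        doc_freq.update(set(title.lower().split()))
    total = len(titles)
    return {word: math.log(total / count) for word, count in doc_freq.items()}


def stop_word_candidates(titles, max_idf):
    """Words whose IDF is at or below the (assumed) threshold."""
    scores = idf_scores(titles)
    return sorted(word for word, score in scores.items() if score <= max_idf)


titles = [
    "The Mummy", "The Thing", "The Matrix",
    "Titanic", "A Bug's Life", "The Apartment",
]
print(stop_word_candidates(titles, max_idf=1.0))
```

With these six sample titles, "the" appears in four of them and is the only word below the threshold. A frequent word that is not an article would be flagged the same way, so the resulting list is best reviewed before it is used to strip title prefixes.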

丶情人眼里出诗心の 2024-08-31 17:27:14

Basically, the way to do this is to have an extra column for sorting. If you have a movie table with a name column, add another column called sort_name. It should contain the movie title in lowercase, with any words that you want to ignore removed from the front (e.g. "the", "a").

Don't try to do this dynamically.

Whenever the name field is updated, you'll also have to update the sort_name column. You can rebuild it at any time, and you will of course have to index it. Then just do:

SELECT *
FROM movies
WHERE sort_name LIKE 'a%'
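
A minimal sketch of keeping such a sort_name column up to date follows. It uses Python with an in-memory SQLite database as a runnable stand-in for PHP and MySQL, which is an assumption of this example, as are the helper names (`make_sort_name`, `add_movie`, `rebuild_sort_names`, `movies_for_letter`), the `id` column, the index name and the list of ignored words.

```python
import re
import sqlite3

# Words to drop from the front of a title; an assumed list.
ARTICLES = re.compile(r"^(?:the|an|a)\s+", re.IGNORECASE)


def make_sort_name(name):
    """Lowercase title with one leading article removed."""
    return ARTICLES.sub("", name.strip(), count=1).lower()


def add_movie(conn, name):
    # Compute sort_name once, at write time, instead of in every query.
    conn.execute(
        "INSERT INTO movies (name, sort_name) VALUES (?, ?)",
        (name, make_sort_name(name)),
    )


def rebuild_sort_names(conn):
    # Recompute the whole column, e.g. after changing the article list.
    rows = conn.execute("SELECT id, name FROM movies").fetchall()
    conn.executemany(
        "UPDATE movies SET sort_name = ? WHERE id = ?",
        [(make_sort_name(name), movie_id) for movie_id, name in rows],
    )


def movies_for_letter(conn, letter):
    rows = conn.execute(
        "SELECT name FROM movies WHERE sort_name LIKE ? ORDER BY sort_name",
        (letter.lower() + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, sort_name TEXT NOT NULL)"
)
conn.execute("CREATE INDEX idx_movies_sort_name ON movies (sort_name)")
for name in ["The Mummy", "Memento", "A Bug's Life", "Titanic"]:
    add_movie(conn, name)

print(movies_for_letter(conn, "M"))
```

Since the filter is a plain prefix match on an indexed column, the lookup stays simple, and the same column gives an article-free ordering for any other listing.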
