How to ignore a leading "The" when searching with PHP and MySQL
I am building an "alphabetical search" module for a project.
It looks like this:
A B C D E F ... Z
When I click on "A", the results should list the titles that start with "A". The same goes for every other letter.
My problem is as follows.
For example, there is a film named "The Mummy".
At the moment, this film is listed when the user clicks on the letter "T".
But my client's requirement is that "The Mummy" must be listed when the user clicks on "M", not "T",
because "a", "an" and "the" are articles and carry no meaning of their own.
I hope that makes the problem clear.
Any help would be much appreciated.
Thanks in advance.
Assuming that you do not wish to modify the content of the table (and hence accept slightly less efficient queries), this can be handled in the query itself by combining `LIKE` conditions on the title.
(If you are free to modify the table, see the suggestions at the end of this answer.)
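A minimal sketch of such a query, assuming a hypothetical `movies` table with a `title` column and "M" as the clicked letter (substitute the requested letter, ideally through a bound parameter). It relies on MySQL's default case-insensitive collation, so `'The %'` also matches titles written as "the ...":

```sql
-- Hypothetical schema: movies(id, title). 'M' stands in for the clicked letter.
SELECT id, title
FROM movies
WHERE (title LIKE 'M%' AND title NOT LIKE 'The %')  -- plain titles starting with M
   OR title LIKE 'The M%'                           -- "The M..." titles count as M
-- Sort as if the leading "The " were not there ('The ' is 4 characters long).
ORDER BY CASE
             WHEN title LIKE 'The %' THEN SUBSTRING(title, 5)
             ELSE title
         END;
```

With "T" in place of "M", the same pattern would return "Titanic" and "The Terminator" but not "The Mummy".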
Notes:
- `x` designates the desired letter (for example `LIKE 'A%'`).
- The extra `AND TITLE NOT LIKE` condition is only needed when `x` is the letter "T"; for any other letter it is redundant, but it does not change the result.
- I'm unsure whether the `[^xyz]` syntax (i.e. NOT the characters x, y or z) is supported, so `[^T]` could be replaced by its positive equivalent, say `[A-SU-Z0-9]`.
There are a few other stop words to consider ("A", "AN", "OF"...), but for book or film titles it is common practice to only consider "THE". If you must deal with the other articles, the same logic can be extended to cover them.
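One way the extension to "A" and "AN" might look, again assuming the hypothetical `movies(id, title)` table and "M" as the clicked letter. Each article needs both an exclusion in the first branch and its own inclusion branch:

```sql
-- Hypothetical schema: movies(id, title). 'M' stands in for the clicked letter.
SELECT id, title
FROM movies
WHERE (title LIKE 'M%'
       AND title NOT LIKE 'The %'
       AND title NOT LIKE 'An %'
       AND title NOT LIKE 'A %')   -- titles with no leading article
   OR title LIKE 'The M%'          -- article followed by the wanted letter
   OR title LIKE 'An M%'
   OR title LIKE 'A M%';
```

The trailing space in each pattern matters: `'A %'` excludes "A Beautiful Mind" from the "A" list without excluding "Avatar".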
There are better solutions if you can modify the table's contents. Some of these involve pre-computing one or more extra columns (and maintaining them when new records are added, etc.).
What you are really asking here is how to remove "stop words" ("the" is just one example; you will want to remove "of", "a", etc.). Trying to hard-code the set of stop words is a HUGE pain in the butt, and as your corpus changes, you'll have to change the code.
Instead, you should try to use an algorithm that deduces what the stop words are from your corpus. Algorithms for this kind of thing are well known and are employed by search engines. One that works very well is called TF/IDF.
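Full TF/IDF is more than a short snippet can show. As a much simpler stand-in for the same idea (this is a plain frequency count, not TF/IDF), the sketch below lists the words that most often open a title, assuming a hypothetical `movies` table with a `title` column in MySQL. Words near the top of the result are candidates for the stop list:

```sql
-- Hypothetical schema: movies(id, title).
-- SUBSTRING_INDEX(title, ' ', 1) returns everything before the first space,
-- or the whole title when it contains no space.
SELECT LOWER(SUBSTRING_INDEX(title, ' ', 1)) AS first_word,
       COUNT(*)                              AS titles_starting_with
FROM movies
GROUP BY first_word
ORDER BY titles_starting_with DESC
LIMIT 20;
```

The list still needs a human look before use, since a frequent opening word is not necessarily one that should be ignored.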
Basically, you do this with an extra column used for sorting. If you have a `movie` table with a `name` column, add another column called `sort_name`. It should contain the movie title in lowercase, with any leading words you want to ignore (e.g. "the", "a") removed. Don't try to do this dynamically.
When the `name` field is updated you'll also have to update the `sort_name` column. You can rebuild it at any time, and you will of course have to index it. Queries then filter and sort on `sort_name` instead of `name`.
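A sketch of how this might look in MySQL. The `movie` table and its `name` and `sort_name` columns come from the description above; the column length, the index name and the exact list of ignored words are assumptions made for illustration:

```sql
-- Add the pre-computed sort column (VARCHAR(255) is an assumed length).
ALTER TABLE movie ADD COLUMN sort_name VARCHAR(255);

-- Build or rebuild sort_name: lowercase the title and drop one leading article.
-- Relies on a case-insensitive collation so 'The %' also matches "the ...".
UPDATE movie
SET sort_name = LOWER(
    CASE
        WHEN name LIKE 'The %' THEN SUBSTRING(name, 5)
        WHEN name LIKE 'An %'  THEN SUBSTRING(name, 4)
        WHEN name LIKE 'A %'   THEN SUBSTRING(name, 3)
        ELSE name
    END);

-- Index the column so prefix lookups do not scan the whole table.
CREATE INDEX idx_movie_sort_name ON movie (sort_name);

-- Titles filed under "M": "The Mummy" is stored as "mummy" and matches here.
SELECT name
FROM movie
WHERE sort_name LIKE 'm%'
ORDER BY sort_name;
```

The same `CASE` expression would need to run whenever a row is inserted or its `name` changes, whether from the PHP code that writes the row or from a trigger.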