Case-insensitive search with eXist-db

Posted 2024-10-09 16:16:47

I am working through a final refinement requested by the client, which requires me to run a case-insensitive query. I will walk through how this simple program works.

First of all, in my Java class, I fetch and parse a web page in a fairly simple way:

title=(String)results.get("title");
doc = docBuilder.parse("http://" + server + ":" + port + "/exist/rest/db/wb/xql/media_lookup.xql?" + "&title="  + title);

This Java statement calls an XQuery file, "media_lookup.xql", which is stored on localhost, and the only parameter we pass is the string "title".

Secondly, let's take a look at that XQuery file:

$title := request:get-parameter('title',""),

$mediaNodes := doc('/db/wb/portfolio/media_data.xml'),
$query := $mediaNodes//media[contains(title,$title)],

Then it evaluates that query. This XQuery gets the "title" parameter that is passed from our Java class and queries the "media_data" XML file stored in the database, which contains a bunch of media nodes, each with a 'title' element. As you may expect, this simple query matches those media nodes whose 'title' element contains the value of $title as a substring. So if our 'title' is "Chi", it will return media nodes whose title may be "Chicago" or "Chicken".

The refinement requested by the client is that the match should NOT be case-sensitive. The intuitive way is to modify the XQuery statement to use the lower-case() function, like:

$query := $mediaNodes//media[contains(lower-case(title/text()), lower-case($title))],

However, here is the problem: this modified query makes my machine run out of memory. Since my "media_data.xml" is quite huge and contains thousands of millions of media nodes, I assume the lower-case() function runs on each of the entries, thus causing the machine to crash.

I've talked with some experienced XQuery programmers, and they think I should use an index to solve this problem, which I will definitely look into. But before that, I am posting the problem here to get other ideas or suggestions. Do you think any other approach may help? For example, could I tweak the Java parse statement to achieve case-insensitivity? I think I have seen some people do string concatenation using "contains." in Java before passing the string to the server.

Any ideas or help are welcome.

Comments (2)

冷情妓 2024-10-16 16:16:47

The refinement request posted by the
client is that there should be NO
case-sensitivity. The very intuitive
way is to modify the XQuery statement
by using a lower-case function in it,
like:

$query := $mediaNodes//media
            [contains(lower-case(title/text()), lower-case($title))],

However, the question comes: this
modified query will run my machine
into memory overflow. Since my
"media_data.xml" is quite huge and
contains thousands of millions of media
nodes, I assume the lower-case()
function will run on each of the
entries, thus causing the machine to
crash.

Such fears are not justified.

Any sane implementation of XPath uses automatic memory for its functions. This means that the memory required for evaluating a particular predicate, including the result of lower-case(), is either freed (in languages with no garbage collection) or becomes unreferenced and ready for garbage collection immediately after the predicate has been evaluated.

殤城〤 2024-10-16 16:16:47

A table index is probably not the solution, as the absence of an index will slow things down but will not trigger a memory overflow.

I think your best bet is to duplicate the title in your database, copying it into an all-lowercase version (or uppercase, which makes it clearer that it was converted), and to query the alternate title while presenting the normal title.
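As a rough illustration of the duplicated-title idea, here is a minimal Java sketch that adds a lower-cased copy of each title to a parsed document. The element name title_lc, the class name, and the helper methods are assumptions made for this example and do not come from the original post. For a file as large as the one described, loading everything into a DOM may not be practical, so the same one-time conversion would more likely be done inside the database; the sketch only shows the shape of the data afterwards.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: names are illustrative, not from the original post.
final class LowerCaseTitles {

    // Adds a <title_lc> child (assumed name) holding the lower-cased title
    // to every <media> element that has a <title>. Returns how many were added.
    static int addLowerCaseTitles(Document doc) {
        NodeList mediaNodes = doc.getElementsByTagName("media");
        int added = 0;
        for (int i = 0; i < mediaNodes.getLength(); i++) {
            Element media = (Element) mediaNodes.item(i);
            NodeList titles = media.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                continue; // nothing to copy for this node
            }
            Element lower = doc.createElement("title_lc");
            // Locale.ROOT keeps the conversion independent of the JVM's default locale.
            lower.setTextContent(titles.item(0).getTextContent().toLowerCase(Locale.ROOT));
            media.appendChild(lower);
            added++;
        }
        return added;
    }

    // Parses an XML string; checked exceptions are wrapped to keep callers simple.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<portfolio><media><title>Chicago</title></media></portfolio>");
        addLowerCaseTitles(doc);
        System.out.println(doc.getElementsByTagName("title_lc").item(0).getTextContent());
    }
}
```

With such a copy in place, the query would compare the lower-cased search term against title_lc and still return the original title for display.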

To save some processing, you can do the case conversion of the search term ($title in your query) before running the query.

You can drop the ampersand in your URL; I'm not sure all web servers parse the ?& correctly.
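The last two suggestions can be sketched together on the Java side. This is only a sketch under assumptions: the class and method names are made up for illustration, and port 8080 in the usage line is a placeholder. It lower-cases the search term once before sending it, percent-encodes it, and builds the query string without the stray ampersand. Note that this only normalizes the search term; the stored titles still need a lower-cased form to compare against.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: the class and method names are not from the original post.
final class MediaLookupUrl {

    static String build(String server, int port, String title) {
        // Lower-case once on the client, with a fixed locale so the result
        // does not depend on the JVM's default locale.
        String normalized = title.toLowerCase(Locale.ROOT);
        // Percent-encode so spaces or '&' inside a title cannot break the query string.
        // URLEncoder.encode(String, Charset) requires Java 10 or later.
        String encoded = URLEncoder.encode(normalized, StandardCharsets.UTF_8);
        // "?title=" directly, without the "?&" of the original statement.
        return "http://" + server + ":" + port
                + "/exist/rest/db/wb/xql/media_lookup.xql?title=" + encoded;
    }

    public static void main(String[] args) {
        // Port 8080 is a placeholder value for this example.
        System.out.println(build("localhost", 8080, "Chi & Co"));
    }
}
```

The resulting string can be passed to docBuilder.parse(...) in place of the concatenation shown in the question.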
