Splitting a string into meaningful words

Posted 2024-11-28 02:31:35 · 241 characters · 0 views · 0 comments

I am developing an application in Java that parses an XML file, retrieves keywords from it, and stores them in my database. Users can then search for these keywords and retrieve the related data.

Now the problem is that the XML file has words like "literacy_male", "infantmortalityrate_female", etc. For the first one I can split the word at "_" before storing, but for the second one I am not sure how I can split the word into meaningful words.

I am using Apache Lucene to do the full-text search.

Comments (6)

锦上情书 2024-12-05 02:31:35

One possibility is to increase the index size by adding all substrings of each string. So for "abc" you will store: "a", "b", "c", "ab", "bc", "abc" (that's O(n^2) strings).
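A minimal sketch of the substring expansion in plain Java, independent of Lucene. The class and method names (`SubstringTerms`, `allSubstrings`) are made up for illustration; the returned set is what you would hand to whatever indexing code you already have.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: expands a keyword into all of its substrings.
final class SubstringTerms {

    // Returns every distinct non-empty substring of s, shortest first.
    // A string of length n has at most n * (n + 1) / 2 of them.
    static Set<String> allSubstrings(String s) {
        Set<String> out = new LinkedHashSet<>();
        for (int len = 1; len <= s.length(); len++) {
            for (int start = 0; start + len <= s.length(); start++) {
                out.add(s.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(allSubstrings("abc")); // [a, b, c, ab, bc, abc]
    }
}
```

For a 14-character keyword such as "lifeexpectancy" this yields up to 105 terms, which is the index growth the answer warns about.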

Another possibility is to use wildcards. Index whatever you have, and search for:

<term>*, a*<term>*, ..., z*<term>* instead of <term>. This will take a LOT more time, but it will not increase the index size.

Note: it is necessary to search for so many terms because you CANNOT use a wildcard as the first character of a term.

a*<term>* means: search for all terms that start with a, followed by zero or more characters, then <term>, then zero or more characters again.

More info about terms and wildcards in Lucene: http://lucene.apache.org/java/2_0_0/queryparsersyntax.html

EDIT:

A combination of those will provide (in my opinion) the best solution:

Index all suffixes of the string, and then for each term (and not the whole query!), search for <term>* instead of <term>. If the term exists as a substring, it is also a prefix of at least one suffix, so the search will find it.

For example, if you have "lifeexpectancy", you will index:
"lifeexpectancy", "ifeexpectancy", "feexpectancy", "eexpectancy", ..., "y"

For the same example, when you want to search for life expectancy, you will search for life* expectancy*
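The suffix idea can be sketched in plain Java as follows. This is only an illustration under assumptions: `SuffixTerms`, `suffixes`, `toPrefixQuery` and `matches` are hypothetical names, and `matches` is a stand-in for what a prefix query would do against the indexed terms, not a Lucene call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper for the "index all suffixes, search by prefix" scheme.
final class SuffixTerms {

    // Every suffix of the keyword, longest first; these are the terms to index.
    static List<String> suffixes(String keyword) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < keyword.length(); i++) {
            out.add(keyword.substring(i));
        }
        return out;
    }

    // Appends * to each whitespace-separated term: "life expectancy" -> "life* expectancy*".
    static String toPrefixQuery(String userQuery) {
        StringJoiner query = new StringJoiner(" ");
        for (String term : userQuery.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                query.add(term + "*");
            }
        }
        return query.toString();
    }

    // Stand-in for a prefix match: true if any indexed term starts with the given term.
    static boolean matches(List<String> indexedTerms, String term) {
        for (String indexed : indexedTerms) {
            if (indexed.startsWith(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> index = suffixes("lifeexpectancy");
        System.out.println(toPrefixQuery("life expectancy")); // life* expectancy*
        System.out.println(matches(index, "expectancy"));     // true
    }
}
```

This stores n terms per keyword instead of O(n^2), and "expectancy" is found because it is a prefix of the suffix "expectancy" that starts at position 4 of "lifeexpectancy".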

半仙 2024-12-05 02:31:35

There's no purely algorithmic way to accomplish your goal, nor is there a way to do it with high reliability. You'd basically need a dictionary of "meaningful" words, and you'd "peel" each word off a long combo by searching the dictionary for the longest word that is a prefix of what remains. But you can run amok if, e.g., you have "workmanhours" and you parse it into "workman" "hours" when it maybe should be "work" "man" "hours".
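The peeling procedure described above could look roughly like this. It is a sketch, assuming a small in-memory dictionary; `GreedySplitter` and the sample word list are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical greedy splitter: repeatedly peels off the longest dictionary word.
final class GreedySplitter {

    // Returns the peeled words, or an empty list if some position has no matching word.
    static List<String> split(String text, Set<String> dictionary) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            // Try the longest candidate first and shrink until the dictionary knows it.
            int end = text.length();
            while (end > pos && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            if (end == pos) {
                return List.of(); // stuck: no dictionary word starts at pos
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("work", "workman", "man", "hours");
        System.out.println(split("workmanhours", dictionary)); // [workman, hours]
    }
}
```

The output reproduces the ambiguity from the answer: the greedy choice commits to "workman" and never considers "work" "man" "hours". It also never backtracks, so it can report failure on a string that a different first choice would have split cleanly.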

You could possibly finesse your search scheme by indexing selected character sequences rather than words. E.g., build an index of all sequences that start with a leading vowel, and then similarly strip your search terms down to a leading vowel.

讽刺将军 2024-12-05 02:31:35

You'll need to set some rules about how the XML file must be formatted in order to get this working.

I guess you can't manipulate the XML file (or it is already created and populated)?

If you can (or it's being generated by your code), you'll need to set some rules like

  • Keywords are separated by a ,
  • Keywords have no spaces but use _ instead

With these rules, you'll be able to write a parser which can make sense of your keyword strings.

If you can't do that, you'll need to parse each keyword, try the different parsings (like "split by _"), and see which one gives the best output. But this will be challenging and will take time.

Please also add a sample of your XML file to your original question.

止于盛夏 2024-12-05 02:31:35

Computers are not intelligent; they understand only what you tell them. So it would be easier if you maintained some standard while generating your XML file. Otherwise I don't think there is any way to convert "infantmortalityrate" into "infant+mortality+rate".

近箐 2024-12-05 02:31:35

If you had a database of the strings that can be contained in that string, you could do this:

Split the string by the separators you can identify (like _ , - ...). After that, each part can be broken into only as many pieces as the length of the shortest string in the DB allows.

For example, if you have a string of 10 chars and the shortest string in the DB is 4 chars, you can get these combos:

4,6
5,5
6,4
10

but no 4,4,2 or anything like that.

After that you can look up each part in the DB, and if every part is present you can say the string is divided into "meaningful words".

But without that database, or with too generic a dictionary, you can get stuck on this, or it could be almost impossible.
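One way to read the steps above as code, as a sketch only: the "DB" is modelled as an in-memory `Set<String>`, and `BoundedSplitter`, `candidateSplits` and `knownSplits` are hypothetical names. It enumerates every cut whose parts are at least as long as the shortest known string, then keeps the cuts where every part is known.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: enumerate length-bounded splits, then check them against a word set.
final class BoundedSplitter {

    // All ways to cut text into consecutive parts of at least minLen characters each.
    static List<List<String>> candidateSplits(String text, int minLen) {
        if (minLen < 1) {
            throw new IllegalArgumentException("minLen must be at least 1");
        }
        List<List<String>> out = new ArrayList<>();
        collect(text, 0, minLen, new ArrayList<>(), out);
        return out;
    }

    private static void collect(String text, int pos, int minLen,
                                List<String> current, List<List<String>> out) {
        if (pos == text.length()) {
            if (!current.isEmpty()) {
                out.add(new ArrayList<>(current));
            }
            return;
        }
        for (int end = pos + minLen; end <= text.length(); end++) {
            current.add(text.substring(pos, end));
            collect(text, end, minLen, current, out);
            current.remove(current.size() - 1); // undo and try a longer part
        }
    }

    // Keeps only the candidate splits in which every part is a known string.
    static List<List<String>> knownSplits(String text, Set<String> db) {
        List<List<String>> out = new ArrayList<>();
        if (db.isEmpty()) {
            return out;
        }
        int minLen = Integer.MAX_VALUE;
        for (String word : db) {
            minLen = Math.min(minLen, word.length());
        }
        for (List<String> candidate : candidateSplits(text, minLen)) {
            if (db.containsAll(candidate)) {
                out.add(candidate);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 10 characters, shortest known string has 4: candidates are 4+6, 5+5, 6+4 and 10.
        System.out.println(candidateSplits("infantrate", 4).size());             // 4
        System.out.println(knownSplits("infantrate", Set.of("infant", "rate"))); // [[infant, rate]]
    }
}
```

The number of candidates grows quickly for long strings with a short minimum length, so in practice you would probably prune during enumeration rather than filter afterwards.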

﹂绝世的画 2024-12-05 02:31:35

Yes, it is possible to split a string into words even if there are no split characters. This can be solved pretty efficiently, in close to O(n). Consider using a prefix-string regular expression and extracting word by word from your string. You can check this tool as well: http://code.google.com/p/graph-expression/wiki/RegexpOptimization

There is a more robust approach (more effective because it uses global optimisation rather than local, as the previous one does): a spell-checking automaton that searches for the most probable split of the string. Check this tutorial on how it's done for Chinese word strings: http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html
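To illustrate what "global optimisation" buys you, here is a simplified stand-in, not the spell-checking automaton from the LingPipe tutorial: a dynamic-programming search that picks the split with the lowest total cost over the whole string. The per-word costs are an assumption (for instance, negative log frequencies from some corpus), and `BestSplit` and the numbers in the example are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical lowest-total-cost segmentation via dynamic programming.
final class BestSplit {

    // cost maps each known word to a non-negative cost (lower = more likely).
    // Returns the cheapest full segmentation, or an empty list if none exists.
    static List<String> split(String text, Map<String, Double> cost) {
        int n = text.length();
        double[] best = new double[n + 1]; // best[i]: cheapest cost of segmenting text[0, i)
        int[] prev = new int[n + 1];       // prev[i]: start index of the last word in that segmentation
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                Double c = cost.get(text.substring(start, end));
                if (c != null && best[start] + c < best[end]) {
                    best[end] = best[start] + c;
                    prev[end] = start;
                }
            }
        }
        if (Double.isInfinite(best[n])) {
            return List.of();
        }
        // Walk the back-pointers from the end of the string to recover the words.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        // Made-up costs: "workman" is treated as much rarer than "work" and "man".
        Map<String, Double> cost = Map.of("work", 1.0, "man", 1.0, "hours", 1.0, "workman", 5.0);
        System.out.println(split("workmanhours", cost)); // [work, man, hours]
    }
}
```

With these costs the result is "work" "man" "hours" (total 3.0) rather than "workman" "hours" (total 6.0); which split wins depends entirely on the costs you supply. This sketch examines every start/end pair, so it is quadratic in the number of dictionary lookups, not the near-O(n) behaviour mentioned above.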
