如何将字符串拆分为单词。例如:“stringintowords” -> 《串成文字》?

发布于 2024-09-13 20:54:54 字数 239 浏览 14 评论 0原文

将字符串拆分为单词的正确方法是什么? (字符串不包含任何空格或标点符号)

例如:“stringintowords”-> “串成单词”

您能建议这里应该使用什么算法吗?

!更新:对于那些认为这个问题只是出于好奇的人。该算法可用于驼峰域名(“sportandfishing .com”->“SportAndFishing .com”),并且 aboutus dot org 目前使用该算法来动态执行此转换。

What is the right way to split a string into words ?
(string doesn't contain any spaces or punctuation marks)

For example: "stringintowords" -> "String Into Words"

Could you please advise what algorithm should be used here ?

! Update: For those who think this question is just for curiosity. This algorithm could be used to camеlcase domain names ("sportandfishing .com" -> "SportAndFishing .com") and this algo is currently used by aboutus dot org to do this conversion dynamically.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(10

得不到的就毁灭 2024-09-20 20:54:54

假设您有一个函数 isWord(w),它使用字典检查 w 是否是一个单词。为了简单起见,我们现在还假设您只想知道对于某些单词 w 是否可以进行这样的拆分。这可以通过动态规划轻松完成。

S[1..length(w)] 为包含布尔条目的表。如果单词 w[1..i] 可以拆分,则 S[i] 为 true。然后设置S[1] = isWord(w[1])for i=2length(w)计算

S[i] =(isWord[w[1..i] 或对于 {2..i} 中的任何 j:S[j-1] 和 isWord[j..i])。

如果字典查询是常数时间,则这需要 O(length(w)^2) 时间。要真正找到分裂,只需将获胜的分裂存储在每个设置为 true 的 S[i] 中。这也可以适用于通过存储所有此类拆分来枚举所有解决方案。

Let's assume that you have a function isWord(w), which checks if w is a word using a dictionary. Let's for simplicity also assume for now that you only want to know whether for some word w such a splitting is possible. This can be easily done with dynamic programming.

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the word w[1..i] can be split. Then set S[1] = isWord(w[1]) and for i=2 to length(w) calculate

S[i] = (isWord[w[1..i] or for any j in {2..i}: S[j-1] and isWord[j..i]).

This takes O(length(w)^2) time, if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solution by storing all such splits.

离笑几人歌 2024-09-20 20:54:54

正如这里许多人提到的,这是一个标准的、简单的动态规划问题:Falk Hüffner 给出了最佳解决方案。不过,附加信息:

(a) 您应该考虑使用 trie 实现 isWord,如果使用得当(即通过逐步测试单词),这将为您节省大量时间。

(b) 输入“分段动态规划”会产生大量更详细的答案,这些答案来自带有伪代码算法的大学级别讲座,例如 杜克大学的这个讲座(甚至提供了一种简单的概率方法来处理当您遇到无法理解的单词时该怎么办)包含在任何字典中)。

As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given by Falk Hüffner. Additional info though:

(a) you should consider implementing isWord with a trie, which will save you a lot of time if you use properly (that is by incrementally testing for words).

(b) typing "segmentation dynamic programming" yields a score of more detail answers, from university level lectures with pseudo-code algorithm, such as this lecture at Duke's (which even goes so far as to provide a simple probabilistic approach to deal with what to do when you have words that won't be contained in any dictionary).

凉城已无爱 2024-09-20 20:54:54

学术文献中应该有不少这方面的内容。您要搜索的关键词是 分词.例如,这篇论文看起来很有前景。

一般来说,您可能想了解马尔可夫模型维特比算法。后者是一种动态编程算法,可以让您找到字符串的合理分段,而无需详尽地测试每个可能的分段。这里的基本见解是,如果前 m 个字符有 n 个可能的分割,并且您只想找到最可能的分割,则不需要针对后续字符评估其中的每一个 - 您只需要继续评估最有可能的一个。

There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.

In general, you'll probably want to learn about markov models and the viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters - you only need to continue evaluating the most likely one.

十六岁半 2024-09-20 20:54:54

如果您想确保正确执行此操作,则必须使用基于字典的方法,但效率会非常低。您还必须期望从算法中收到多个结果。

例如:windowsteambloghttp://windowsteamblog.com/ 名气)

  • windows 团队 博客
  • 窗口 steam 博客

If you want to ensure that you get this right, you'll have to use a dictionary based approach and it'll be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.

For example: windowsteamblog (of http://windowsteamblog.com/ fame)

  • windows team blog
  • window steam blog
╰つ倒转 2024-09-20 20:54:54

考虑给定字符串可能分裂的绝对数量。如果字符串中有 n 个字符,则有 n-1 个可能的位置可以拆分。例如,对于字符串 cat,您可以在 a 之前拆分,也可以在 t 之前拆分。这会导致 4 种可能的分裂。

您可以将此问题视为选择需要拆分字符串的位置。您还需要选择有多少个拆分。因此存在 Sum(i = 0 to n - 1, n - 1 select i) 可能的分裂。根据二项式系数定理,x和y均为1,这等于pow( 2、n-1)。

当然,很多计算都依赖于常见的子问题,因此动态编程可能会加快你的算法速度。我突然想到,计算一个布尔矩阵 M,这样 M[i,j] 才为真,当且仅当给定字符串从 i 到 j 的子串是一个单词时会帮助解决很多问题。少量。您仍然有指数数量的可能分段,但如果早期拆分没有形成单词,您很快就能消除分段。解决方案将是一个整数序列 (i0, j0, i1, j1, ...),条件是 j sub k = i sub (k + 1)

如果您的目标是正确的驼峰式 URL,我会回避这个问题并采取更直接的方法:获取 URL 的主页,从源 HTML 中删除所有空格和大写字母,然后搜索您的字符串。如果存在匹配,则在原始 HTML 中找到该部分并返回。您需要一个 NumSpaces 数组来声明原始字符串中出现了多少空白,如下所示:

Needle:       isashort    
Haystack:     This is a short phrase    
Preprocessed: thisisashortphrase   
NumSpaces   : 000011233333444444 

您的答案将来自:

location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)

当然,如果 madduckets.com 主页上的某处没有“Mad Duckets”,这就会中断。唉,这就是避免指数问题所付出的代价。

Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.

You could look at this problem as choosing where you need to split the string. You also need to choose how many splits there will be. So there are Sum(i = 0 to n - 1, n - 1 choose i) possible splittings. By the Binomial Coefficient Theorem, with x and y both being 1, this is equal to pow(2, n-1).

Granted, a lot of this computation rests on common subproblems, so Dynamic Programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i0, j0, i1, j1, ...) with the condition that j sub k = i sub (k + 1).

If your goal is correctly camel case URL's, I would sidestep the problem and go for something a little more direct: Get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array of NumSpaces that declares how much whitespace occurs in the original string like so:

Needle:       isashort    
Haystack:     This is a short phrase    
Preprocessed: thisisashortphrase   
NumSpaces   : 000011233333444444 

And your answer would come from:

location = prepocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
originalLength = Needle.length() + NumSpaces[location + needle.length()] - NumSpaces[location]
Haystack.substring(locationInOriginal, originalLength)

Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.

安静 2024-09-20 20:54:54

这实际上可以在没有字典的情况下(在某种程度上)完成。本质上,这是一个无监督分词问题。您需要收集大量域名,应用无监督分段学习算法(例如 Morfessor)并将学习到的模型应用于新域名。不过,我不确定它的效果如何(但这会很有趣)。

This can be actually done (to a certain degree) without dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor) and apply the learned model for new domain names. I'm not sure how well it would work, though (but it would be interesting).

一直在等你来 2024-09-20 20:54:54

这基本上是背包问题的变体,所以你需要的是一个全面的单词列表以及 Wiki 中涵盖的任何解决方案。

对于相当大小的字典,这将是极其耗费资源和冗长的操作,您甚至无法确定这个问题是否会得到解决。

This is basically a variation of a knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.

With fairly-sized dictionary this is going to be insanely resource-intensive and lengthy operation, and you cannot even be sure that this problem will be solved.

给不了的爱 2024-09-20 20:54:54

创建可能的单词列表,将其从长单词到短单词排序。

检查列表中的每个条目是否与字符串的第一部分相对应。如果相等,请将其删除并在句子中附加一个空格。重复此操作。

Create a list of possible words, sort it from long words to short words.

Check if each entry in the list against the first part of the string. If it equals, remove this and append it at your sentence with a space. Repeat this.

楠木可依 2024-09-20 20:54:54

一个简单的 Java 解决方案,运行时间为 O(n^2)。

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}

A simple Java solution which has O(n^2) running time.

public class Solution {
    // should contain the list of all words, or you can use any other data structure (e.g. a Trie)
    private HashSet<String> dictionary;

    public String parse(String s) {
        return parse(s, new HashMap<String, String>());
    }

    public String parse(String s, HashMap<String, String> map) {
        if (map.containsKey(s)) {
            return map.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightSub = s.substring(left);
            String rightParsed = parse(rightSub, map);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                map.put(s, parsed);
                return parsed;
            }
        }
        map.put(s, null);
        return null;
    }
}
岁吢 2024-09-20 20:54:54

我正在研究这个问题,想也许我可以分享我是如何做到的。
用语言解释我的算法有点困难,所以也许我可以用伪代码分享我的优化解决方案:

string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/** this way, one does not check the dictionary to check for word validity 
 *  on every substring; It would only be queried once and for all, 
 *  eliminating multiple travels to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();

validwords = validwords.sort(length, desc);

array segments = [];
while(mainword != ""){
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(v);
            x = 0;
        }
    }

    /**
     * remove the first character if any of valid words do not match, then start again
     * you may need to add the first character to the result if you want to
     */
    mainword = mainword.substring(1);
}

string result = segments.join(" ");

I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:

string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/** this way, one does not check the dictionary to check for word validity 
 *  on every substring; It would only be queried once and for all, 
 *  eliminating multiple travels to the data storage
 */
string query = "select word from dictionary where word in " + substrings;
array validwords = execute(query).getArray();

validwords = validwords.sort(length, desc);

array segments = [];
while(mainword != ""){
    for(x = 0; x < validwords.length; x++){
        if(mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            mainword = mainword.remove(v);
            x = 0;
        }
    }

    /**
     * remove the first character if any of valid words do not match, then start again
     * you may need to add the first character to the result if you want to
     */
    mainword = mainword.substring(1);
}

string result = segments.join(" ");
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文