How to split a string into words. Ex: "stringintowords" -> "String Into Words"?

Posted 2024-09-13 20:54:54

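What is the right way to split a string into words? (The string doesn't contain any spaces or punctuation marks.)

For example: "stringintowords" -> "String Into Words"

Could you please advise what algorithm should be used here?

Update: For those who think this question is just for curiosity: the algorithm can be used to camel-case domain names ("sportandfishing .com" -> "SportAndFishing .com"), and aboutus dot org currently uses it to do this conversion dynamically.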

得不到的就毁灭 2024-09-20 20:54:54

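Let's assume you have a function isWord(w) that checks whether w is a word using a dictionary. For simplicity, let's also assume for now that you only want to know whether such a splitting is possible for some word w. This can be done easily with dynamic programming.

Let S[1..length(w)] be a table with Boolean entries. S[i] is true if the prefix w[1..i] can be split into words. Then set S[1] = isWord(w[1]) and for i = 2 to length(w) calculate

S[i] = isWord(w[1..i]) or (for some j in {2..i}: S[j-1] and isWord(w[j..i])).

This takes O(length(w)^2) time if dictionary queries are constant time. To actually find the splitting, just store the winning split in each S[i] that is set to true. This can also be adapted to enumerate all solutions by storing all such splits.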
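
The recurrence above could be sketched in Java roughly as follows. This is only an illustration, not the answerer's own code: the class name WordBreakDp, the method segment, and the use of a Set<String> in place of isWord are assumptions, and indices are 0-based with half-open ranges rather than 1-based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

final class WordBreakDp {
    // Returns one segmentation of w into dictionary words, or null if none exists.
    static List<String> segment(String w, Set<String> dictionary) {
        int n = w.length();
        // splitAt[i]: start index of the last word in some valid split of w[0..i),
        // or -1 if that prefix cannot be split. splitAt[0] = 0 marks the empty prefix.
        int[] splitAt = new int[n + 1];
        Arrays.fill(splitAt, -1);
        splitAt[0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                if (splitAt[j] >= 0 && dictionary.contains(w.substring(j, i))) {
                    splitAt[i] = j; // store the "winning" split for this prefix
                    break;
                }
            }
        }
        if (splitAt[n] < 0) {
            return null;
        }
        // Walk the stored splits backwards to recover the words.
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = splitAt[i]) {
            words.add(w.substring(splitAt[i], i));
        }
        Collections.reverse(words);
        return words;
    }
}
```

Storing only one split per prefix yields a single segmentation; keeping every valid j per prefix instead would allow all of them to be enumerated.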

离笑几人歌 2024-09-20 20:54:54

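As many people here have mentioned, this is a standard, easy dynamic programming problem; the best solution is the one given by Falk Hüffner. Some additional information, though:

(a) You should consider implementing isWord with a trie, which will save you a lot of time if used properly (that is, by testing for words incrementally).

(b) Searching for "segmentation dynamic programming" yields a number of more detailed answers from university-level lectures with pseudocode algorithms, such as this lecture at Duke (which even goes so far as to provide a simple probabilistic approach for dealing with words that are not contained in any dictionary).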
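
Point (a) could look something like the sketch below. The class TrieWordBreak and its methods add and canSegment are hypothetical names chosen for illustration; the idea is that the trie is walked one character at a time from each reachable position, so the scan stops as soon as no dictionary word can match.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieWordBreak {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Inserts a dictionary word into the trie.
    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // True if s can be written as a concatenation of words added to the trie.
    boolean canSegment(String s) {
        int n = s.length();
        boolean[] reachable = new boolean[n + 1]; // reachable[i]: s[0..i) can be split
        reachable[0] = true;
        for (int start = 0; start < n; start++) {
            if (!reachable[start]) {
                continue;
            }
            // Extend the candidate word character by character; give up once
            // no dictionary word has s[start..end] as a prefix.
            Node node = root;
            for (int end = start; end < n; end++) {
                node = node.children.get(s.charAt(end));
                if (node == null) {
                    break;
                }
                if (node.isWord) {
                    reachable[end + 1] = true;
                }
            }
        }
        return reachable[n];
    }
}
```

Compared with hashing every substring, each inner step here costs one map lookup on a single character, and dead ends are abandoned early.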

凉城已无爱 2024-09-20 20:54:54

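There should be a fair bit in the academic literature on this. The key words you want to search for are word segmentation. This paper looks promising, for example.

In general, you'll probably want to learn about Markov models and the Viterbi algorithm. The latter is a dynamic programming algorithm that may allow you to find plausible segmentations for a string without exhaustively testing every possible segmentation. The essential insight here is that if you have n possible segmentations for the first m characters, and you only want to find the most likely segmentation, you don't need to evaluate every one of these against subsequent characters; you only need to continue evaluating the most likely one.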
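
As a rough illustration of that insight, here is a Viterbi-style sketch under a simple unigram assumption (each word is scored independently of its neighbours, which is a simplification of a full Markov model). The class UnigramSegmenter, the method mostLikely, and the wordProb map of word probabilities are all hypothetical; real probabilities would have to come from a corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class UnigramSegmenter {
    // Returns the segmentation with the highest product of word probabilities,
    // or null if s cannot be split into known words.
    static List<String> mostLikely(String s, Map<String, Double> wordProb) {
        int n = s.length();
        double[] best = new double[n + 1]; // best[i]: best log-probability for s[0..i)
        int[] back = new int[n + 1];       // back[i]: start of the last word in that split
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 0; j < i; j++) {
                Double p = wordProb.get(s.substring(j, i));
                if (p == null || best[j] == Double.NEGATIVE_INFINITY) {
                    continue;
                }
                // Only the single best split of s[0..j) is carried forward.
                double score = best[j] + Math.log(p);
                if (score > best[i]) {
                    best[i] = score;
                    back[i] = j;
                }
            }
        }
        if (best[n] == Double.NEGATIVE_INFINITY) {
            return null;
        }
        List<String> words = new ArrayList<>();
        for (int i = n; i > 0; i = back[i]) {
            words.add(s.substring(back[i], i));
        }
        Collections.reverse(words);
        return words;
    }
}
```

Log-probabilities are summed instead of multiplying raw probabilities to avoid underflow on long inputs.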

十六岁半 2024-09-20 20:54:54

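If you want to be sure you get this right, you'll have to use a dictionary-based approach, and it will be horrendously inefficient. You'll also have to expect to receive multiple results from your algorithm.

For example: windowsteamblog (of http://windowsteamblog.com/ fame)

windows team blog
window steam blog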
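
To see the ambiguity concretely, a backtracking sketch like the one below lists every segmentation a dictionary allows. The class AllSegmentations and its methods are names made up for this example, and the dictionary passed in is assumed to be a plain Set<String>. The number of results can grow exponentially with the input length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class AllSegmentations {
    // Returns every way to split s into dictionary words, each joined by spaces.
    static List<String> all(String s, Set<String> dictionary) {
        List<String> results = new ArrayList<>();
        collect(s, 0, dictionary, new ArrayList<>(), results);
        return results;
    }

    private static void collect(String s, int start, Set<String> dictionary,
                                List<String> current, List<String> results) {
        if (start == s.length()) {
            results.add(String.join(" ", current));
            return;
        }
        // Try every dictionary word that begins at position start.
        for (int end = start + 1; end <= s.length(); end++) {
            String word = s.substring(start, end);
            if (dictionary.contains(word)) {
                current.add(word);
                collect(s, end, dictionary, current, results);
                current.remove(current.size() - 1); // backtrack
            }
        }
    }
}
```

With a dictionary containing window, windows, team, steam and blog, this returns both readings of windowsteamblog shown above.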

╰つ倒转 2024-09-20 20:54:54

Consider the sheer number of possible splittings for a given string. If you have n characters in the string, there are n-1 possible places to split. For example, for the string cat, you can split before the a and you can split before the t. This results in 4 possible splittings.

You could look at this problem as choosing where to split the string, which also means choosing how many splits there will be. So there are Sum(i = 0 to n - 1, (n - 1) choose i) possible splittings. By the binomial theorem, with x and y both being 1, this is equal to pow(2, n - 1).
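
Written out, the count is the binomial theorem evaluated at x = y = 1:

```latex
\sum_{i=0}^{n-1} \binom{n-1}{i}
  = \sum_{i=0}^{n-1} \binom{n-1}{i} \, 1^{i} \, 1^{\,n-1-i}
  = (1 + 1)^{n-1}
  = 2^{n-1}
```

For cat, n = 3 and 2^2 = 4, which matches the four splittings: cat, c at, ca t, c a t.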

Granted, a lot of this computation rests on common subproblems, so dynamic programming might speed up your algorithm. Off the top of my head, computing a boolean matrix M such that M[i,j] is true if and only if the substring of your given string from i to j is a word would help out quite a bit. You still have an exponential number of possible segmentations, but you would quickly be able to eliminate a segmentation if an early split did not form a word. A solution would then be a sequence of integers (i_0, j_0, i_1, j_1, ...) with the condition that j_k = i_(k+1).
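
A possible shape for that matrix, as a sketch only: WordTable, build and countSegmentations are hypothetical names, and the table uses half-open ranges so that consecutive words automatically satisfy j_k = i_(k+1). The second method is an addition for illustration that counts valid segmentations using only table lookups.

```java
import java.util.Set;

final class WordTable {
    // m[i][j] is true iff s.substring(i, j) is a dictionary word (0 <= i < j <= length).
    static boolean[][] build(String s, Set<String> dictionary) {
        int n = s.length();
        boolean[][] m = new boolean[n + 1][n + 1];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j <= n; j++) {
                m[i][j] = dictionary.contains(s.substring(i, j));
            }
        }
        return m;
    }

    // Counts how many segmentations into words exist, without enumerating them.
    static long countSegmentations(boolean[][] m) {
        int n = m.length - 1;
        long[] ways = new long[n + 1]; // ways[j]: number of valid splits of the first j characters
        ways[0] = 1;
        for (int j = 1; j <= n; j++) {
            for (int i = 0; i < j; i++) {
                if (m[i][j]) {
                    ways[j] += ways[i];
                }
            }
        }
        return ways[n];
    }
}
```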

If your goal is to correctly camel-case URLs, I would sidestep the problem and go for something a little more direct: get the homepage for the URL, remove any spaces and capitalization from the source HTML, and search for your string. If there is a match, find that section in the original HTML and return it. You'd need an array NumSpaces that records, for each character of the preprocessed string, how much whitespace precedes it in the original string, like so:

Needle:       isashort    
Haystack:     This is a short phrase    
Preprocessed: thisisashortphrase   
NumSpaces   : 000011233333444444

And your answer would come from:

location = Preprocessed.Search(Needle)
locationInOriginal = location + NumSpaces[location]
// index the last matched character so trailing whitespace is not counted
originalLength = Needle.length() + NumSpaces[location + Needle.length() - 1] - NumSpaces[location]
// substring(start, length)
Haystack.substring(locationInOriginal, originalLength)

Of course, this would break if madduckets.com did not have "Mad Duckets" somewhere on the home page. Alas, that is the price you pay for avoiding an exponential problem.
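
The string-matching part of this idea could be sketched in Java as below. Fetching and cleaning the HTML is left out, and OriginalPhraseFinder and find are hypothetical names. The sketch measures whitespace up to the last matched character, so the returned slice carries no trailing space.

```java
final class OriginalPhraseFinder {
    // Finds needle (lowercase, no whitespace) in haystack while ignoring case and
    // whitespace, and returns the matching slice of the original haystack,
    // or null if there is no match.
    static String find(String haystack, String needle) {
        StringBuilder preprocessed = new StringBuilder();
        // numSpaces[k]: whitespace characters skipped before the k-th kept character
        int[] numSpaces = new int[haystack.length()];
        int skipped = 0;
        for (int i = 0; i < haystack.length(); i++) {
            char c = haystack.charAt(i);
            if (Character.isWhitespace(c)) {
                skipped++;
            } else {
                numSpaces[preprocessed.length()] = skipped;
                preprocessed.append(Character.toLowerCase(c));
            }
        }
        int location = preprocessed.indexOf(needle);
        if (location < 0 || needle.isEmpty()) {
            return null;
        }
        int start = location + numSpaces[location];
        int lastKept = location + needle.length() - 1;
        int end = lastKept + numSpaces[lastKept] + 1; // exclusive end index in haystack
        return haystack.substring(start, end);
    }
}
```

For the example above, searching "This is a short phrase" for isashort gives "is a short".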

安静 2024-09-20 20:54:54

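This can actually be done (to a certain degree) without a dictionary. Essentially, this is an unsupervised word segmentation problem. You need to collect a large list of domain names, apply an unsupervised segmentation learning algorithm (e.g. Morfessor), and apply the learned model to new domain names. I'm not sure how well it would work, though (but it would be interesting).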

一直在等你来 2024-09-20 20:54:54

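This is basically a variation of the knapsack problem, so what you need is a comprehensive list of words and any of the solutions covered in Wiki.

With a fairly sized dictionary this is going to be an insanely resource-intensive and lengthy operation, and you cannot even be sure that the problem will be solved.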

给不了的爱 2024-09-20 20:54:54

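Create a list of possible words and sort it from long words to short words.

Check each entry in the list against the first part of the string. If it matches, remove that part from the string and append it to your sentence with a space. Repeat this.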
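
A minimal sketch of this longest-match-first procedure, with GreedySplitter and split as hypothetical names. Because it never backtracks, it can get stuck even when a valid split exists: with the words there, the and rein, the input therein is consumed as there and then fails on in, although the rein would have worked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class GreedySplitter {
    // Repeatedly strips the longest dictionary word that prefixes the remaining
    // text. Returns the words joined by spaces, or null if it gets stuck.
    static String split(String s, Set<String> dictionary) {
        List<String> words = new ArrayList<>(dictionary);
        words.removeIf(String::isEmpty); // an empty word would never consume input
        words.sort((a, b) -> Integer.compare(b.length(), a.length())); // longest first
        List<String> sentence = new ArrayList<>();
        String rest = s;
        while (!rest.isEmpty()) {
            String match = null;
            for (String word : words) {
                if (rest.startsWith(word)) {
                    match = word;
                    break;
                }
            }
            if (match == null) {
                return null; // no word fits here and greedy matching does not backtrack
            }
            sentence.add(match);
            rest = rest.substring(match.length());
        }
        return String.join(" ", sentence);
    }
}
```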

楠木可依 2024-09-20 20:54:54

A simple memoized Java solution that makes O(n^2) dictionary lookups.

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Solution {
    // All known words; any structure with fast lookup works here (e.g. a trie).
    private final Set<String> dictionary;

    public Solution(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns the words separated by spaces, or null if s cannot be split.
    public String parse(String s) {
        return parse(s, new HashMap<>());
    }

    // memo maps a suffix to its parse; a null value marks a suffix known to be unsplittable.
    private String parse(String s, Map<String, String> memo) {
        if (memo.containsKey(s)) {
            return memo.get(s);
        }
        if (dictionary.contains(s)) {
            return s;
        }
        for (int left = 1; left < s.length(); left++) {
            String leftSub = s.substring(0, left);
            if (!dictionary.contains(leftSub)) {
                continue;
            }
            String rightParsed = parse(s.substring(left), memo);
            if (rightParsed != null) {
                String parsed = leftSub + " " + rightParsed;
                memo.put(s, parsed);
                return parsed;
            }
        }
        memo.put(s, null);
        return null;
    }
}

岁吢 2024-09-20 20:54:54

I was looking at the problem and thought maybe I could share how I did it.
It's a little too hard to explain my algorithm in words so maybe I could share my optimized solution in pseudocode:

string mainword = "stringintowords";
array substrings = get_all_substrings(mainword);

/** Query the dictionary once for all substrings instead of once per
 *  substring, which avoids repeated round trips to the data store.
 *  to_sql_list() is a placeholder for whatever quoting or parameter
 *  binding your database layer provides.
 */
string query = "select word from dictionary where word in (" + to_sql_list(substrings) + ")";
array validwords = execute(query).getArray();

// longest words first, so the greedy match below prefers long words
validwords = validwords.sort(length, desc);

array segments = [];
while (mainword != "") {
    bool matched = false;
    for (x = 0; x < validwords.length; x++) {
        if (mainword.startswith(validwords[x])) {
            segments.push(validwords[x]);
            // drop the matched prefix, then rescan from the longest word
            mainword = mainword.substring(validwords[x].length);
            matched = true;
            break;
        }
    }

    /**
     * if none of the valid words match, remove the first character and start again;
     * you may want to add that character to the result instead
     */
    if (!matched) {
        mainword = mainword.substring(1);
    }
}

string result = segments.join(" ");
