Approach for unscrambling text scrambled by two-character columns

Posted on 2025-01-05 20:25:17

I have a paragraph of text scrambled by columns of two chars. The purpose of my assignment is to unscramble it:

|de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on|
|ry|  |is|th|is| b|eo|as|  |  |f |wh| o|ic| t|, |  |he|h |
|ab|  |la|pr|od|ge|ob| m|an|  |s |is|el|ti|ng|il|d |ua|c |
|he|  |ea|of|ho| m| t|et|ha|  | t|od|ds|e |ki| c|t |ng|br|
|wo|m,|to|yo|hi|ve|u | t|ob|  |pr|d |s |us| s|ul|le|ol|e |
| t|ca| t|wi| M|d |th|"A|ma|l |he| p|at|ap|it|he|ti|le|er|
|ry|d |un|Th|" |io|eo|n,|is|  |bl|f |pu|Co|ic| o|he|at|mm|
|hi|  |  |in|  |  | t|  |  |  |  |ye|  |ar|  |s |  |  |. |

My current approach to finding the right order of columns is to recursively find each column's best position according to a word-occurrence count criterion.

The pseudo-code of the algorithm's core I have in mind would be:

function unscramble(scrambledMatrix,indexOfColumnIveJustMoved)
    for each column on scrambledMatrix as currentIndex=>currentColumn
       if (currentIndex!=indexOfColumnIveJustMoved)
           maxRepeatedWords=0;maxIndex=0;
           for (i=0;i<numberOfColumnsOfScrambledMatrix;i++)
              repWordsCount=countRepWords(moveFromToOn(currentIndex,i,scrambledMatrix))
              if (maxRepeatedWords<repWordsCount)
                  maxRepeatedWords=repWordsCount;
                  maxIndex=i;
              endif
           endfor
           if (maxIndex!=currentIndex)
               return unscramble(moveFromToOn(currentIndex,maxIndex,scrambledMatrix),maxIndex); //recursive call
           endif
       endif
    endfor
    return(scrambledMatrix); //returns the unscrambled matrix;
endfunction

The algorithm stops when no column is moved after iterating over each one. I'm guessing it should work for any language (though I'm only interested in a solution for English) as long as the writing is based on words formed by letters and the sample is big enough.

Any suggestions for other approaches or improvements? I would like to know the best solution for this problem. Would a dictionary-based approach that looks for occurrences of common words work better? And would rebuilding the algorithm to avoid recursion make it much faster?
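For reference, the pseudocode above can be restated without recursion as a plain loop. The following Python is only a minimal sketch under stated assumptions: each row is a list of two-character strings, and `WORDS`, `move_column`, `render`, `score` and `unscramble` are illustrative names (the scoring here simply counts tokens found in a tiny made-up word set instead of the repeated-word count).

```python
WORDS = {"the", "cat", "sat", "on", "mat"}  # hypothetical stand-in for a real word list

def move_column(order, src, dst):
    """Return a new column order with the column at position src moved to dst."""
    order = list(order)
    order.insert(dst, order.pop(src))
    return order

def render(rows, order):
    """Join the two-character cells of every row in the given column order."""
    return ["".join(row[c] for c in order) for row in rows]

def score(rows, order):
    """Count whitespace-separated tokens that appear in WORDS (assumed criterion)."""
    return sum(tok.strip('.,"').lower() in WORDS
               for line in render(rows, order) for tok in line.split())

def unscramble(rows):
    """Greedy hill climbing: apply single-column moves while they improve the score."""
    order = list(range(len(rows[0])))
    best = score(rows, order)
    improved = True
    while improved:  # stops once a full pass moves no column
        improved = False
        for src in range(len(order)):
            for dst in range(len(order)):
                candidate = move_column(order, src, dst)
                candidate_score = score(rows, candidate)
                if candidate_score > best:  # strict improvement guarantees termination
                    order, best, improved = candidate, candidate_score, True
    return order
```

On the toy row `[["e ", "th", "t ", "ca"]]` this reorders the cells into "the cat ". Like the recursive version, it can stall in a local optimum on larger inputs, so it is a baseline rather than a complete solution.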

弃爱 2025-01-12 20:25:17

A couple more ideas:

Quotes: for each opening quote there must be a closing quote after it.
Capital letters: these usually mark the start of a sentence or a proper noun (plus any additional grammar rules that apply).
Use a dictionary small enough to fit entirely in memory, and count the number of valid words in a particular arrangement.
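The three heuristics above could be folded into one scoring function. The sketch below is one possible reading, not a tuned implementation: `WORDS` is a hypothetical stand-in for a real word list, the weights are arbitrary, and because permuting columns never changes how many quote characters there are, the quote check looks at placement (an opening quote should touch the word after it, a closing quote the word before it).

```python
import re

WORDS = {"the", "of", "is", "which", "information", "theory"}  # hypothetical word list

def quotes_plausible(text):
    """Treat quotes as alternating open/close and check each one hugs a word."""
    positions = [i for i, ch in enumerate(text) if ch == '"']
    for n, i in enumerate(positions):
        if n % 2 == 0:
            # opening quote: the next character should start a word
            if i + 1 >= len(text) or text[i + 1].isspace():
                return False
        elif text[i - 1].isspace():
            # closing quote: the previous character should end a word
            return False
    return True

def fitness(text):
    """Dictionary hits, plus a quote bonus, minus capitals found mid-word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    word_hits = sum(tok.lower() in WORDS for tok in tokens)
    mid_word_capitals = len(re.findall(r"[a-z][A-Z]", text))
    quote_bonus = 2 if quotes_plausible(text) else 0  # weight 2 is an assumption
    return word_hits + quote_bonus - mid_word_capitals
```

A candidate arrangement would be rendered to text first and then passed to `fitness`; higher is better.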

One way, although it is generally among the most time-consuming approaches, is to use a genetic algorithm.

Let's say the current default arrangement of columns is:

|de|  | f|Cl|nf|ed|au| i|ti|  |ma|ha|or|nn|ou| S|on|nd|on|
[0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18] <--- define this to be a chromosome

You could create a population of 100, 1000, or however many chromosomes you like, each starting off as a random assignment (keep in mind that a 'random' assignment must be a valid permutation, with no duplicate column numbers).
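Building that starting population might look like the sketch below. The function name `random_population` and its parameters are made up for illustration; shuffling a full list of column indices is what keeps every chromosome free of duplicates.

```python
import random

def random_population(n_columns, size, rng=None):
    """Create `size` chromosomes, each a random permutation of the column indices."""
    rng = rng or random.Random()
    population = []
    for _ in range(size):
        chromosome = list(range(n_columns))
        rng.shuffle(chromosome)  # a shuffle can never introduce duplicates
        population.append(chromosome)
    return population
```

For the 19-column text above, `random_population(19, 100)` would give 100 candidate column orders.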

Then run a fitness function on each assignment, or multiple fitness functions if you would like to break it down that way. Start off with one super fitness function that assigns a fitness value to each assignment.

Only take the top 50% of chromosomes and move them on to the next generation, where you create 'children' chromosomes based on your choice of crossover function and mutation probability. For this type of problem I recommend a very light crossover function (or none at all) and a decent mutation rate. If you can find the columns that do not contribute much to words or to the fitness function, then maybe flip those around.
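One generation of that scheme could be sketched as follows, assuming no crossover and a swap mutation (exchanging two columns keeps the chromosome a valid permutation). `swap_mutation`, `next_generation` and the default mutation rate are illustrative choices, and `fitness` is whatever scoring function you pass in.

```python
import random

def swap_mutation(chromosome, rate, rng):
    """With probability `rate`, swap two positions; the result stays a permutation."""
    child = list(chromosome)
    if rng.random() < rate:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

def next_generation(population, fitness, mutation_rate=0.5, rng=None):
    """Keep the top half unchanged and refill with mutated copies of survivors."""
    rng = rng or random.Random()
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(1, len(ranked) // 2)]
    children = [swap_mutation(rng.choice(survivors), mutation_rate, rng)
                for _ in range(len(population) - len(survivors))]
    return survivors + children
```

Because the survivors are carried over untouched, the best fitness in the population can never decrease from one generation to the next.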

Keep doing this for many generations and look at the top-rated assignment in each generation. You would expect the best valid assignment to plateau at some point, and that would be your correct assignment.
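Detecting that plateau can be automated. In this sketch `step` is any function that turns one population into the next, and `patience` and `max_generations` are assumed tuning knobs rather than recommended values.

```python
def run_until_plateau(population, step, fitness, patience=20, max_generations=1000):
    """Advance generations until the best fitness stops improving for `patience` rounds."""
    best = max(population, key=fitness)
    stale = 0
    for _ in range(max_generations):
        population = step(population)
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0  # progress made, reset the counter
        else:
            stale += 1
            if stale >= patience:
                break  # plateau reached
    return best
```

A plateau only shows that the search stopped improving, so the returned assignment is still worth reading by eye.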

This approach may be only mildly better than brute force with a fitness function, but it could also turn out to be pretty good.

One last idea: try to abstract away from 'first column, second column' and group the columns into chunks that form words. Just because [1,4,6....] turns out to form "the", "him", "her", etc., doesn't mean that chunk belongs right at the beginning.

There is a different approach that I like better; I think a dynamic algorithm would be better suited for this.

EDIT: Another Approach

Again based on the dictionary approach, but you would focus on choosing the first few columns before the rest, and if it falls apart and you are not getting words in any particular row it means your earlier selections were wrong and you will need to backtrack.

Select the first column. Chances are there are not many complete words yet, but you will narrow yourself down to a subset of your dictionary: the words that start with the characters in your first column.

Now that you have a column that works, select an adjacent column to its right, one that either forms full words or still leaves valid words possible (as long as no space is present to signal the end of a word).
Repeat.

If no adjacent column is possible given your previous choices, backtrack one column to the left, and don't select the same thing again.
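The backtracking steps above can be sketched as a depth-first search with a prefix check. This is a minimal illustration under assumptions: `WORDS` is a hypothetical dictionary, every row is assumed to start at a word boundary (not true if the original text wraps mid-word), and all function names are made up.

```python
WORDS = {"the", "cat", "on", "a", "mat"}  # hypothetical dictionary
PREFIXES = {w[:i] for w in WORDS for i in range(1, len(w) + 1)}

def clean(token):
    """Drop surrounding punctuation and case before a dictionary lookup."""
    return token.strip('.,"').lower()

def row_is_plausible(text):
    """Finished tokens must be words; the unfinished tail must be a word prefix."""
    *complete, tail = text.split(" ")
    if any(clean(t) and clean(t) not in WORDS for t in complete):
        return False
    return clean(tail) == "" or clean(tail) in PREFIXES

def solve(rows, chosen=(), remaining=None):
    """Place columns left to right; return a full order, or None to signal backtracking."""
    if remaining is None:
        remaining = frozenset(range(len(rows[0])))
    if not remaining:
        return list(chosen)
    for col in sorted(remaining):
        candidate = chosen + (col,)
        if all(row_is_plausible("".join(row[c] for c in candidate)) for row in rows):
            result = solve(rows, candidate, remaining - {col})
            if result is not None:
                return result
    return None  # dead end: the caller tries its next column
```

For the two toy rows `["t ", "ca", "e ", "th"]` and `["at", " m", " a", "on"]`, the search first tries "ca" as the leading column, hits a dead end, backtracks, and ends with the order `[3, 2, 1, 0]`, which reads "the cat " and "on a mat".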

The weakness here is that your dictionary would need to contain all the words in your sentence, as well as all variants of the words. You might need to come up with a heuristic similar to a fitness function that says, "90% of words match, so this is still a valid attempt..." or something of that sort.
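Such a tolerance could be expressed as a match ratio with a threshold. The names `match_ratio` and `still_worth_exploring` and the 0.9 default are illustrative only; the right threshold depends on how complete the dictionary is.

```python
import re

def match_ratio(text, words):
    """Fraction of alphabetic tokens in `text` that appear in `words`."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not tokens:
        return 0.0
    return sum(tok.lower() in words for tok in tokens) / len(tokens)

def still_worth_exploring(text, words, threshold=0.9):
    """Accept a partial arrangement when enough of its tokens are known words."""
    return match_ratio(text, words) >= threshold
```

Replacing a strict "every word must match" test with a check like this lets the search survive names and inflected forms that the dictionary lacks, at the cost of pruning less.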
