Text irregularities

Posted 2024-07-13 02:32:14


Does anybody know of a library or piece of software out there that will locate irregularities in text? For example, let's say I have...

1. Name 1, Comment
2. Name 2, Comment
3. Name 3 , Comment
5. Name 10, Comment

This software or library would first split the text into portions it finds similar (much like compression software encodes repetitive, similar portions of text to shrink it), using a variable error tolerance so that near-matches also count as similar. Then, much like a text comparison application or diff/merge tool, it would highlight what it sees as different. I'm thinking about making this tool, but I don't want to reinvent the wheel. If there is anything out there even remotely capable of this, I would really like to know, either to help with that project or at least to know not to build my own. This answer could also help other people hunting for the same thing. I would think the demand is high enough to produce a supply, which is why it boggles my mind that I can't find anything at all.

人生百味 2024-07-20 02:32:14

Depending on what sort of real-life irregularities you want to find or correct, this problem is radically different.

Here is your example updated with real text:

1. Lazarus Long, Get the first shot off fast.
2. Hiro Protagonist, Greatest swordfighter[sic] in the world.
3. Alice , Down the rabbit hole.
5. Orem, Sink of power.

In this example the errors could be fixed with find and replace in a decent text editor. Text editors and hex editors can work miracles if you get creative with wildcards. The problem remains simple as long as your delimiters (. or ,) are present. As you probably already know, as soon as one of those is missing the problem becomes much more complex.
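For anyone who would rather script it than use an editor, here is a minimal sketch of the same find-and-replace idea with regular expressions in Python. The function name `normalize_line` is made up for illustration, and it assumes every line has the shape "number. name, comment". It only tidies spacing around the two delimiters; it does not touch the gap in numbering (3 to 5).

```python
import re

def normalize_line(line: str) -> str:
    """Tidy the spacing around the '.' after the number and the first ','."""
    line = line.strip()
    # "5 . Orem" -> "5. Orem": no space before the dot, exactly one after it.
    line = re.sub(r"^(\d+)\s*\.\s*", r"\1. ", line)
    # "Alice , Down" -> "Alice, Down": only the first comma is treated as
    # the name/comment delimiter (an assumption about the format).
    line = re.sub(r"\s*,\s*", ", ", line, count=1)
    return line

lines = [
    "1. Lazarus Long, Get the first shot off fast.",
    "2. Hiro Protagonist, Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5. Orem, Sink of power.",
]
cleaned = [normalize_line(l) for l in lines]
```

This works only because both delimiters are still present on every line, which is exactly the easy case.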

Example of a hard problem:

1. Lazarus Long, Get the first shot off fast.
 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.
3. Alice , Down the rabbit hole.
5 . Orem, , Sink of power.

I would probably attack this in a few steps.
1. Clean up extra spaces.
2. Find out key statistics such as the number of delimiters per line and the average number of words or characters per delimited column. Most names are one or two words; comments are unknown or limited by input.
3. Find lines with a statistically improbable number of key features.
4. Try your best to correct them.
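Steps 1 to 3 can be roughly sketched in Python as below. This is only an illustration under assumptions of my own: the "key features" are just the comma count and whether the line starts with "number-dot-space", and "statistically improbable" is simplified to "differs from the most common feature combination". The names `features` and `find_irregular` are hypothetical, and step 4 (correcting the lines) is left out because it depends on the data.

```python
import re
from collections import Counter

def features(line):
    """Step 1 and 2: collapse whitespace, then extract simple key features."""
    line = re.sub(r"\s+", " ", line).strip()
    comma_count = line.count(",")
    has_number_prefix = bool(re.match(r"\d+\. ", line))
    return (comma_count, has_number_prefix)

def find_irregular(lines):
    """Step 3: return indexes of lines whose features differ from the majority."""
    feats = [features(l) for l in lines]
    most_common_feats, _ = Counter(feats).most_common(1)[0]
    return [i for i, f in enumerate(feats) if f != most_common_feats]

hard = [
    "1. Lazarus Long, Get the first shot off fast.",
    " 2. Hiro Protagonist  Greatest swordfighter[sic] in the world.",
    "3. Alice , Down the rabbit hole.",
    "5 . Orem, , Sink of power.",
]
suspects = find_irregular(hard)  # indexes into `hard`, zero-based
```

On the hard example this flags the second line (no comma) and the fourth line (two commas and a stray space before the dot). The stray space before the comma in the Alice line is not flagged, since these features do not look at spacing around commas; a real tool would need more features, such as words per column.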

I understand that this does not directly solve your problem, but maybe one of these ideas can patch it over for a bit. It is possible that past wheelwrights never completed any designs.
