Text irregularities
Does anybody know of a library or piece of software that will locate irregularities in text? For example, let's say I have...
1. Name 1, Comment
2. Name 2, Comment
3. Name 3 , Comment
5. Name 10, Comment
This software or library would first split the text into portions it finds similar (much as compression software encodes repetitive, similar portions of text to shrink it), using a variable for error tolerance so that near-matches still count as similar. Then, much like a text comparison application or diff/merge tool, it would highlight what it sees as different. I'm thinking about making this tool, but I don't want to reinvent the wheel. If there is anything out there even remotely capable of this, I would really like to know, either to help with that project or at least to know not to start my own. This answer could also help other people hunting for the same thing. I would think the demand is high enough to create a supply, which is why it boggles my mind that I can't find anything at all.
Comments (3)
Depending on what sort of real-life irregularities you want to find or correct, this problem is radically different.
Here is your example updated with real text:
In this example the errors could be fixed with a decent text editor using find and replace. Text editors and hex editors can work miracles if you get creative with wildcards. The problem remains simple as long as your delimiters (. or ,) are present. As you probably already know, as soon as one of those is missing the problem becomes much more complex.
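For what it's worth, the same kind of wildcard find and replace can be scripted. This is only a minimal sketch using Python's re module; the function name normalize_delimiters and the specific rules (no space before a comma or period, exactly one space after a comma) are my own assumptions about what "fixed" means here, not something from the question.

```python
import re

def normalize_delimiters(line: str) -> str:
    """Scripted find-and-replace, similar to a text editor's wildcard search.

    The rules below are assumptions chosen for illustration.
    """
    # Remove whitespace that sits directly before a comma or period.
    line = re.sub(r"\s+([,.])", r"\1", line)
    # Force exactly one space after every comma.
    line = re.sub(r",\s*", ", ", line)
    # Collapse any remaining runs of whitespace into a single space.
    line = re.sub(r"\s{2,}", " ", line)
    return line.strip()

print(normalize_delimiters("3. Name 3 , Comment"))  # prints: 3. Name 3, Comment
```

Note that this only works while the delimiters are still there; it has no way to put back a comma that is missing.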
Example of a hard problem:
I would probably attack this in a few steps.
1. Clean up extra spaces.
2. Find out key statistics such as the number of delimiters per line and the average number of words or characters per delimited column. Most names are one or two words; comments are of unknown length or limited by the input.
3. Find lines with a statistically improbable number of key features.
4. Try your best to correct them.
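Steps 1 to 3 above could be sketched roughly as follows. This is a simplification under my own assumptions: instead of averages it just counts how often each combination of features occurs and flags the rare ones, and the names line_features, find_outliers and the min_share cutoff are hypothetical. Step 4 (correcting the lines) is left out because it depends entirely on the data.

```python
from collections import Counter

def line_features(line: str) -> tuple:
    """Key features of one line: delimiter counts and words per column."""
    cleaned = " ".join(line.split())  # step 1: clean up extra spaces
    columns = [col.strip() for col in cleaned.split(",")]
    words_per_column = tuple(len(col.split()) for col in columns)
    return (cleaned.count("."), cleaned.count(","), words_per_column)

def find_outliers(lines, min_share=0.5):
    """Return (line_number, line) pairs whose features are rare in the input.

    min_share is an assumed cutoff: a feature combination seen in fewer
    than this fraction of all lines is treated as improbable.
    """
    features = [line_features(line) for line in lines]  # step 2
    counts = Counter(features)
    total = len(lines)
    return [
        (number, line)
        for number, (line, feat) in enumerate(zip(lines, features), start=1)
        if counts[feat] / total < min_share  # step 3
    ]

# Made-up sample data: the fourth line has lost its comma.
sample = [
    "1. Name 1, Comment",
    "2. Name 2, Comment",
    "3. Name 3 , Comment",
    "4. Name 4 Comment",
    "5. Name 10, Comment",
]
print(find_outliers(sample))
```

With this sample only the line with the missing comma is flagged. The stray space before the comma in the third line is not, because the columns are stripped before the words are counted.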
I understand that this does not directly solve your problem, but maybe one of these ideas can patch it over for a bit. It is possible that past wheelwrights never completed any designs.
If you are into Python, you might try difflib.
It's not an exact solution to your problem, but it might be helpful.
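As a rough idea of how difflib might be used here: SequenceMatcher.ratio() gives a similarity score between 0 and 1, which can stand in for the error tolerance mentioned in the question, and ndiff produces a character-level hint of what differs. The helper names, the choice of a reference line, and the 0.8 threshold are all assumptions for illustration.

```python
import difflib

def flag_dissimilar(lines, reference, threshold=0.8):
    """Return (line, ratio) pairs for lines that look unlike `reference`.

    threshold is an assumed tolerance: 1.0 means identical, 0.0 means
    nothing in common.
    """
    flagged = []
    for line in lines:
        ratio = difflib.SequenceMatcher(None, reference, line).ratio()
        if ratio < threshold:
            flagged.append((line, round(ratio, 2)))
    return flagged

def show_difference(reference, line):
    """Diff two single lines; '?' rows, when present, point at the changes."""
    diff = difflib.ndiff([reference], [line])
    return [row.rstrip("\n") for row in diff]

for row in show_difference("1. Name 1, Comment", "3. Name 3 , Comment"):
    print(row)
```

One caveat: values that legitimately vary from line to line (the numbers here) also lower the ratio, so the threshold needs tuning, or the varying parts need masking first.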
It sounds like you'd want to use a regex to describe an "ideal response", then compare the rest of the lines against it.
Or you could write a more complicated program that boils each line down into a regex query and then compares the queries to each other to see which ones are different.
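A minimal sketch of the second idea, under my own assumptions: each line is reduced to a regex-like "shape" (digit runs become \d+, letter runs become [A-Za-z]+, everything else is kept literally), and the most common shape is treated as the ideal one. The names line_shape and odd_lines are hypothetical.

```python
import re
from collections import Counter

# One token is a run of digits, a run of ASCII letters, or any single character.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<word>[A-Za-z]+)|(?P<other>.)", re.DOTALL)

def line_shape(line: str) -> str:
    """Boil a line down to a regex that describes its shape."""
    parts = []
    for match in TOKEN.finditer(line):
        if match.lastgroup == "num":
            parts.append(r"\d+")
        elif match.lastgroup == "word":
            parts.append("[A-Za-z]+")
        else:
            # Keep punctuation and spaces literally, escaped for regex use.
            parts.append(re.escape(match.group()))
    return "".join(parts)

def odd_lines(lines):
    """Return the lines whose shape differs from the most common shape."""
    shapes = [line_shape(line) for line in lines]
    if not shapes:
        return []
    ideal, _ = Counter(shapes).most_common(1)[0]
    return [line for line, shape in zip(lines, shapes) if shape != ideal]

# The example from the question.
sample = [
    "1. Name 1, Comment",
    "2. Name 2, Comment",
    "3. Name 3 , Comment",
    "5. Name 10, Comment",
]
print(odd_lines(sample))  # prints: ['3. Name 3 , Comment']
```

This catches the stray space before the comma, but not the jump from 3 to 5 or the "Name 10", since those have the same shape as the other lines. Checking the values themselves would need extra logic on top.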