How to detect tabular data from a variety of sources
In an experimental project I am playing with, I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course, there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.
My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, and so on. Of course I realize that I would have to come up with a list of different things to detect, but I wondered if there was a more intelligent way of detecting these features than doing a relatively slow search for each type.
I realize this question isn't especially eloquently put, so I hope it makes some sense!
Any ideas?
(No idea how to tag this either, so help there is welcome!)
Comments (3)
The only reliable scheme would be to use machine learning. You could, for example, train a perceptron classifier on a set of labeled examples of tabular and non-tabular material.
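To make the perceptron idea concrete, here is a minimal sketch in plain Python. The feature set (how often the best candidate delimiter appears per line, and how consistent that count is), the `DELIMS` list, and the tiny `SAMPLES` training set are all assumptions made for illustration; a real classifier would need far more examples and better features.

```python
import statistics

DELIMS = "\t|,;"  # hypothetical list of candidate delimiters

# Toy training data (made up for this sketch): (text, label), 1 = tabular.
SAMPLES = [
    ("a\tb\tc\n1\t2\t3\n4\t5\t6", 1),
    ("name|age\nbob|31\namy|27", 1),
    ("Hello there, this is a sentence.\nAnother line without any delimiters at all.", 0),
    ("Just plain prose\nwith nothing special", 0),
]

def features(text):
    """Map a text block to [delimiter density, count consistency, bias]."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return [0.0, 0.0, 1.0]
    best_mean, best_consistency = 0.0, 0.0
    for d in DELIMS:
        counts = [ln.count(d) for ln in lines]
        mean = statistics.mean(counts)
        if mean > best_mean:
            best_mean = mean
            # 1.0 when every line has the same count, lower as counts vary
            best_consistency = 1.0 / (1.0 + statistics.pstdev(counts))
    return [min(best_mean, 10.0) / 10.0, best_consistency, 1.0]

def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule: adjust weights only on misclassified samples."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for text, label in samples:
            x = features(text)
            predicted = 1 if score(weights, x) > 0 else 0
            error = label - predicted
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights

def predict(weights, text):
    return 1 if score(weights, features(text)) > 0 else 0
```

Training with `train_perceptron(SAMPLES)` and then calling `predict(weights, some_text)` gives a 0/1 answer. With only four samples this mostly learns that "same delimiter count on every line" matters, which is the same signal a hand-written heuristic would use.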
A mixed solution might be appropriate, i.e. one where you handle the most common/obvious cases with simple heuristics (in a "switch-like" manner, as you suggested) and leave the harder cases to automated learning and other types of classifier logic.
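A rough sketch of that hybrid, assuming a hypothetical `CANDIDATES` list and a `fallback` callable standing in for whatever classifier handles the hard cases. Instead of one search per delimiter type, each line is scanned once and all candidate characters are counted together.

```python
from collections import Counter

CANDIDATES = "\t|,;"  # hypothetical list of "obvious" delimiters

def obvious_delimiter(text, min_lines=2):
    """Return a delimiter that occurs the same nonzero number of times
    on every non-blank line, or None if no candidate qualifies."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return None
    # One pass per line counts every candidate character at once.
    per_line = [Counter(ch for ch in ln if ch in CANDIDATES) for ln in lines]
    for d in CANDIDATES:
        first = per_line[0][d]
        if first > 0 and all(c[d] == first for c in per_line):
            return d
    return None

def looks_tabular(text, fallback=lambda text: False):
    """Cheap heuristic first; defer to a classifier only when it fails."""
    if obvious_delimiter(text) is not None:
        return True
    return fallback(text)
```

The default `fallback` simply answers "not tabular"; in practice it would be replaced by a trained model's prediction function.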
This assumes that you do not already have defined types stored in the TSV.
A TSV file is typically:
[Value1]\t[Value..N]\n
My suggestion would be to:
With the result of step 4, if you get a remainder of 0 then you have a candidate TSV file. From there, you may want to do the following:
Once you have a good prediction of the number of tab-separated values, you can use a regular expression to parse out the values [as a group].
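The numbered steps are not shown above, so the following is only one possible reading of the remainder check: count the tabs and the lines, and treat the text as a TSV candidate when the tab count divides evenly by the line count. The function names `tsv_candidate` and `parse_rows` are made up for this sketch, and the extra per-line check is an addition, since an even total does not guarantee that every line has the same number of tabs.

```python
import re

def tsv_candidate(text):
    """Return the predicted column count, or None if the text does not
    look like TSV under the tabs-per-line remainder check."""
    lines = [ln for ln in text.split("\n") if ln]
    total_tabs = text.count("\t")
    if not lines or total_tabs == 0:
        return None
    tabs_per_line, remainder = divmod(total_tabs, len(lines))
    if remainder != 0:
        return None
    # Divisibility is necessary but not sufficient, so confirm each line.
    if any(ln.count("\t") != tabs_per_line for ln in lines):
        return None
    return tabs_per_line + 1

def parse_rows(text, columns):
    """Parse each line into a tuple of `columns` values, one regex group
    per value; lines with a different number of fields are skipped."""
    cell = r"([^\t\n]*)"
    pattern = re.compile(r"^" + r"\t".join([cell] * columns) + r"$", re.MULTILINE)
    return [m.groups() for m in pattern.finditer(text)]
```

For example, `tsv_candidate("a\tb\tc\n1\t2\t3\n")` predicts three columns, and `parse_rows` with that count yields one tuple per line.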