How to detect tabular data from a variety of sources

Posted on 2024-08-11 06:48:28

In an experimental project I am playing with, I want to be able to look at textual data and detect whether it contains data in a tabular format. Of course, there are a lot of cases that could look like tabular data, so I was wondering what sort of algorithm I'd need to research to look for common features.

My first thought was to write a long switch/case statement that checked for data separated by tabs, then another case for data separated by pipe symbols, then yet another case for data separated in some other way, and so on. Of course, I realize that I would have to come up with a list of the different things to detect, but I wondered whether there is a more intelligent way of detecting these features than doing a relatively slow search for each type.

I realize this question isn't especially eloquently put so I hope it makes some sense!

Any ideas?

(no idea how to tag this either - so help there is welcomed!)

Comments (3)

许久 2024-08-18 06:48:28

The only reliable scheme would be to use machine learning. You could, for example, train a perceptron classifier on a set of examples of tabular and non-tabular material.
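
As a rough illustration of that idea, here is a minimal perceptron sketch in plain Python. The feature set (per-line delimiter counts and how consistent they are), the delimiter list, and the helper names `features`, `train_perceptron` and `predict` are all assumptions made for this example; the answer itself does not specify any of them.

```python
from statistics import mean, pstdev

DELIMS = ["\t", "|", ",", ";"]  # assumed candidate delimiters


def features(text):
    """Hypothetical feature vector: for each delimiter, the scaled mean count
    per line and a consistency score (1.0 when every line has the same count)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    vec = []
    for d in DELIMS:
        counts = [ln.count(d) for ln in lines] or [0]
        avg = mean(counts)
        consistency = 1.0 / (1.0 + pstdev(counts)) if avg > 0 else 0.0
        vec += [min(avg, 10) / 10, consistency]
    return vec


def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Classic perceptron rule: adjust the weights only on misclassified samples."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is 1 (tabular) or 0 (not tabular)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            if err:
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def predict(model, text):
    """Return True when the trained model scores the text as tabular."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, features(text))) + b > 0
```

In practice the training set would need to be far larger and more varied than a handful of strings, and the features would likely need tuning for the kinds of text you actually see.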

倾其所爱 2024-08-18 06:48:28

A mixed solution might be appropriate: handle the most common and obvious cases with simple heuristics (in a "switch-like" manner, as you suggested), and leave the harder cases to automated learning and other types of classifier logic.
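
A sketch of that split might look like the following: a cheap heuristic tries a short list of delimiters first, and anything it cannot settle is handed to a fallback. The delimiter list, the function names, and the `fallback` callable (standing in for whatever learned classifier is used) are assumptions for illustration only.

```python
CANDIDATE_DELIMS = ["\t", "|", ",", ";"]  # assumed list; extend as needed


def heuristic_delimiter(text, min_rows=2):
    """Return a delimiter that appears the same non-zero number of times on
    every non-empty line, or None if no candidate is that clear-cut."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) < min_rows:
        return None
    for d in CANDIDATE_DELIMS:
        counts = {ln.count(d) for ln in lines}
        if len(counts) == 1 and counts != {0}:
            return d
    return None


def looks_tabular(text, fallback=None):
    """Obvious cases are decided by the heuristic; the rest go to `fallback`,
    a hypothetical callable (e.g. a trained classifier) returning True/False."""
    if heuristic_delimiter(text) is not None:
        return True
    return bool(fallback(text)) if fallback else False
```

The heuristic is deliberately strict, so ragged or quoted data falls through to the fallback rather than being misjudged by a simple rule.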

天冷不及心凉 2024-08-18 06:48:28

This assumes that you do not already have defined types stored in the TSV.

A TSV file is typically
[Value1]\t[Value..N]\n

My suggestion would be to:

  1. Count up all the tabs
  2. Count up all the newlines
  3. Count the total tabs in the first row
  4. Divide the total number of tabs by the tabs in the first row
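
A minimal sketch of these four steps, followed by a zero-remainder test on the result of the division. The function name `tsv_candidate` and the optional `check_rows` comparison are assumptions added for illustration; the answer does not say how the newline count from step 2 should be used.

```python
def tsv_candidate(text, check_rows=False):
    """Return True if the tab counts make `text` a plausible TSV candidate."""
    total_tabs = text.count("\t")                        # step 1
    total_newlines = text.count("\n")                    # step 2
    first_row_tabs = text.split("\n", 1)[0].count("\t")  # step 3
    if first_row_tabs == 0:
        return False  # no tabs in the first row, so there is nothing to divide by
    rows, remainder = divmod(total_tabs, first_row_tabs)  # step 4
    if remainder != 0:
        return False
    if check_rows:
        # Assumption: the quotient should also equal the number of rows,
        # counting a final row that has no trailing newline.
        expected_rows = total_newlines + (0 if text.endswith("\n") else 1)
        return rows == expected_rows
    return True
```

A zero remainder is necessary but not sufficient: two rows with one tab and three tabs pass the division test just as two rows with two tabs each do, so a per-line check is still worthwhile.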

Take the result of step 4: if the division leaves a remainder of 0, then you have a candidate TSV file. From there, you may want to do one of the following:

  1. You can continue reading the data and ignore any line that has fewer or more tabs than predicted
  2. You can scan each line before reading to make sure all lines are consistent
  3. You can read up to the first line that does not fit the format and then throw an error
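
The three options could be sketched as modes of a single reader. The function name `read_rows`, the mode names, and the choice to skip empty lines are assumptions for this example.

```python
def read_rows(text, expected_tabs, mode="skip"):
    """Split `text` into rows of values.

    mode="skip"    -> option 1: silently ignore lines with the wrong tab count
    mode="prescan" -> option 2: check every line before reading any of them
    mode="strict"  -> option 3: read until the first bad line, then raise
    """
    if mode not in ("skip", "prescan", "strict"):
        raise ValueError(f"unknown mode: {mode}")
    lines = [ln for ln in text.split("\n") if ln]  # empty lines are dropped
    if mode == "prescan":
        bad = [i for i, ln in enumerate(lines, 1) if ln.count("\t") != expected_tabs]
        if bad:
            raise ValueError(f"inconsistent lines: {bad}")
    rows = []
    for i, ln in enumerate(lines, 1):
        if ln.count("\t") != expected_tabs:
            if mode == "strict":
                raise ValueError(f"line {i} does not fit the format")
            continue  # "skip": ignore the malformed line
        rows.append(ln.split("\t"))
    return rows
```

Which mode fits depends on whether partial results are acceptable: "skip" is the most forgiving, while "prescan" guarantees that nothing is read from an inconsistent input.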

Once you have a good prediction of the number of tab-separated values per line, you can use a regular expression to parse out the values [as a group].
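
One way to read "as a group" is a pattern with one capture group per value, built from the predicted tab count. This is a sketch under that interpretation; the helper names `build_row_pattern` and `parse_values` are hypothetical.

```python
import re


def build_row_pattern(n_tabs):
    """One capture group per value: n_tabs tabs means n_tabs + 1 groups."""
    field = r"([^\t\n]*)"  # a value is any run of characters except tab/newline
    body = r"\t".join([field] * (n_tabs + 1))
    return re.compile("^" + body + "$", re.MULTILINE)


def parse_values(text, n_tabs):
    """Return one tuple of values for every line that fits the pattern."""
    pattern = build_row_pattern(n_tabs)
    return [m.groups() for m in pattern.finditer(text)]
```

Lines with a different number of tabs simply fail to match, so this behaves like the "ignore" option; for plain TSV, `line.split("\t")` after a tab-count check gives the same values with less machinery.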
