What is a good data model for cross-tabulation?

Posted 2024-07-24 17:55:53

I'm implementing a cross-tabulation library in Python as a programming exercise for my new job, and I've got an implementation of the requirements that works but is inelegant and redundant. I'd like a better model for it, something that allows a nice, clean movement of data between the base model, stored as tabular data in flat files, and all of the statistical analysis results that might be asked of this.

Right now, I have a progression from a set of tuples for each row in the table, to a histogram counting the frequencies of the appearances of the tuples of interest, to a serializer that -- somewhat clumsily -- compiles the output into a set of table cells for display. However, I end up having to go back up to the table or to the histogram more often than I want to because there's never enough information in place.

So, any ideas?

Edit: Here's an example of some data, and what I want to be able to build from it. Note that "." denotes 'missing' data, which is only conditionally counted.

1   .   1
1   0   3
1   0   3
1   2   3
2   .   1
2   0   .
2   2   2
2   2   4
2   2   .

If I were looking at the correlation between columns 0 and 2 above, this is the table I'd have:

    . 1 2 3 4
1   0 1 0 3 0
2   2 1 1 0 1

In addition, I'd want to be able to calculate ratios such as frequency/total, frequency/subtotal, etc.

Comments (4)

毁我热情 2024-07-31 17:55:53

You could use an in-memory sqlite database as a data structure, and define the desired operations as SQL queries.

import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (a, b, c)')

# Missing values (".") are stored as NULL, i.e. None in Python.
conn.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    (1, None,    1),
    (1,    0,    3),
    (1,    0,    3),
    (1,    2,    3),
    (2, None,    1),
    (2,    0, None),
    (2,    2,    2),
    (2,    2,    4),
    (2,    2, None),
])

# queries
# ...
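
The answer leaves the queries open. As one possible sketch (not the answerer's own queries), the cross-tabulation of columns a and c from the question could be expressed with GROUP BY. The variable names pair_counts, row_totals and total are made up for this illustration, and missing values are assumed to be stored as NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a, b, c)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3), (2, None, 1),
    (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
])

# Frequency of each (a, c) pair. GROUP BY puts all NULLs of a column
# into one group, so missing values get their own cell.
pair_counts = conn.execute(
    "SELECT a, c, COUNT(*) FROM data GROUP BY a, c ORDER BY a, c"
).fetchall()

# Subtotal for each value of column a (the row totals of the table).
row_totals = dict(conn.execute("SELECT a, COUNT(*) FROM data GROUP BY a"))

# Grand total of all rows.
total = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]

# Print each non-empty cell with frequency/subtotal and frequency/total.
for a, c, n in pair_counts:
    label = "." if c is None else c
    print(a, label, n, n / row_totals[a], n / total)
```

Pairs that never occur (such as a=1 with c=2) simply do not appear in the result, so a display layer would have to fill those cells with 0.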
吃不饱 2024-07-31 17:55:53

S W has posted a good basic recipe for this on activestate.com.

The essence seems to be...

  1. Define xsort=[] and ysort=[] as arrays of your axes. Populate them by iterating through your data, or some other way.
  2. Define rs={} as a dict of dicts of your tabulated data, by iterating through your data and incrementing rs[yvalue][xvalue]. Create missing keys if/when needed.

Then for example the total for row y would be sum([rs[y][x] for x in xsort])
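
A minimal sketch of those two steps, using the sample data from the question. This is an illustration of the idea, not the ActiveState recipe itself; the function name crosstab and the use of None for "." are assumptions made here:

```python
rows = [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3), (2, None, 1),
    (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
]

def sort_key(value):
    # Sort missing values (None) before everything else.
    return (value is not None, 0 if value is None else value)

def crosstab(rows, ycol, xcol):
    """Return (xsort, ysort, rs) where rs[y][x] is the count of (y, x)."""
    xsort, ysort, rs = [], [], {}
    for row in rows:
        y, x = row[ycol], row[xcol]
        if y not in ysort:
            ysort.append(y)
        if x not in xsort:
            xsort.append(x)
        inner = rs.setdefault(y, {})      # create the missing row dict
        inner[x] = inner.get(x, 0) + 1    # create or increment the cell
    xsort.sort(key=sort_key)
    ysort.sort(key=sort_key)
    # Fill in zero cells so every rs[y][x] lookup succeeds.
    for y in ysort:
        for x in xsort:
            rs[y].setdefault(x, 0)
    return xsort, ysort, rs

xsort, ysort, rs = crosstab(rows, ycol=0, xcol=2)

row_totals = {y: sum(rs[y][x] for x in xsort) for y in ysort}
grand_total = sum(row_totals.values())

# frequency/subtotal and frequency/total for every cell
by_row = {y: {x: rs[y][x] / row_totals[y] for x in xsort} for y in ysort}
by_total = {y: {x: rs[y][x] / grand_total for x in xsort} for y in ysort}
```

Because the zero cells are filled in up front, the totals and ratios can be computed from rs alone, without going back to the original rows.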

清旖 2024-07-31 17:55:53

Since this is an early programming exercise for Python, they probably want you to see which of Python's built-in mechanisms would suit the initial version of the problem. The dictionary structure seems a good candidate. The first column value from your tab-separated file can be the key into a dictionary. The entry found by that key can itself be a dictionary, whose key is the second column value. Each entry of the subdictionary would be a count, initialized to 1 the first time a pair is encountered and incremented on every later occurrence.
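
Written out with nothing but built-in dictionaries, that could look roughly like the sketch below. The tab-separated text is the sample data from the question held in a string (a real program would read it from a file), and the names text and counts are chosen only for this example:

```python
text = (
    "1\t.\t1\n1\t0\t3\n1\t0\t3\n1\t2\t3\n2\t.\t1\n"
    "2\t0\t.\n2\t2\t2\n2\t2\t4\n2\t2\t.\n"
)

counts = {}
for line in text.splitlines():
    first, second, _third = line.split("\t")
    sub = counts.get(first)
    if sub is None:
        # First time this first-column value is seen: new subdictionary,
        # with the count for this pair initialized to 1.
        counts[first] = {second: 1}
    elif second in sub:
        sub[second] += 1
    else:
        sub[second] = 1

print(counts)
```

The values are left as strings here, so "." is just another key; whether to count it can then be decided at reporting time.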

路弥 2024-07-31 17:55:53

Why not store it using HTML Tables? It might not be the best, but you could then, very easily, view it in a browser.

Edit:

I just re-read the question and you're asking for a data model, not a storage model. To answer that question...

It all depends on how you're going to be reporting on the data. For example, if you're going to be doing a lot of pivoting or aggregation, it might make more sense to store it in column-major order; that way you can just sum a column to get counts.
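
As a rough sketch of what column-major storage might look like in Python (one list per column, with None standing in for "."; the variable names are assumptions for this example), using the data from the question:

```python
from collections import Counter

rows = [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3), (2, None, 1),
    (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
]

# Transpose the row tuples into one list per column.
columns = [list(col) for col in zip(*rows)]

# Aggregating a single column touches only that column's list.
col0_counts = Counter(columns[0])
missing_in_col2 = sum(value is None for value in columns[2])

# Cross-tabulating two columns is a count over their zipped values.
pair_counts = Counter(zip(columns[0], columns[2]))
```

A Counter returns 0 for pairs that never occur, so empty cells of the table need no special handling.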

It'll help a lot if you explain what kind of information you're trying to extract.
