What is a good data model for cross-tabulation?

Posted on 2024-07-24 17:55:53

I'm implementing a cross-tabulation library in Python as a programming exercise for my new job, and I've got an implementation of the requirements that works but is inelegant and redundant. I'd like a better model for it, something that allows a nice, clean movement of data between the base model, stored as tabular data in flat files, and all of the statistical analysis results that might be asked of this.

Right now, I have a progression from a set of tuples for each row in the table, to a histogram counting the frequencies of the appearances of the tuples of interest, to a serializer that -- somewhat clumsily -- compiles the output into a set of table cells for display. However, I end up having to go back up to the table or to the histogram more often than I want to because there's never enough information in place.

So, any ideas?

Edit: Here's an example of some data, and what I want to be able to build from
it. Note that "." denotes a bit of 'missing' data, that is only conditionally
counted.

If I were looking at the correlation between columns 0 and 2 above, this is the table I'd have:

    . 1 2 3 4
1   0 1 0 3 0
2   2 1 1 0 1

In addition, I'd want to be able to calculate ratio of frequency/total, frequency/subtotal, &c.

毁我热情 2024-07-31 17:55:53

You could use an in-memory SQLite database as the data structure, and define the desired operations as SQL queries.

import sqlite3

# ':memory:' keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (a, b, c)')

# None is stored as SQL NULL and stands for a missing value.
conn.executemany('INSERT INTO data VALUES (?, ?, ?)', [
    (1, None,    1),
    (1,    0,    3),
    (1,    0,    3),
    (1,    2,    3),
    (2, None,    1),
    (2,    0, None),
    (2,    2,    2),
    (2,    2,    4),
    (2,    2, None),
])

# queries
# ...
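
The queries themselves are left out above. As a rough sketch of what one of them might look like (my own guess, not the answerer's code), a `GROUP BY` over two columns gives the cell counts of a cross-tabulation, and a second `GROUP BY` gives the row subtotals needed for frequency/subtotal ratios. The names `counts`, `subtotals`, and `ratio_to_subtotal` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (a, b, c)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", [
    (1, None, 1), (1, 0, 3), (1, 0, 3), (1, 2, 3),
    (2, None, 1), (2, 0, None), (2, 2, 2), (2, 2, 4), (2, 2, None),
])

# One result row per (a, c) pair that actually occurs.
# SQLite puts all NULLs of a grouping column into one group.
counts = {
    (a, c): n
    for a, c, n in conn.execute(
        "SELECT a, c, COUNT(*) FROM data GROUP BY a, c"
    )
}

# Row subtotals and the grand total, for ratio calculations.
subtotals = dict(conn.execute("SELECT a, COUNT(*) FROM data GROUP BY a"))
total = sum(subtotals.values())

def ratio_to_subtotal(a, c):
    """Frequency of the (a, c) cell divided by the subtotal of row a."""
    return counts.get((a, c), 0) / subtotals[a]
```

Pairs that never occur are simply absent from `counts`, so the display code has to fill in zeros itself, for example with `counts.get((a, c), 0)`.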

吃不饱 2024-07-31 17:55:53

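S W has posted a good basic recipe for this on activestate.com.

The essence seems to be...

Define xsort=[] and ysort=[] as arrays of your axes. Populate them by iterating through your data, or in some other way.
Define rs={} as a dict of dicts holding your tabulated data, by iterating through your data and incrementing rs[yvalue][xvalue]. Create missing keys if and when needed.

Then, for example, the total for row y would be sum([rs[y][x] for x in xsort])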
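
A minimal sketch of that outline, written from the description here rather than copied from the recipe, so treat the details as assumptions. The sample `data` and the helper `cell` are hypothetical. `cell` returns 0 for pairs that were never seen, which avoids the KeyError that a bare `rs[y][x]` would raise when a key is missing.

```python
# Hypothetical (yvalue, xvalue) observations; None stands for missing data.
data = [(1, 1), (1, 3), (1, 3), (2, None), (2, 1), (2, 4)]

xsort, ysort = [], []   # axis values, in order of first appearance
rs = {}                 # rs[yvalue][xvalue] -> count

for y, x in data:
    if y not in ysort:
        ysort.append(y)
    if x not in xsort:
        xsort.append(x)
    row = rs.setdefault(y, {})      # create the row dict on first use
    row[x] = row.get(x, 0) + 1      # create the cell on first use

def cell(y, x):
    """Count for one cell, 0 if the pair never occurred."""
    return rs.get(y, {}).get(x, 0)

row_totals = {y: sum(cell(y, x) for x in xsort) for y in ysort}
col_totals = {x: sum(cell(y, x) for y in ysort) for x in xsort}
```

Keeping the axes in `xsort` and `ysort` separate from the counts means the display order can be changed (sorted, or with the missing-value column first) without touching `rs`.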

清旖 2024-07-31 17:55:53

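Since this is an early Python programming exercise, they probably want you to see which of Python's built-in mechanisms suit the initial version of the problem. The dictionary structure seems a good candidate. The first column value from your tab-separated file can be the key into a dictionary. The entry found by that key can itself be a dictionary, whose key is the second column value. The entries of the sub-dictionary would be counts, each initialized to 1 the first time a pair is encountered and incremented after that.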
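
Something along these lines, as a sketch only. The sample `text` is made up, and I'm assuming "." marks a missing value as in the question. The standard `csv` module handles the tab splitting. Keys stay as strings because nothing here converts them.

```python
import csv
import io

# Hypothetical tab-separated input; "." marks a missing value.
text = "1\t.\t1\n1\t0\t3\n1\t0\t3\n2\t2\t.\n"

counts = {}  # counts[first_column][second_column] -> number of occurrences
for record in csv.reader(io.StringIO(text), delimiter="\t"):
    first, second = record[0], record[1]
    if first not in counts:
        counts[first] = {second: 1}   # new sub-dictionary, count starts at 1
    elif second not in counts[first]:
        counts[first][second] = 1     # new pair in an existing sub-dictionary
    else:
        counts[first][second] += 1
```

For a real file you would pass an open file object to `csv.reader` instead of the `io.StringIO` wrapper.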
