How to efficiently store a continually changing dataset (search results) for periodic reporting

Posted 2024-09-15 08:18:31

I am having trouble coming up with a good way to store a dataset that continually changes.

I want to track and periodically report on the contents of specific websites. For example, for a certain website I want to keep track of all the PDF documents that are available. Then I want to report periodically (say, quarterly) on the number of documents, their PDF version numbers and various other statistics. In addition, I want to track how these metrics change over time. For example, I want to graph the increase in the number of PDF documents offered on the website over time.

My input is basically a long list of URLs that point to all the PDF documents on the website. These inputs arrive intermittently, but they may not coincide with the dates I want to run the reports on. For example, in Q4 2010 I may get two lists of URLs, several weeks apart. In Q1 2011 I may get just one.

I am having trouble figuring out how to efficiently store this input data in a database of some sort so that I can easily generate the correct reports.

On the one hand, I could simply insert the complete list into a table each time I receive a new list, along with the date of import. But I fear that the table will grow quite big in a short time, and most of it will be duplicate URLs.

On the other hand, I fear that it may get quite complicated to maintain a list of unique URLs or documents, especially when documents are added, removed and then re-added over time. I fear I might get into the complexities of creating a temporal database. And I shudder to think what happens when the document itself is updated but the URL stays the same (in that case the metadata might change, such as the PDF version, file size, etc.).

Can anyone recommend a good way to store this data so I can generate reports from it? I would especially like to be able to generate reports retroactively. E.g., when I start tracking a new website in Q1 2011, I would like to be able to generate a report from the Q4 2010 data as well, even though the Q1 2011 data has already been imported.

Thanks in advance!

Comments (2)

知你几分 2024-09-22 08:18:31

Why not just a single table, called something like URL_HISTORY:

URL          VARCHAR  (PK)
START_DATE   DATE     (PK)
END_DATE     DATE
VERSION      VARCHAR

Have END_DATE as either NULL or a suitable dummy date (e.g. 31-Dec-9999) where the version has not been superseded; where it has been superseded, set END_DATE to the last date on which that version was valid and create a new record for the new version, e.g.:

+------------------+-------------+--------------+---------+
|URL               | START_DATE  |  END_DATE    | VERSION |
|..\Harry.pdf      | 01-OCT-2009 |  31-DEC-9999 | 1.1.0   |
|..\SarahJane.pdf  | 01-OCT-2009 |  31-DEC-2009 | 1.1.0   |
|..\SarahJane.pdf  | 01-JAN-2010 |  31-DEC-9999 | 1.1.1   |
+------------------+-------------+--------------+---------+
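
A minimal sketch of how this table could be maintained and queried, using Python's built-in sqlite3 module. The function names (`create_schema`, `import_list`, `documents_as_of`) and the lower-case table name are my own assumptions for illustration, not part of the suggestion above. It also assumes dates are stored as ISO `YYYY-MM-DD` text, that lists are imported in chronological order with at most one list per date, and that each list maps a URL to its PDF version.

```python
import sqlite3

OPEN_END = "9999-12-31"  # dummy END_DATE for rows that are still current


def create_schema(conn):
    # ISO-formatted dates compare correctly as plain text in SQLite.
    conn.execute(
        """CREATE TABLE url_history (
               url        TEXT NOT NULL,
               start_date TEXT NOT NULL,
               end_date   TEXT NOT NULL,
               version    TEXT,
               PRIMARY KEY (url, start_date))"""
    )


def import_list(conn, import_date, docs):
    """Apply one incoming list (dict of url -> version) received on import_date."""
    current = dict(conn.execute(
        "SELECT url, version FROM url_history WHERE end_date = ?", (OPEN_END,)))
    # Close open rows whose URL disappeared or whose version changed.
    for url, version in current.items():
        if url not in docs or docs[url] != version:
            conn.execute(
                "UPDATE url_history SET end_date = date(?, '-1 day') "
                "WHERE url = ? AND end_date = ?",
                (import_date, url, OPEN_END))
    # Open rows for documents that are new, re-added, or changed.
    for url, version in docs.items():
        if url not in current or current[url] != version:
            conn.execute(
                "INSERT INTO url_history VALUES (?, ?, ?, ?)",
                (url, import_date, OPEN_END, version))


def documents_as_of(conn, report_date):
    """Return (url, version) pairs that were valid on report_date."""
    return conn.execute(
        "SELECT url, version FROM url_history "
        "WHERE start_date <= ? AND end_date >= ? ORDER BY url",
        (report_date, report_date)).fetchall()


conn = sqlite3.connect(":memory:")
create_schema(conn)
import_list(conn, "2009-10-01", {"Harry.pdf": "1.1.0", "SarahJane.pdf": "1.1.0"})
import_list(conn, "2010-01-01", {"Harry.pdf": "1.1.0", "SarahJane.pdf": "1.1.1"})

q4_2009 = documents_as_of(conn, "2009-12-31")
q1_2010 = documents_as_of(conn, "2010-03-31")
print(len(q4_2009), q4_2009)
print(len(q1_2010), q1_2010)
```

Unchanged documents cost no new rows on import, and a retroactive report is just the same query with an earlier date. A document that is removed and later re-added simply gets a second row with a new START_DATE.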

看透却不说透 2024-09-22 08:18:31

What about using a document database? Instead of saving each URL, you save a document that holds a collection of URLs. Then, whenever you run a process that iterates over all the URLs, you fetch all of the documents that fall within a time frame (or match whatever other criteria you have) and go through the URLs in each of those documents.

This could also be emulated in SQL Server by serializing your object to JSON or XML and storing the output in a suitable column.
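
A rough sketch of that emulation, using SQLite and Python's json module in place of SQL Server so it runs anywhere. The table layout, the function names and the site name `example.org` are assumptions made up for illustration. Each imported list is stored whole as one JSON document, and a report for a given date reads the latest document imported on or before that date.

```python
import json
import sqlite3


def create_snapshot_table(conn):
    # One row per imported list; the URL collection lives in a JSON text column.
    conn.execute(
        """CREATE TABLE snapshot (
               site        TEXT NOT NULL,
               import_date TEXT NOT NULL,  -- ISO YYYY-MM-DD, sorts as text
               urls_json   TEXT NOT NULL,
               PRIMARY KEY (site, import_date))"""
    )


def save_snapshot(conn, site, import_date, urls):
    """Serialize the whole URL list into a single document row."""
    conn.execute(
        "INSERT INTO snapshot VALUES (?, ?, ?)",
        (site, import_date, json.dumps(sorted(urls))))


def snapshot_as_of(conn, site, report_date):
    """Return (import_date, urls) for the latest list on or before report_date."""
    row = conn.execute(
        "SELECT import_date, urls_json FROM snapshot "
        "WHERE site = ? AND import_date <= ? "
        "ORDER BY import_date DESC LIMIT 1",
        (site, report_date)).fetchone()
    if row is None:
        return None  # nothing had been imported yet on that date
    return row[0], json.loads(row[1])


conn = sqlite3.connect(":memory:")
create_snapshot_table(conn)
# Two lists arrive in Q4 2010 and one in Q1 2011, as in the question.
save_snapshot(conn, "example.org", "2010-11-02", ["a.pdf"])
save_snapshot(conn, "example.org", "2010-12-14", ["a.pdf", "b.pdf"])
save_snapshot(conn, "example.org", "2011-02-20", ["a.pdf", "b.pdf", "c.pdf"])

q4_date, q4_urls = snapshot_as_of(conn, "example.org", "2010-12-31")
print(q4_date, len(q4_urls))
```

The trade-off is the one raised in the question: every import stores the full list again, so storage grows with each list, but imports stay trivial and a retroactive report never has to reconstruct history.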
