Diff algorithms for legislation

Posted on 2024-12-12 09:21:57

As part of an ambitious project, I am attempting to better understand the legislative text that is written into bills introduced in the U.S. Congress. I have electronic versions of recent bills, and am attempting to implement an algorithm that would compare a bill with prior bills, looking for similarities. The hypothesis is that many bills that fail end up getting co-opted into other bills.

Obviously, this is a large task. There are already many questions about diff engines, but my problem is slightly different. Bills often package several ideas together, so the diff engine would need to compare portions of bills rather than entire bills.

Any recommendations on difference algorithms or a method to go about doing this? I have access to serious computational power, but do keep in mind that I will be using a dataset of about 100,000 bills.

Comments (2)

遮云壑 2024-12-19 09:21:57

Take a look at Simian - Similarity Analyser. It works for plain text as well as code.

江城子 2024-12-19 09:21:57

Very interesting idea. I would start by looking into longest common subsequence algorithms, and see about adapting them to (1) report any sequence over some threshold, say, 20 words, and (2) see if you can get them to handle a bit of fuzziness, in case a word or two gets changed. I'd suggest looking at the diff code to start.
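As a rough illustration of both adaptations, here is a minimal sketch using Python's standard `difflib.SequenceMatcher` on word tokens. Note that `SequenceMatcher` finds longest contiguous matching blocks rather than computing a textbook longest common subsequence, so it stands in for the idea rather than implementing it exactly. The function names `tokenize` and `shared_passages`, and the parameters `min_words` and `max_gap`, are made up for this example; the 20-word threshold comes from the suggestion above, and the gap of two words is an assumption about what "a word or two" might mean.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shared_passages(text_a, text_b, min_words=20, max_gap=2):
    """Return (passage_in_a, passage_in_b) pairs that the two texts share.

    min_words: minimum number of matching words for a passage to be reported.
    max_gap:   neighbouring matches separated by at most this many words in
               each text are merged, which tolerates small edits (hypothetical
               fuzziness rule chosen for this sketch).
    """
    a, b = tokenize(text_a), tokenize(text_b)

    # autojunk=False stops difflib from ignoring very frequent words in long
    # texts, which would otherwise break up matches.
    matcher = SequenceMatcher(None, a, b, autojunk=False)

    # The last block returned is a zero-length sentinel, so filter it out.
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]

    # Each entry: [a_start, a_end, b_start, b_end, matched_word_count]
    merged = []
    for m in blocks:
        if merged:
            last = merged[-1]
            gap_a = m.a - last[1]
            gap_b = m.b - last[3]
            if gap_a <= max_gap and gap_b <= max_gap:
                # Close enough to the previous match: extend it.
                last[1] = m.a + m.size
                last[3] = m.b + m.size
                last[4] += m.size
                continue
        merged.append([m.a, m.a + m.size, m.b, m.b + m.size, m.size])

    return [
        (" ".join(a[a_start:a_end]), " ".join(b[b_start:b_end]))
        for a_start, a_end, b_start, b_end, matched in merged
        if matched >= min_words
    ]
```

Calling `shared_passages(old_bill_text, new_bill_text)` on two bill texts would return the overlapping passages as pairs of strings. Comparing every pair of bills this way is expensive, so with around 100,000 bills some cheaper pre-filtering step to pick candidate pairs would probably be needed first.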
