独立于 DBMS 的查询
我的硕士论文是关于通过分析元数据和存储的数据来发现糟糕的数据库设计。我们通过从给定的 DBMS 中提取元数据模型,然后对此元数据运行一组规则来实现此目的。
为了通过数据分析来扩展这个过程,我们需要允许规则直接查询数据库,但我们必须保留DBMS的独立性,这样查询就可以应用于PostgreSQL、MSSQL和MySQL。
我们已经讨论了一种查询的功能构造,例如:
new Query(new Select(columnID), new From(tableID), new Where(new Equality(columnID1, columnID2)))
然后使用特定于 DBMS 的序列化器。
另一种方法是让规则自己处理这一切:
public Query QueryDatabase(DBMS dbms)
{
if (dbms == PostgreSQL) { return "select count(1) from Users"}
if (dbms == MSSQL) {return ....}
}
我们是否遗漏了什么?所有这些实际上都存在于某个不错的图书馆中吗?是的,我们已经研究过实体框架,但它们似乎依赖于数据库的静态类型模型,而由于显而易见的原因无法创建该模型。
我应该提到的是,我们维护一个可扩展的规则架构,允许最终用户实现自己的规则。
为了阐明我们想要实现的目标,请看下面的查询(mssql),它需要两个参数,表的名称(@table)和列的名称(@column):
DECLARE @TotalCount FLOAT;
SELECT @TotalCount = COUNT(1) FROM [@table];
SELECT SUM(pcount * LOG10(@TotalCount / pcount)) / (LOG10(2) * @TotalCount)
FROM (SELECT (Count([@column])) as pcount
FROM [@table]
GROUP BY [@column]) as exp1
该查询测量存储的信息量在给定的属性中,通过估计熵。它需要访问表中的所有行。为了避免从数据库中提取所有行并通过慢速网络连接传输它们,最好用 SQL 表达它们,并且仅传输单个数字。
注意:我们确实拥有我们需要的所有元数据。此问题仅用于访问数据!
我不太确定是否要将其添加到我已经很长的问题中,编辑现有答案或做什么。请随时提供建议。 ;)
基于 mrnye 答案:
new Query()
.Variable(varname => FLOAT)
.Set(varname => new Query().Count(1).From(table) )
.Select(new Aggregate().Sum(varname => "pcount * LOG10(varname / pcount)"))
.From(
new Query()
.Select(pcount => new Aggregate().Count(column)
.From(table)
.GroupBy(column)
)
除了语法错误和 lambda 语句的误用之外,我还考虑了使用一些扩展方法来构建查询的想法。这看起来确实是一个相当复杂的方法。您如何看待这种方法?
建立在 LINQ 答案的基础上:
let totalCount = Table.Count
from uv un from r in Table
group r by r["attr"]
select r.Count
select r.Count * Log2((totalCount / r.Count))
看起来相当不错,但要实现的工作量很大......
My masters thesis is about discovering bad database design by analyzing metadata and the data stored. We do this by extracting a metadata model from a given DBMS and then running a set of rules on this metadata.
To extend this process with data analysis, we need to allow rules to query the database directly, but we must retain DBMS independence, such that queries can be applied to PostgreSQL, MSSQL and MySQL.
We have discussed a sort of functional construction of queries such as:
new Query(new Select(columnID), new From(tableID), new Where(new Equality(columnID1, columnID2)))
And then using a DBMS-specific serializer.
Another approach is to let rules handle it all by themselves:
public Query QueryDatabase(DBMS dbms)
{
if (dbms == PostgreSQL) { return "select count(1) from Users"}
if (dbms == MSSQL) {return ....}
}
Are we missing something? Does all this in fact exist in a nice library somewhere? And yes, we have looked at Entity frameworks, but they seem to rely on a statically types model of the database, which for obvious reasons cannot be created.
I should mention that we maintain an extensible rule architecture, allowing end users to implement their own rules.
To clarify what we want to achieve, look at the following query (mssql), it needs two parameters, the name of the table (@table) and the name of the column (@column):
DECLARE @TotalCount FLOAT;
SELECT @TotalCount = COUNT(1) FROM [@table];
SELECT SUM(pcount * LOG10(@TotalCount / pcount)) / (LOG10(2) * @TotalCount)
FROM (SELECT (Count([@column])) as pcount
FROM [@table]
GROUP BY [@column]) as exp1
The query measures the amount of information stored in a given attribute, by estimating the entropy. It needs to access all rows in the table. To avoid extracting all rows from the database and transferring them over a slow network connection it is better to express them in SQL an only transfer a single number.
NOTE: We DO have all the metadata we need. This question is only for accessing data!
I was not very sure of whether to add this to my already long question, edit an existing answer or what todo. Please feel free to advise. ;)
Building on mrnye answer:
new Query()
.Variable(varname => FLOAT)
.Set(varname => new Query().Count(1).From(table) )
.Select(new Aggregate().Sum(varname => "pcount * LOG10(varname / pcount)"))
.From(
new Query()
.Select(pcount => new Aggregate().Count(column)
.From(table)
.GroupBy(column)
)
Syntax errors and misuse of lambda statements aside, i played with the idea of using some extension methods for building queries. It does seem as a fairly complex approach. How would you think about such an approach?
Building on the LINQ answer:
let totalCount = Table.Count
from uv un from r in Table
group r by r["attr"]
select r.Count
select r.Count * Log2((totalCount / r.Count))
Seems fairly nice, but a helluva lot to implement...
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
您可以通过实现自定义 LINQ 提供程序来实现相同的效果基础设施。查询是通用的,但是 AST 树访问者 生成 SQL 查询的工具可以是可插入的。您甚至可以使用内存数据存储来模拟数据库,并将自定义 LINQ 查询转换为 LINQ to 对象查询!
您需要创建一个知道如何从对象的索引器中提取列名的提供程序。这是一个可以扩展的基本框架:
You could achieve the same by implementing a custom LINQ provider infrastructure. The queries are generic, but the AST tree visitors that generate the SQL queries can be made pluggable. You can even mock a database using a in memory data store and translating your custom LINQ query to a LINQ to objects query!
You would need to create a provider that would know how to extract the column name from the object's indexer. Here is a basic framework that you can extend:
我认为 LINQ 路线是可行的方法,但为了好玩,我尝试想出一个解决方案。它需要一些工作,但总体思路是让查询接口流畅并将实现逻辑隐藏在接口后面。只是把它扔在那里作为思考的食物......
I think the LINQ route is the way to go, but for fun I tried to think of a solution. It needs some work, but the general idea is to have the query interface fluent and hide the implementation logic behind interfaces. Just throwing it out there as food for thought...
检索数据库信息的最独立于 DBMS 的方法是 iNFORMATION_SCHEMA。请参阅 MySQL、SQL Server,PostgreSQL。
我有点好奇你在想什么类型的规则。 :)
The most DBMS-independent way to retrieve information about a database is iNFORMATION_SCHEMA. See MySQL, SQL Server, PostgreSQL.
I'm kind of curious what type of rules you're thinking of. :)
NHibernate 怎么样? http://community.jboss.org/wiki/NHibernateforNET
What about NHibernate? http://community.jboss.org/wiki/NHibernateforNET