Avoiding DataNucleus joins?

Posted on 2024-09-08 16:37:19

I'm experimenting with moving a JDBC webapp to JDO DataNucleus 2.1.1.

Assume I have some classes that look something like this:

public class Position {
private Integer id;
private String title;
}

public class Employee {
private Integer id;
private String name;
private Position position;
}

The contents of the Position SQL table really don't change very often. Using JDBC, I read the entire table into memory (with the ability to refresh periodically or at-will). Then, when I read an Employee into memory, I simply retrieve the position ID from the Employee table and use that to obtain the in-memory Position instance.
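The JDBC-era approach described above can be sketched roughly as follows. This is a minimal illustration only: `CachedPosition`, `PositionTableCache`, and the `loader` supplier are hypothetical names, and the supplier stands in for whatever JDBC code reads the Position table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical value class mirroring the Position table (ID, TITLE).
final class CachedPosition {
    final int id;
    final String title;

    CachedPosition(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Holds the whole Position table in memory; refresh() reloads it on demand.
final class PositionTableCache {
    // Assumption: the loader wraps something like "SELECT ID, TITLE FROM POSITION".
    private final Supplier<List<CachedPosition>> loader;
    private volatile Map<Integer, CachedPosition> byId = new HashMap<>();

    PositionTableCache(Supplier<List<CachedPosition>> loader) {
        this.loader = loader;
        refresh();
    }

    // Re-read every row, then swap the map in one step so readers never see a partial table.
    void refresh() {
        Map<Integer, CachedPosition> fresh = new HashMap<>();
        for (CachedPosition p : loader.get()) {
            fresh.put(p.id, p);
        }
        byId = fresh;
    }

    // Resolve the position ID read from the Employee row; returns null if the ID is unknown.
    CachedPosition lookup(int positionId) {
        return byId.get(positionId);
    }
}
```

With this shape, reading an Employee touches only the Employee table, and the position is resolved by a map lookup.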

However, using DataNucleus, if I iterate over all Positions:

Extent<Position> extent = pm.getExtent(Position.class, true);
Iterator<Position> iter = extent.iterator();
while (iter.hasNext()) {
   Position position = iter.next();
   System.out.println(position.toString());
}

And then later, with a different PersistenceManager, iterate over all Employees, obtaining their Position:

Extent<Employee> extent = pm.getExtent(Employee.class, true);
Iterator<Employee> iter = extent.iterator();
while (iter.hasNext()) {
   Employee employee = iter.next();
   System.out.println(employee.getPosition());
}

Then DataNucleus appears to produce SQL joining the two tables when I obtain an Employee's Position:

SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = <1>

My understanding is that DataNucleus will use a cached Position instance, when available. (Is that correct?) However, I'm concerned that the joins will degrade performance. I'm not yet far enough along to run benchmarks. Are my fears misplaced? Should I continue, and benchmark? Is there a way to have DataNucleus avoid the join?

<jdo>
<package name="com.example.staff">
    <class name="Position" identity-type="application" schema="MYSCHEMA" table="Position">
        <inheritance strategy="new-table"/>
        <field name="id" primary-key="true">
            <column name="ID" jdbc-type="integer"/>
        </field>
        <field name="title">
            <column name="TITLE" jdbc-type="varchar"/>
        </field>
    </class>
</package>
</jdo>

<jdo>
<package name="com.example.staff">
    <class name="Employee" identity-type="application" schema="MYSCHEMA" table="EMPLOYEE">
        <inheritance strategy="new-table"/>
        <field name="id" primary-key="true">
            <column name="ID" jdbc-type="integer"/>
        </field>
        <field name="name">
            <column name="NAME" jdbc-type="varchar"/>
        </field>
        <field name="position" table="Position">
            <column name="POSITION_ID" jdbc-type="int" />
            <join column="ID" />
        </field>
    </class>
</package>
</jdo>

I guess what I'm hoping to be able to do is tell DataNucleus to go ahead and read the POSITION_ID int as part of the default fetch group, and see if the corresponding Position is already cached. If so, then set that field. If not, then do the join later, if called upon. Better yet, go ahead and stash that int ID somewhere, and use it if getPosition() is later called. That would avoid the join in all cases.

I would think that knowing the class and the primary key value would be enough to avoid the naive case, but I don't yet know enough about DataNucleus.

With the helpful feedback I've received, my .jdo is now cleaned up. However, after adding the POSITION_ID field to the default fetch group, I'm still getting a join.

SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME",A0.POSITION_ID,B0.ID,B0.TITLE FROM MYSCHEMA.EMPLOYEE A0 LEFT OUTER JOIN MYSCHEMA."POSITION" B0 ON A0.POSITION_ID = B0.ID

I understand why it is doing that: the naive method will always work. I was just hoping it was capable of more. DataNucleus might skip reading some columns from the result set and return the cached Position instead, but it is still calling upon the datastore to access a second table, with all that entails, including possible disk seeks and reads. The fact that it will throw that work away is little consolation.

What I was hoping to do was tell DataNucleus that all Positions will be cached, trust me on that. And if for some reason you find one that isn't, blame me for the cache miss. I understand that you'll have to (transparently) perform a separate select on the Position table. (Even better, pin any Positions you do have to go fetch due to a cache miss. That way there won't be a cache miss on the object again.)

That is what I'm doing now using JDBC, by way of a DAO. One of the reasons for investigating a persistence layer was to ditch these DAOs. It is difficult to imagine moving to a persistence layer that can't move beyond naive fetches resulting in expensive joins.

As soon as Employee has not only a Position, but a Department, and other fields, an Employee fetch causes a half dozen tables to be accessed, even though all of those objects are already pinned in the cache, and are addressable given their class and primary key. In fact, I can implement this myself, changing Employee.position to an Integer, creating an IntIdentity, and passing it to PersistenceManager.getObjectById().
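The do-it-yourself workaround mentioned above might look roughly like this. It is a sketch under assumptions: `PositionRef` and `EmployeeWithFk` are hypothetical stand-ins for the mapped classes, and the `finder` function stands in for a JDO lookup such as `id -> pm.getObjectById(Position.class, id)`, so the snippet itself needs no JDO library.

```java
import java.util.function.IntFunction;

// Hypothetical stand-in for the Position class.
final class PositionRef {
    final int id;
    final String title;

    PositionRef(int id, String title) {
        this.id = id;
        this.title = title;
    }
}

// Employee keeps only the foreign key; the Position is resolved on request.
final class EmployeeWithFk {
    private final int id;
    private final String name;
    private final Integer positionId; // persisted as the plain POSITION_ID column

    EmployeeWithFk(int id, String name, Integer positionId) {
        this.id = id;
        this.name = name;
        this.positionId = positionId;
    }

    int getId() { return id; }

    String getName() { return name; }

    // 'finder' is an assumption standing in for a by-identity lookup, which a
    // persistence layer can often answer from its cache without touching the database.
    PositionRef getPosition(IntFunction<PositionRef> finder) {
        return positionId == null ? null : finder.apply(positionId);
    }
}
```

The cost of this design is that the relationship is no longer visible to the persistence layer, so queries that navigate from Employee to Position have to be written against the ID instead.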

What I think I'm hearing is that DataNucleus is not capable of this optimization. Is that right? It's fine, just not what I expected.

束缚ｍ 2024-09-15 16:37:19

By default, a join will not be done when the Employee entity is fetched from the datastore; it will only be done when Employee.position is actually read (this is called lazy loading).

Additionally, this second fetch can be avoided using the level 2 cache. First check that the level 2 cache is actually enabled (in DataNucleus 1.1 it is disabled by default, in 2.0 it is enabled by default). You should probably then "pin" the class so that its Position entities are cached indefinitely.
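In JDO, pinning is exposed through the `DataStoreCache` obtained from `PersistenceManagerFactory.getDataStoreCache()`, which offers `pinAll` overloads for a whole class. The snippet below does not call that API; it is only a toy model (`ToyLevel2Cache` is a hypothetical class) of what pinning means: entries of pinned classes survive eviction, everything else may be dropped.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of a level 2 cache keyed by (class, id). Not the DataNucleus implementation.
final class ToyLevel2Cache {
    private final Map<Class<?>, Map<Object, Object>> entries = new HashMap<>();
    private final Set<Class<?>> pinned = new HashSet<>();

    void put(Class<?> type, Object id, Object value) {
        entries.computeIfAbsent(type, t -> new HashMap<>()).put(id, value);
    }

    // Returns the cached object, or null on a cache miss.
    Object get(Class<?> type, Object id) {
        Map<Object, Object> perClass = entries.get(type);
        return perClass == null ? null : perClass.get(id);
    }

    // Similar in spirit to pinning a whole class in the JDO DataStoreCache.
    void pinAll(Class<?> type) {
        pinned.add(type);
    }

    // Simulates memory pressure: every class that is not pinned loses its entries.
    void evictUnpinned() {
        entries.keySet().retainAll(pinned);
    }
}
```

In this model a pinned class behaves like the hand-maintained in-memory table from the question, except that it fills up as objects are first read.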

The level 2 cache can cause issues if other applications use the same database, however, so I would recommend only enabling it for classes such as Position which are rarely changed. For other classes, set the "cacheable" attribute to false (default is true).
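As a rough sketch of that last suggestion, and assuming the JDO metadata version in use accepts a `cacheable` attribute on the `<class>` element (worth checking against the DataNucleus documentation for your release), the Employee mapping could opt out of the level 2 cache like this:

```xml
<class name="Employee" identity-type="application" schema="MYSCHEMA" table="EMPLOYEE" cacheable="false">
    <!-- field mappings unchanged -->
</class>
```

Position would keep the default, so it stays eligible for caching.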

EDITED TO ADD:

The <join> tag in your metadata is not suitable for this situation. In fact, you don't need to specify the relationship explicitly at all; DataNucleus will figure it out from the types. But you are right when you say that you need POSITION_ID to be read in the default fetch group. This can all be achieved with the following change to your metadata:

<field name="position" default-fetch-group="true">
    <column name="POSITION_ID" jdbc-type="int" />
</field>

EDITED TO ADD:

Just to clarify, after making the metadata change described above I ran the test code which you provided (backed by a MySQL database) and I saw only these two queries:

SELECT 'com.example.staff.Position' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`TITLE` FROM `POSITION` `THIS` FOR UPDATE
SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE

If I run only the second part of the code (the Employee extent), then I see only the second query, without any access to the POSITION table at all. Why? Because DataNucleus initially provides "hollow" Position objects and the default implementation of Position.toString() inherited from Object doesn't access any internal fields. If I override the toString() method to return the position's title, and then run the second part of your sample code, then the calls to the database are:

SELECT 'com.example.staff.Employee' AS NUCLEUS_TYPE,`THIS`.`ID`,`THIS`.`NAME`,`THIS`.`POSITION_ID` FROM `EMPLOYEE` `THIS` FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <2> FOR UPDATE
SELECT `A0`.`TITLE` FROM `POSITION` `A0` WHERE `A0`.`ID` = <1> FOR UPDATE

(and so on, one fetch per Position entity). As you can see, there are no joins being performed, and so I'm surprised to hear that your experience is different.
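The "hollow" behaviour described above can be modelled in plain Java. This is a toy illustration rather than how DataNucleus implements it (real JDO classes are bytecode-enhanced so that field access triggers the load); `HollowPosition` and `titleLoader` are hypothetical names, with the loader standing in for the per-entity `SELECT ... TITLE ... WHERE ID = ?` shown in the log.

```java
import java.util.function.IntFunction;

// Toy model of a hollow object: the identity is known up front,
// the remaining fields are fetched on first access.
final class HollowPosition {
    private final int id;
    private final IntFunction<String> titleLoader; // assumption: wraps the datastore fetch
    private String title;
    private boolean loaded;

    HollowPosition(int id, IntFunction<String> titleLoader) {
        this.id = id;
        this.titleLoader = titleLoader;
    }

    // The identity is already in memory, so no datastore access is needed.
    int getId() {
        return id;
    }

    // The first call loads the field; later calls reuse the loaded value.
    String getTitle() {
        if (!loaded) {
            title = titleLoader.apply(id);
            loaded = true;
        }
        return title;
    }

    // Touching a persistent field here is what forces the load.
    @Override
    public String toString() {
        return getTitle();
    }
}
```

This matches the observation that the inherited `Object.toString()` caused no Position query, while a `toString()` that returns the title caused one fetch per Position.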

Regarding your description of how you hope caching should work, that is how the level 2 cache ought to work when a class is pinned. In fact, I wouldn't even bother trying to pre-load Position objects into the cache at application start-up. Just let DN cache them cumulatively.

It's true that you may have to accept some compromises if you adopt JDO...you'll have to relinquish the absolute control that you get with hand-rolled JDBC-based DAOs. But in this case at least you should be able to achieve what you want. It really is one of the archetypal use cases for the level 2 cache.

面犯桃花 2024-09-15 16:37:19

Adding on to Todd's reply, to clarify a few things.

1. A <join> tag on a 1-1 relation means nothing. Well, it could be interpreted as saying "create a join table to store this relationship", but then DataNucleus doesn't support such a concept since best practice is to use a FK in either owner or related table. So remove the <join>.

2. A "table" on a 1-1 relation suggests that it is stored in a secondary table, yet you don't want that either, so remove it.

3. You retrieve Position objects, so it issues something like

   SELECT 'org.datanucleus.test.Position' AS NUCLEUS_TYPE,A0.ID,A0.TITLE FROM "POSITION" A0

4. You retrieve Employee objects, so it issues something like

   SELECT 'org.datanucleus.test.Employee' AS NUCLEUS_TYPE,A0.ID,A0."NAME" FROM EMPLOYEE A0

   Note that it doesn't retrieve the FK for the position here since that field is not in the default fetch group (lazy loaded).

5. You access the position field of an Employee object, so it needs to retrieve the FK (since it doesn't know which Position object relates to this Employee), so it issues

   SELECT A0.POSITION_ID,B0.ID,B0.TITLE FROM EMPLOYEE A0 LEFT OUTER JOIN "POSITION" B0 ON A0.POSITION_ID = B0.ID WHERE A0.ID = ?

6. At this point it doesn't need to retrieve the Position object since it is already present (in the cache), so that object is returned.

All of this is expected behaviour IMHO. You could put the "position" field of Employee into its default fetch group and that FK would be retrieved in step 4, hence removing one SQL call.
