Synchronizing Jena OntModels with bnodes
This question relates to rcreswick's question on Serializing Jena OntModel Changes. I have Jena models on two (or more) machines that need to remain synchronized over sockets. The main issue that I need to address is that the models may contain anonymous nodes (bnodes), which can originate in any of the models.
Question: Am I on the right track here, or is there a better, more robust approach that I'm failing to consider?
I can think of four approaches to this problem:
- Serialize the complete model: This is prohibitively expensive for synchronizing small updates. Also, since changes can occur on either machine, I can't just replace machine B's model with the serialized model from machine A. I need to merge them.
- Serialize a partial model: Use a dedicated model for serialization that only contains the changes that need to be sent over the socket. This approach requires special vocabulary to represent statements that were removed from the model. Presumably, when I serialize the model from machine A to machine B, anonymous node IDs will be unique to machine A but may overlap with IDs for anonymous nodes created on machine B. Therefore, I'll have to rename anonymous nodes and keep a mapping from machine A's anon ids to machine B's ids in order to handle future changes correctly.
- Serialize individual statements: This approach requires no special vocabulary, but may not be as robust. Are there issues other than anonymous nodes that I just haven't encountered yet?
- Generate globally unique bnode ids (NEW): We can generate globally unique IDs for anonymous nodes by prefixing the ID with a unique machine ID. Unfortunately, I haven't figured out how to tell Jena to use my ID generator instead of its own. This would allow us to serialize individual statements without remapping bnode IDs.
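For the last approach, here is a minimal sketch of what such an ID generator might look like, assuming each machine is configured with a label that no other machine uses. The class name `BnodeIdGenerator` and the `machineId-counter` label format are hypothetical, not something Jena provides. As far as I know, Jena accepts a caller-supplied label when creating a blank node (for example `model.createResource(new AnonId(label))`), but treat that as an assumption to check against your Jena version.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: produces bnode labels that do not collide across machines. */
final class BnodeIdGenerator {
    // Assumed to be unique per machine (for example a hostname or a configured UUID).
    private final String machineId;
    private final AtomicLong counter = new AtomicLong();

    BnodeIdGenerator(String machineId) {
        if (machineId == null || machineId.isEmpty()) {
            throw new IllegalArgumentException("machineId must be non-empty");
        }
        this.machineId = machineId;
    }

    /** Returns the next label: "<machineId>-0", "<machineId>-1", and so on. */
    String nextId() {
        return machineId + "-" + counter.getAndIncrement();
    }
}
```

Note that the counter restarts at zero when the process restarts, so a long-lived deployment would need to persist it or swap the counter for something like `java.util.UUID.randomUUID()`.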
Here's an example to ground this discussion a bit more. Suppose I have a list on machine A represented as:
_:a rdf:first myns:tom
_:a rdf:rest rdf:nil
I serialize this model from machine A to machine B. Now, because machine B may already have an (unrelated) anonymous node with id 'a', I remap id 'a' to a new id 'b':
_:b rdf:first myns:tom
_:b rdf:rest rdf:nil
Now the list changes on machine A:
_:a rdf:first myns:tom
_:a rdf:rest _:b
_:b rdf:first myns:dick
_:b rdf:rest rdf:nil
Since machine B has never encountered machine A's id 'b' before, it adds a new mapping from machine A's id 'b' to a new id 'c':
_:b rdf:first myns:tom
_:b rdf:rest _:c
_:c rdf:first myns:dick
_:c rdf:rest rdf:nil
The problem is further complicated with more than two machines. If there is a third machine C, for example, it may have its own anonymous node 'a' that is different from machine A's anonymous node 'a'. Thus, machine B really does need to keep a map from each of the other machines' anonymous node IDs to its local IDs, not just from remote IDs in general to local IDs. When processing incoming changes, it must take into account where the changes came from to map the IDs correctly.
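The per-source mapping described above could be sketched roughly like this. It is plain Java with no Jena calls; the class name `BnodeMapper` and the `local0`, `local1`, ... labels are made up for illustration, and in a real setup the local ID would presumably come from creating a fresh blank node in the receiving model.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: maps (source machine, remote bnode id) to a local bnode id. */
final class BnodeMapper {
    // One table per source machine, so machine A's "a" and machine C's "a" stay distinct.
    private final Map<String, Map<String, String>> bySource = new HashMap<>();
    private int nextLocal = 0;

    /** Returns the local id for a remote bnode, allocating a new one on first sight. */
    String toLocal(String sourceMachine, String remoteId) {
        return bySource
                .computeIfAbsent(sourceMachine, s -> new HashMap<>())
                .computeIfAbsent(remoteId, r -> "local" + (nextLocal++));
    }
}
```

Machine B would call `toLocal("A", "b")` for every bnode in a change received from machine A. Repeated calls return the same local ID, which is what keeps later updates to the same list node consistent. The sketch is not thread-safe, so it assumes incoming changes are applied from a single thread.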
Comments (1)
Are you allowed to add your own triples to the model? If so, I would introduce a statement for every bnode, giving each an alternate public id in the form of a URN. You can now start matching bnodes between the two models.
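To make the matching step concrete, here is a rough sketch in plain Java. It assumes each model has already been scanned into a map from local bnode label to the public URN attached to it; the property that would carry the URN in the model, and the names `PublicIdMatcher`, `mint`, and `match`, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper: pairs up bnodes from two models by a shared public URN. */
final class PublicIdMatcher {
    /** Mints a fresh public id for a newly created bnode. */
    static String mint() {
        return "urn:uuid:" + UUID.randomUUID();
    }

    /**
     * Each argument maps a model-local bnode label to its public URN.
     * Returns a map from labels in the first model to the labels in the
     * second model that carry the same URN.
     */
    static Map<String, String> match(Map<String, String> urnByBnodeA,
                                     Map<String, String> urnByBnodeB) {
        // Invert the second model's table so it can be looked up by URN.
        Map<String, String> bnodeByUrnB = new HashMap<>();
        for (Map.Entry<String, String> e : urnByBnodeB.entrySet()) {
            bnodeByUrnB.put(e.getValue(), e.getKey());
        }
        Map<String, String> matches = new HashMap<>();
        for (Map.Entry<String, String> e : urnByBnodeA.entrySet()) {
            String other = bnodeByUrnB.get(e.getValue());
            if (other != null) {
                matches.put(e.getKey(), other);
            }
        }
        return matches;
    }
}
```

Because the URN travels with the data as an ordinary statement, the receiving side does not need to know which machine a bnode came from.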
Blank nodes or not, though, two-way sync will only get you so far. If you are trying to detect equivalent concurrent changes made on both models, matching IDs like this will not be enough.
Here's an example. Let's say you are starting a new lawn care company. In order to drum up some business, you and your partner go to a local outdoor event and try to book some discounted trial appointments. The two of you, each armed with a laptop, mingle and record anyone interested. Each record has:
Let's say each record is stored as a resource in your model. It is possible for you to meet the husband, and your partner to meet the wife, of the same household. Whether or not you coincidentally book the same appointment dateTime, the system would be hard-pressed to de-duplicate the entry. Whether you use a bnode for each record or a UUID-based URI, it would not de-duplicate. The only hope is to use, say, the phone number in some canonical form to synthesize a deterministic URI for the record.
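A deterministic URI along those lines might be built like this. This is only a sketch: the `http://example.org/household/` namespace and the class name `RecordUri` are invented for illustration, and "canonical form" here means nothing more than stripping every non-digit character.

```java
/** Hypothetical helper: derives a record URI from a phone number. */
final class RecordUri {
    // Made-up namespace, for illustration only.
    private static final String BASE = "http://example.org/household/";

    /** Keeps digits only, so "(555) 123-4567" and "555.123.4567" agree. */
    static String canonicalPhone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Builds the same URI on every machine for the same phone number. */
    static String uriFor(String rawPhone) {
        String digits = canonicalPhone(rawPhone);
        if (digits.isEmpty()) {
            throw new IllegalArgumentException("no digits in phone number");
        }
        return BASE + digits;
    }
}
```

Both laptops would then produce the same resource for the household, so merging the two models collapses the duplicate entries. Digit stripping does not reconcile numbers written with and without a country code, so a real canonical form would need to handle that as well.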