Storing metadata into a Jackrabbit repository

Posted on 2024-10-19 12:00:48

Can anybody explain how to proceed in the following scenario?

  1. Receiving documents (MS docs, ODS, PDF)

  2. Dublin Core metadata extraction via Apache Tika + content extraction via jackrabbit-content-extractors

  3. Using Jackrabbit to store the documents (content) in the repository together with their metadata

  4. Retrieving documents + metadata

I'm interested in points 3 and 4.

DETAILS:
The application processes documents interactively (some analysis such as language detection and word count, plus gathering as many details as possible: Dublin Core, content parsing, event handling). It returns the results of the processing to the user and then stores the extracted content and the metadata (both extracted and custom user metadata) in the JCR repository.

I'd appreciate any help. Thank you.


Comments (3)

情释 2024-10-26 12:00:48

Uploading files is basically the same for JCR 2.0 as it is for JCR 1.0. However, JCR 2.0 adds a few additional built-in property definitions that are useful.

The "nt:file" node type is intended to represent a file and has two built-in property definitions in JCR 2.0 (both of which are auto-created by the repository when nodes are created):

  • jcr:created (DATE)
  • jcr:createdBy (STRING)

and defines a single child named "jcr:content". This "jcr:content" node can be of any node type, but generally speaking all information pertaining to the content itself is stored on this child node. The de facto standard is to use the "nt:resource" node type, which has these properties defined:

  • jcr:data (BINARY) mandatory
  • jcr:lastModified (DATE) autocreated
  • jcr:lastModifiedBy (STRING) autocreated
  • jcr:mimeType (STRING) protected?
  • jcr:encoding (STRING) protected?

Note that "jcr:mimeType" and "jcr:encoding" were added in JCR 2.0.

In particular, the purpose of the "jcr:mimeType" property was to do exactly what you're asking for - capture the "type" of the content. However, the "jcr:mimeType" and "jcr:encoding" property definitions can be defined (by the JCR implementation) as protected (meaning the JCR implementation automatically sets them) - if this is the case, you would not be allowed to manually set these properties. I believe that Jackrabbit and ModeShape do not treat these as protected.

Here is some code that shows how to upload a file into a JCR 2.0 repository using these built-in node types:

// Get an input stream for the file ...
File file = ...
InputStream stream = new BufferedInputStream(new FileInputStream(file));

Node folder = session.getNode("/absolute/path/to/folder/node");
// Named fileNode so it does not clash with the java.io.File variable above
Node fileNode = folder.addNode("Article.pdf","nt:file");
Node content = fileNode.addNode("jcr:content","nt:resource");
Binary binary = session.getValueFactory().createBinary(stream);
content.setProperty("jcr:data",binary);

And if the JCR implementation does not treat the "jcr:mimeType" property as protected (which I believe is the case for Jackrabbit and ModeShape), you have to set this property manually:

content.setProperty("jcr:mimeType","application/pdf");
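
Where the value for "jcr:mimeType" comes from is up to the application. As a minimal sketch, assuming the type can be guessed from the file extension alone, a small lookup helper might look like this. The class name MimeTypes, the method forFileName and the table contents are illustrative assumptions, not part of JCR or Jackrabbit; a detector such as Apache Tika, which the question already uses, would likely be more reliable.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of JCR or Jackrabbit): picks a value for
// jcr:mimeType from the extension of the file name.
final class MimeTypes {
    // Small illustrative table; extend it for the formats you actually accept.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "doc", "application/msword",
            "ods", "application/vnd.oasis.opendocument.spreadsheet",
            "txt", "text/plain");

    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // no extension: generic binary
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(forFileName("Article.pdf"));
    }
}
```

The result could then be passed to content.setProperty("jcr:mimeType", ...) in place of the hard-coded string.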

Metadata can very easily be stored on the "nt:file" and "jcr:content" nodes, but out of the box the "nt:file" and "nt:resource" node types don't allow extra properties. So before you can add other properties, you first need to add a mixin (or multiple mixins) that has property definitions for the kinds of properties you want to store. You can even define a mixin that allows any property. Here is a CND file defining such a mixin:

<custom = 'http://example.com/mydomain'>
[custom:extensible] mixin
- * (undefined) multiple 
- * (undefined) 

After registering this node type definition, you can then use this on your nodes:

content.addMixin("custom:extensible");
content.setProperty("anyProp","some value");
content.setProperty("custom:otherProp","some other value");

You could also define and use a mixin that allowed for any Dublin Core element:

<dc = 'http://purl.org/dc/elements/1.1/'>
[dc:metadata] mixin
- dc:contributor (STRING)
- dc:coverage (STRING)
- dc:creator (STRING)
- dc:date (DATE)
- dc:description (STRING)
- dc:format (STRING)
- dc:identifier (STRING)
- dc:language (STRING)
- dc:publisher (STRING)
- dc:relation (STRING)
- dc:rights (STRING)
- dc:source (STRING)
- dc:subject (STRING)
- dc:title (STRING)
- dc:type (STRING)

All of these properties are optional, and unlike "custom:extensible" this mixin doesn't allow properties of arbitrary name or type. With this "dc:metadata" mixin I've also not really addressed the fact that some of these elements are already represented by built-in properties (e.g., "jcr:createdBy", "jcr:lastModifiedBy", "jcr:created", "jcr:lastModified", "jcr:mimeType") and that some of them may relate more to the content while others relate more to the file.
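
To make the mapping concrete, here is a rough sketch of how extracted metadata might be narrowed down to Dublin Core properties before writing them to a node. It assumes the extractor hands back a plain Map of names to string values; DublinCoreMapper and toDcProperties are hypothetical names, and the element names are the 15 of the Dublin Core element set (where the element is called "rights").

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: keeps only the entries that correspond to a
// Dublin Core element and normalizes their names to the "dc:" prefix.
final class DublinCoreMapper {
    private static final Set<String> ELEMENTS = Set.of(
            "contributor", "coverage", "creator", "date", "description",
            "format", "identifier", "language", "publisher", "relation",
            "rights", "source", "subject", "title", "type");

    static Map<String, String> toDcProperties(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = entry.getKey();
            // Accept both "title" and "dc:title" as input names
            String local = key.startsWith("dc:") ? key.substring(3) : key;
            if (ELEMENTS.contains(local) && entry.getValue() != null) {
                result.put("dc:" + local, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("title", "Annual report");
        extracted.put("dc:creator", "Ann");
        extracted.put("Content-Type", "application/pdf"); // not a DC element, dropped
        System.out.println(toDcProperties(extracted));
    }
}
```

Each resulting entry could then be written with node.setProperty(name, value) once the mixin has been added. Since dc:date is declared as DATE, its value would have to be a Calendar or a string the repository can convert to a date.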

You could of course define other mixins that better suit your metadata needs, using inheritance where needed. But be careful using inheritance with mixins: since JCR allows a node to have multiple mixins, it's often best to design your mixins to be tightly scoped and facet-oriented (e.g., "ex:taggable", "ex:describable", etc.) and then simply apply the appropriate mixins to a node as needed.

(It's even possible, though much more complicated, to define a mixin that allows more children under the "nt:file" nodes, and to store some metadata there.)

Mixins are fantastic and give a tremendous amount of flexibility and power to your JCR content.

Oh, and when you've created all of the nodes you want, be sure to save the session:

session.save();

哭了丶谁疼 2024-10-26 12:00:48

I am a bit rusty with JCR and I have never used 2.0 but this should get you started.

See this link. You'll want to open up the second comment.

You just store the file in a node and add additional metadata to the node. Here is how to store the file:

Node folder = session.getRootNode().getNode("path/to/file/uploads");
Node file = folder.addNode(fileName, "nt:file");
// nt:file defines no default type for jcr:content, so name one explicitly
Node fileContent = file.addNode("jcr:content", "nt:resource");
fileContent.setProperty("jcr:data", fileStream);
// Add other metadata
session.save();

How you store metadata is up to you. A simple way is to just store key-value pairs (on an "nt:resource" node this requires a mixin that permits the extra properties):

fileContent.setProperty(key, value, PropertyType.STRING);

To read the data you just call getProperty().

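// getProperty() returns a Property; read its value with the typed accessors (getBinary() is JCR 2.0)
InputStream fileStream = fileContent.getProperty("jcr:data").getBinary().getStream();
String value = fileContent.getProperty(key).getString();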

花开雨落又逢春i 2024-10-26 12:00:48

I am new to Jackrabbit and am working with 2.4.2.
As for your solution, you can check the document type with plain Java logic and add cases for any variation in how each type should be handled.

You won't need to worry about saving the contents of different file types such as .txt or .pdf, because their
content is converted into binary and saved.
Here is a small sample in which I uploaded a PDF file to a Jackrabbit repository and downloaded it again.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import javax.jcr.Binary;
    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;

    import org.apache.jackrabbit.core.TransientRepository;

    // This program is for sample purposes only, so everything is hard coded.
    // The class name, the repository setup and the admin/admin login are
    // assumptions: the original snippet started inside the try block.
    public class PdfRoundTrip {

        public static void main(String[] args) throws Exception {
            Repository repository = new TransientRepository();
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                Node root = session.getRootNode();

                if (!root.hasNode("Alfresco_E0_Training.pdf")) {
                    // Import the PDF file unless already imported
                    System.out.print("Importing PDF... ");
                    Node file = root.addNode("Alfresco_E0_Training.pdf", "nt:file");
                    Node content = file.addNode("jcr:content", "nt:resource");
                    try (InputStream stream = new FileInputStream(
                            "<path of file>\\Alfresco_E0_Training.pdf")) {
                        Binary binary = session.getValueFactory().createBinary(stream);
                        content.setProperty("jcr:data", binary);
                    }
                    session.save();
                    System.out.println("done.");
                    printNodeInfo(file, content);
                } else {
                    // Already imported: read the binary back and write it to disk
                    Node file = root.getNode("Alfresco_E0_Training.pdf");
                    Node content = file.getNode("jcr:content");
                    Binary bin = content.getProperty("jcr:data").getBinary();
                    try (InputStream stream = bin.getStream();
                         OutputStream out = new FileOutputStream(
                                 "C:<path of the output file>\\Alfresco_E0_Training.pdf")) {
                        byte[] buf = new byte[1024];
                        int len;
                        while ((len = stream.read(buf)) > 0) {
                            out.write(buf, 0, len);
                        }
                    }
                    bin.dispose();
                    System.out.println("\nFile is created.");
                    printNodeInfo(file, content);
                }
            } catch (IOException e) {
                System.out.println("Exception: " + e);
            } finally {
                session.logout();
            }
        }

        // Prints the names and identifiers of the file node and its content node
        private static void printNodeInfo(Node file, Node content) throws RepositoryException {
            System.out.println("::::: Checking content of the node :::::");
            System.out.println("File Node Name : " + file.getName());
            System.out.println("File Node Identifier : " + file.getIdentifier());
            System.out.println("Content Node Name : " + content.getName());
            System.out.println("Content Node Identifier : " + content.getIdentifier());
            System.out.println("Content Node Content : " + content.getProperty("jcr:data"));
        }
    }
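
The manual copy loop in the download branch could also be handed to the standard library. A minimal sketch, assuming Java 8 or later and an InputStream such as the one returned by Binary.getStream(); BinaryExport and saveTo are hypothetical names, not part of JCR or Jackrabbit:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes a stream to a file and closes the stream.
final class BinaryExport {

    // Returns the number of bytes written; replaces the target if it exists.
    static long saveTo(InputStream in, Path target) {
        try (InputStream stream = in) {
            return Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the repository stream: three bytes from memory
        Path target = Files.createTempFile("export", ".bin");
        long written = saveTo(new ByteArrayInputStream(new byte[] {1, 2, 3}), target);
        System.out.println(written + " bytes written to " + target);
    }
}
```

In the sample above, the call would be saveTo(bin.getStream(), path) with whatever output path the application chooses.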

Hope this helps
