将 XML 导入 PostgreSQL 数据库或转换 XML -> CSV-> PostgreSQL

发布于 2024-10-31 16:56:31 字数 1020 浏览 0 评论 0原文

我有 XML 文件

<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt"> <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">

<Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/> <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/> <Variant WRITTENFORM="unknown" type="orthographic"/> <Variant WRITTENFORM="FMN-binding" type="orthographic"/> <Variant WRITTENFORM="Pden_2486" type="orthographic"/> <Variant WRITTENFORM="nirI" type="orthographic"/> <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/> <PosDC posName="POS" pos="N"/> <DC att="uniprot_ac" val="Q51699"/> <DC att="speciesNameNCBI" val="318586"/>

</Entry> </Cluster>

，我需要将此内容导入到 postgresql 中。请在这方面帮助我，要么直接执行程序，要么将 XML 转换为 csv 到 PostgreSQL。

等列的表格，

我需要包含clsid、entryid、semType、baseForm、variant(writingform)、variant(type)、dc(att)、dc(val)

提前谢谢您。

原文

I have XML file

<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt"> <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">

<Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/> <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/> <Variant WRITTENFORM="unknown" type="orthographic"/> <Variant WRITTENFORM="FMN-binding" type="orthographic"/> <Variant WRITTENFORM="Pden_2486" type="orthographic"/> <Variant WRITTENFORM="nirI" type="orthographic"/> <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/> <PosDC posName="POS" pos="N"/> <DC att="uniprot_ac" val="Q51699"/> <DC att="speciesNameNCBI" val="318586"/>

</Entry> </Cluster>

I need to import this content into postgresql. Kindly help me in this regard either direct procedure or convert XML to csv to PostgreSQL.

I need the tables with columns like

clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val)

Thank you in advance.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

苏佲洛 2024-11-07 16:56:31

首先，解析您的 xml 文件以获取包含您需要的所有信息的文件。

例如，如果您只想拥有一张具有属性 clsid、entryid、semType、baseForm、variant(writingform)、variant(type)、dc(att)、dc(val) 的表，那么您只需要一个包含这些属性的文件属性（用一些字符分隔）。文件中的每一行都对应于表中的每一行。

接下来，您在 Postgresql 中创建表架构。然后使用 Postgresql 的 COPY 命令，它将所有数据从文件复制到表中。

请注意，如果您的 xml 文件很大，您应该使用基于事件的解析器。例如 Java 中的 SAX、StAX。

编辑
*注意*：使用的库：stax2-api-3.1.1.jar、woodstox-core-asl-4.1.1jar
这是代码（希望它能满足您的需要，如果没有，我相信它可以帮助您开始）：

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.MalformedURLException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import java.util.ArrayList;
import org.codehaus.stax2.XMLInputFactory2;
import org.codehaus.stax2.XMLStreamReader2;

public class Main {

    /**
     * @param args the command line arguments
     */

    /*
     * dc(att), dc(val)
     */
    @SuppressWarnings("CallToThreadDumpStack")
    public static void main(String[] args) throws MalformedURLException, IOException, XMLStreamException {
        FileInputStream fstream = new FileInputStream(args[0]);
        Reader in = new InputStreamReader(fstream, "UTF-8");
        XMLInputFactory2 factory = (XMLInputFactory2) XMLInputFactory.newInstance();
        XMLStreamReader2 parser = (XMLStreamReader2) factory.createXMLStreamReader(in);

        FileOutputStream outStream = new FileOutputStream("/home/aseke/Desktop/out.txt");
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(outStream, "UTF-8"));


        boolean isCluster = false;
        ArrayList<String> dc = new ArrayList<String>();
        ArrayList<String> variants = new ArrayList<String>();

        /* You actually do not need all of these variables, it's just for clarity */
        String clsID = null;
        String semType = null;
        String varWritten = null;
        String varType = null;
        String entryID = null;
        String baseForm = null;
        String dcAtt = null;
        String dcVal = null;
        String s = null;
        while (true) {
            int event = parser.next();
            if (event == XMLStreamConstants.END_DOCUMENT) {
                parser.close();
                break;
            }

            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = parser.getLocalName();

                if (tag.equals("Cluster")) {
                    isCluster = true;
                    clsID = parser.getAttributeValue(0);
                    semType = parser.getAttributeValue(1);
                } else if (tag.equals("Entry") && isCluster) {
                    entryID = parser.getAttributeValue(0);
                    baseForm = parser.getAttributeValue(1);
                } else if (tag.equals("Variant") && isCluster) {

                    varWritten = parser.getAttributeValue(0);
                    varType = parser.getAttributeValue(1);

                    variants.add(varWritten + "~" + varType);
                } else if (tag.equals("DC") && isCluster) {
                    dcAtt = parser.getAttributeValue(0);
                    dcVal = parser.getAttributeValue(1);

                    dc.add(dcAtt + "~" + dcVal);
                }
            }

            if (event == XMLStreamConstants.END_ELEMENT && isCluster) {
                if (parser.getLocalName().equals("Cluster")) {
                    isCluster = false;
                    //clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val)
                    // Use tabs as delimiter for Postgre COPY
                    String outStr = clsID + "/t" + entryID + "/t" + semType + "/t" + baseForm + "/t";

                    /* Add all variants */
                    for (String var : variants) {
                        String tmp[] = var.split("~");
                        varWritten = tmp[0];
                        varType = tmp[1];
                        outStr += varWritten + "/t" + varType + "/t";
                    }
                    /* Add al DCs */
                    for (String ss : dc) {
                        String[] tmp = ss.split("~");
                        dcAtt = tmp[0];
                        dcVal = tmp[1];
                        outStr += dcAtt + "/t" + dcVal + "/t";
                    }
                    // remove last tab "\t"
                    outStr = outStr.substring(0, outStr.length() - 2);
                    out.write(outStr);
                    variants.clear();
                    dc.clear();

                }
            }
        }

        // close all streams
        fstream.close();
        out.close();
        outStream.close();
    }
}

我格式化了您输入的 xml。所以输入文件看起来像这样：

<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt">
    <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">
        <Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/>
        <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/>
        <Variant WRITTENFORM="unknown" type="orthographic"/>
        <Variant WRITTENFORM="FMN-binding" type="orthographic"/>
        <Variant WRITTENFORM="Pden_2486" type="orthographic"/>
        <Variant WRITTENFORM="nirI" type="orthographic"/>
        <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/>
        <PosDC posName="POS" pos="N"/>
        <DC att="uniprot_ac" val="Q51699"/>
        <DC att="speciesNameNCBI" val="318586"/>
    </Entry>
</Cluster>

输出看起来像这样。请注意，它是用制表符分隔的。制表符稍后将在 Postgre COPY 命令中用作分隔符。您可以将分隔符更改为任何其他分隔符。

UNIPR_NIRI_PARDP/tUNIPR_NIRI_PARDP_1/tgeneProt/tProtein nirI/tFMN-binding domain protein/torthographic/tFMN-binding domain-containing protein/torthographic/tunknown/torthographic/tFMN-binding/torthographic/tPden_2486/torthographic/tnirI/torthographic/tuniprot_ac/tQ51699/tspeciesNameNCBI/t318586

First, parse your xml file to obtain a file(s) that contain all of the info you need.

For example, if you want to have just one table with the attributes clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val) then you just need one file that has these attributes (separated with some character). Each line in file would correspond to each row in table.

Next, you create table schema in Postgresql. Then use Postgresql's COPY command, which copies all of the data from file to table.

Note that if your xml file is huge you should use event based parser. Something like SAX, StAX in Java for example.

EDIT
*NOTE*: libraries used: stax2-api-3.1.1.jar, woodstox-core-asl-4.1.1jar
Here is the code (hopefully it does what you need, if not I am sure it gets you started):

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package test;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.net.MalformedURLException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import java.util.ArrayList;
import org.codehaus.stax2.XMLInputFactory2;
import org.codehaus.stax2.XMLStreamReader2;

public class Main {

    /**
     * @param args the command line arguments
     */

    /*
     * dc(att), dc(val)
     */
    @SuppressWarnings("CallToThreadDumpStack")
    public static void main(String[] args) throws MalformedURLException, IOException, XMLStreamException {
        FileInputStream fstream = new FileInputStream(args[0]);
        Reader in = new InputStreamReader(fstream, "UTF-8");
        XMLInputFactory2 factory = (XMLInputFactory2) XMLInputFactory.newInstance();
        XMLStreamReader2 parser = (XMLStreamReader2) factory.createXMLStreamReader(in);

        FileOutputStream outStream = new FileOutputStream("/home/aseke/Desktop/out.txt");
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(outStream, "UTF-8"));


        boolean isCluster = false;
        ArrayList<String> dc = new ArrayList<String>();
        ArrayList<String> variants = new ArrayList<String>();

        /* You actually do not need all of these variables, it's just for clarity */
        String clsID = null;
        String semType = null;
        String varWritten = null;
        String varType = null;
        String entryID = null;
        String baseForm = null;
        String dcAtt = null;
        String dcVal = null;
        String s = null;
        while (true) {
            int event = parser.next();
            if (event == XMLStreamConstants.END_DOCUMENT) {
                parser.close();
                break;
            }

            if (event == XMLStreamConstants.START_ELEMENT) {
                String tag = parser.getLocalName();

                if (tag.equals("Cluster")) {
                    isCluster = true;
                    clsID = parser.getAttributeValue(0);
                    semType = parser.getAttributeValue(1);
                } else if (tag.equals("Entry") && isCluster) {
                    entryID = parser.getAttributeValue(0);
                    baseForm = parser.getAttributeValue(1);
                } else if (tag.equals("Variant") && isCluster) {

                    varWritten = parser.getAttributeValue(0);
                    varType = parser.getAttributeValue(1);

                    variants.add(varWritten + "~" + varType);
                } else if (tag.equals("DC") && isCluster) {
                    dcAtt = parser.getAttributeValue(0);
                    dcVal = parser.getAttributeValue(1);

                    dc.add(dcAtt + "~" + dcVal);
                }
            }

            if (event == XMLStreamConstants.END_ELEMENT && isCluster) {
                if (parser.getLocalName().equals("Cluster")) {
                    isCluster = false;
                    //clsid, entryid, semType, baseForm, variant(writtenform), variant(type), dc(att), dc(val)
                    // Use tabs as delimiter for Postgre COPY
                    String outStr = clsID + "/t" + entryID + "/t" + semType + "/t" + baseForm + "/t";

                    /* Add all variants */
                    for (String var : variants) {
                        String tmp[] = var.split("~");
                        varWritten = tmp[0];
                        varType = tmp[1];
                        outStr += varWritten + "/t" + varType + "/t";
                    }
                    /* Add al DCs */
                    for (String ss : dc) {
                        String[] tmp = ss.split("~");
                        dcAtt = tmp[0];
                        dcVal = tmp[1];
                        outStr += dcAtt + "/t" + dcVal + "/t";
                    }
                    // remove last tab "\t"
                    outStr = outStr.substring(0, outStr.length() - 2);
                    out.write(outStr);
                    variants.clear();
                    dc.clear();

                }
            }
        }

        // close all streams
        fstream.close();
        out.close();
        outStream.close();
    }
}

I formatted you input xml. So input file looks like this:

<Cluster clsId="UNIPR_NIRI_PARDP" semType="geneProt">
    <Entry entryId="UNIPR_NIRI_PARDP_1" baseForm="Protein nirI" type="PREFERRED">
        <Variant WRITTENFORM="FMN-binding domain protein" type="orthographic"/>
        <Variant WRITTENFORM="FMN-binding domain-containing protein" type="orthographic"/>
        <Variant WRITTENFORM="unknown" type="orthographic"/>
        <Variant WRITTENFORM="FMN-binding" type="orthographic"/>
        <Variant WRITTENFORM="Pden_2486" type="orthographic"/>
        <Variant WRITTENFORM="nirI" type="orthographic"/>
        <SourceDC sourceName="BioThesaurus" sourceId="Q51699"/>
        <PosDC posName="POS" pos="N"/>
        <DC att="uniprot_ac" val="Q51699"/>
        <DC att="speciesNameNCBI" val="318586"/>
    </Entry>
</Cluster>

Output looks like this. Note that it is delimited with tabs. Tabs will be later used as a delimiter in Postgre COPY command. You can change delimiter to any other.

UNIPR_NIRI_PARDP/tUNIPR_NIRI_PARDP_1/tgeneProt/tProtein nirI/tFMN-binding domain protein/torthographic/tFMN-binding domain-containing protein/torthographic/tunknown/torthographic/tFMN-binding/torthographic/tPden_2486/torthographic/tnirI/torthographic/tuniprot_ac/tQ51699/tspeciesNameNCBI/t318586

回复收藏 0 原文

贵在坚持 2024-11-07 16:56:31

我在 noko-giri 和 open-uri 包的帮助下使用 Ruby 完成了此操作。因为我的输入文件太大了。许多解析器都失败了，noko-giri 在这方面提供了帮助。

我用三列提供了答案：baseForm-variant(writingform)-dc(val)。该信息可能是该问题的清晰信息。

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::XML(File.open("xai"))
ent = doc.xpath("//Entry")

value = String.new
ent.each do |e| 
    d = e.xpath("DC")   
    d.each do |f|       
        if f.attributes["att"].to_s =~ /uniprot_ac/
            value = f.attributes["val"].to_s
        end
    end
    f = e.xpath("Variant")  
    f.each do |g|
        puts "#{e.attributes["baseForm"].to_s}\t" + "#{g.attributes["WRITTENFORM"].to_s}\t" + "#{value}"
    end 
end

I did this using Ruby with help of packages noko-giri and open-uri. As, my input file is so large in size. Many of the parser got failed and noko-giri helped in this.

I provided the answer with three columns, baseForm-variant(writtenform)-dc(val). This information may be a crisp information for the question.

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::XML(File.open("xai"))
ent = doc.xpath("//Entry")

value = String.new
ent.each do |e| 
    d = e.xpath("DC")   
    d.each do |f|       
        if f.attributes["att"].to_s =~ /uniprot_ac/
            value = f.attributes["val"].to_s
        end
    end
    f = e.xpath("Variant")  
    f.each do |g|
        puts "#{e.attributes["baseForm"].to_s}\t" + "#{g.attributes["WRITTENFORM"].to_s}\t" + "#{value}"
    end 
end

回复收藏 0 原文

~没有更多了~