Oracle:将 varchar2 字段中的 xml 实体转换为 utf-8 字符

发布于 2025-01-01 17:15:45 字数 437 浏览 0 评论 0原文

我在表中有一个字段,它保存特殊字符的 XML 实体,因为该表采用 latin-1 格式。 例如“Hallöle slovenčina”(“ö”在 latin-1 中,但“slovenčina”中的“č”必须通过某个存储该实体的应用程序转换为实体值到数据库中)

现在我需要通过将 XML 实体转换为其原始字符来将表导出到 utf-8 编码文件中。

Oracle 中是否有一个函数可以为我处理这个问题,或者我真的需要为此创建一个巨大的键/值映射?

非常感谢任何帮助。

编辑:我找到了函数 DBMS_XMLGEN.convert,但它仅适用于 <>& >。不在 &#NNN; 上:-(

I have a field in a table which holds XML entities for special characters, since the table is in latin-1.
E.g. "Hallöle slovenčina" (the "ö" is in latin-1, but the "č" in "slovenčina" had to be converted to an entity by some application that stores the values into the database)

Now I need to export the table into a utf-8 encoded file by converting the XML entities to their original characters.

Is there a function in Oracle that might handle this for me, or do I really need to create a huge key/value map for that?

Any help is greatly appreciated.

EDIT: I found the function DBMS_XMLGEN.convert, but it only works on <,> and &. Not on &#NNN; :-(

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

梨涡 2025-01-08 17:15:46

我认为 dbms_xmlgen 的问题在于技术上只有五个 XML 实体。您的示例有一个数字 HTML 实体,它与 Unicode 相对应:

http: //theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=0100&end=017F

Oracle 有一个函数 UNISTR 在这里很有帮助:

select unistr('sloven\010dina') from dual;

在上面的示例中,我已将 269 转换为其十六进制等效值 010d(在 Unicode 中为 U+010D)。但是,您可以传递一个十进制数并进行如下转换:

select unistr('sloven\' || replace(to_char(269, 'xxx'), ' ', '0') || 'ina') from dual;

编辑:PL/SQL 解决方案:

这是我为您准备的一个示例。这应该循环并替换您从表中选择的每一行的所有出现情况。

create table html_entities (
    id NUMBER(3),
    text_row VARCHAR2(100)
);

INSERT INTO html_entities 
VALUES (1, 'Hallöle slovenčina Ċ ú');

INSERT INTO html_entities 
VALUES (2, 'I like the letter Ċ');

INSERT INTO html_entities 
VALUES (3, 'Nothing to change here.');

DECLARE
    v_replace_str NVARCHAR2(1000);
    v_fh UTL_FILE.FILE_TYPE;       
BEGIN
    --v_fh := utl_file.fopen_nchar(LOCATION IN VARCHAR2, FILENAME IN VARCHAR2, OPEN_MODE IN VARCHAR2, MAX_LINESIZE IN BINARY_INTEGER);

    FOR v_rec IN (select id, text_row from html_entities) LOOP
        v_replace_str := v_rec.text_row;
        WHILE (REGEXP_INSTR(v_replace_str, '&#[0-9]+;') <> 0) LOOP
            v_replace_str := REGEXP_REPLACE(
                v_replace_str, 
                '&#([0-9]+);',
                unistr('\' || replace(to_char(to_number(regexp_replace(v_replace_str, '.*?&#([0-9]+);.*

请注意,我已对 UTL_FILE 函数的调用进行了存根,以将 NVARCHAR 行(Oracle 的扩展字符集)写入数据库服务器上的文件。 dbms_output 虽然非常适合调试,但似乎不支持扩展字符,但如果您使用 UTL_FILE 写入文件,这应该不是问题。这是 DBMS_OUTPUT:

Hallöle slovencina C ú
I like the letter C
Nothing to change here.
, '\1')), 'xxx'), ' ', '0')), 1, 1 ); END LOOP; -- utl_file.put_line_nchar(v_fh, v_replace_str); dbms_output.put_line(v_replace_str); END LOOP; --utl_file.fclose(v_fh); END; /

请注意,我已对 UTL_FILE 函数的调用进行了存根,以将 NVARCHAR 行(Oracle 的扩展字符集)写入数据库服务器上的文件。 dbms_output 虽然非常适合调试,但似乎不支持扩展字符,但如果您使用 UTL_FILE 写入文件,这应该不是问题。这是 DBMS_OUTPUT:

I believe the problem with dbms_xmlgen is that there are technically only five XML entities. Your example has a numeric HTML entity, which corresponds with Unicode:

http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=0100&end=017F

Oracle has a function UNISTR which is helpful here:

select unistr('sloven\010dina') from dual;

I've converted 269 to its hex equivalent 010d in the example above (in Unicode it is U+010D). However, you could pass a decimal number and do a conversion like this:

select unistr('sloven\' || replace(to_char(269, 'xxx'), ' ', '0') || 'ina') from dual;

EDIT: The PL/SQL solution:

Here's an example I've rigged up for you. This should loop over and replace any occurrences for each row you select out of your table(s).

create table html_entities (
    id NUMBER(3),
    text_row VARCHAR2(100)
);

INSERT INTO html_entities 
VALUES (1, 'Hallöle slovenčina Ċ ú');

INSERT INTO html_entities 
VALUES (2, 'I like the letter Ċ');

INSERT INTO html_entities 
VALUES (3, 'Nothing to change here.');

DECLARE
    v_replace_str NVARCHAR2(1000);
    v_fh UTL_FILE.FILE_TYPE;       
BEGIN
    --v_fh := utl_file.fopen_nchar(LOCATION IN VARCHAR2, FILENAME IN VARCHAR2, OPEN_MODE IN VARCHAR2, MAX_LINESIZE IN BINARY_INTEGER);

    FOR v_rec IN (select id, text_row from html_entities) LOOP
        v_replace_str := v_rec.text_row;
        WHILE (REGEXP_INSTR(v_replace_str, '&#[0-9]+;') <> 0) LOOP
            v_replace_str := REGEXP_REPLACE(
                v_replace_str, 
                '&#([0-9]+);',
                unistr('\' || replace(to_char(to_number(regexp_replace(v_replace_str, '.*?&#([0-9]+);.*

Notice that I've stubbed in calls to the UTL_FILE function to write NVARCHAR lines (Oracle's extended character set) to a file on the database server. The dbms_output, while great for debugging, doesn't seem to support extended characters, but this shouldn't be a problem if you use UTL_FILE to write to a file. Here's the DBMS_OUTPUT:

Hallöle slovencina C ú
I like the letter C
Nothing to change here.
, '\1')), 'xxx'), ' ', '0')), 1, 1 ); END LOOP; -- utl_file.put_line_nchar(v_fh, v_replace_str); dbms_output.put_line(v_replace_str); END LOOP; --utl_file.fclose(v_fh); END; /

Notice that I've stubbed in calls to the UTL_FILE function to write NVARCHAR lines (Oracle's extended character set) to a file on the database server. The dbms_output, while great for debugging, doesn't seem to support extended characters, but this shouldn't be a problem if you use UTL_FILE to write to a file. Here's the DBMS_OUTPUT:

哽咽笑 2025-01-08 17:15:46

您也可以只使用国际化包:

UTL_I18N.unescape_reference('文本')

非常适合将这些 html 实体更改为普通字符(例如将数据库从 iso 8859P1 移动到 UTF-8 后的清理)

You can also just use the internationalization package :

UTL_I18N.unescape_reference ('text')

Works great in changing those html entities to normal characters (such as cleanup after moving a database from iso 8859P1 to UTF-8)

九歌凝 2025-01-08 17:15:46

这可能应该在 PL/SQL 中完成,我不知道,但我想看看用纯 SQL 能达到什么程度。这只会替换第一次出现的代码,因此您必须以某种方式运行它多次。

select regexp_replace(s, '&#([0-9]+);', u) from
(select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from
(select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from
(select 'Hallöle slovenčina' s from dual)))

或者可读性较差但更可用:

SELECT 
REGEXP_REPLACE(s, '&#([0-9]+);', unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(regexp_replace(s, '.*?&#([0-9]+);.*

此(更新的)版本正确替换了第一次出现的版本。您需要应用它,直到所有这些都被替换为止。

, '\1', 1, 1)), 'xxxx'), ' ', '')), 1, 1) FROM (SELECT 'Hallöle slovenčina č Ė' s FROM DUAL)

此(更新的)版本正确替换了第一次出现的版本。您需要应用它,直到所有这些都被替换为止。

This should probably be done in PL/SQL which I do not know, but I wanted to see how far I could get it with pure SQL. This only replaces the first occurence of the code, so you would have to somehow run it multiple times.

select regexp_replace(s, '&#([0-9]+);', u) from
(select s, unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(c), 'xxxx'), ' ', '')) u from
(select s, regexp_replace(s, '.*&#([0-9]+);.*', '\1') c from
(select 'Hallöle slovenčina' s from dual)))

Or less readable but more usable:

SELECT 
REGEXP_REPLACE(s, '&#([0-9]+);', unistr('\0' || REPLACE(TO_CHAR(TO_NUMBER(regexp_replace(s, '.*?&#([0-9]+);.*

This (updated) version correctly replaces the first occurrence. You need to apply it until all of them are replaced.

, '\1', 1, 1)), 'xxxx'), ' ', '')), 1, 1) FROM (SELECT 'Hallöle slovenčina č Ė' s FROM DUAL)

This (updated) version correctly replaces the first occurrence. You need to apply it until all of them are replaced.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文