Escaping non-ASCII characters (or how to remove the BOM?)

Posted on 2024-08-22 05:04:18

I need to create an ANSI text file from an Access recordset that outputs to JSON and YAML. I can write the file, but the output is coming out with the original characters, and I need to escape them. For example, an umlaut-O (ö) should be "\u00f6".

I thought encoding the file as UTF-8 would work, but it doesn't. However, having looked at the file coding again, if you write "UTF-8 without BOM" then everything works.

Does anyone know how to either

a) Write text out as UTF-8 without BOM, or
b) Write in ANSI but escaping the non-ASCII characters?

Public Sub testoutput()

Set db = CurrentDb()

str_filename = "anothertest.json"
MyFile = CurrentProject.Path & "\" & str_filename
str_temp = "Hello world here is an ö"

fnum = FreeFile

Open MyFile For Output As fnum
Print #fnum, str_temp
Close #fnum

End Sub
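
Option (b) can be sketched in VBA roughly as follows. `EscapeNonAscii` is a hypothetical helper name, not something built into Access or VBA, and the sketch only handles characters above code point 127; it does not escape quotes, backslashes, or control characters, which a full JSON writer would also need to do.

```vba
' Hypothetical helper: returns strText with every character outside
' 7-bit ASCII replaced by a JSON-style \uXXXX escape.
Public Function EscapeNonAscii(ByVal strText As String) As String
    Dim i As Long
    Dim lngCode As Long
    Dim strOut As String

    For i = 1 To Len(strText)
        ' AscW returns a signed Integer; mask it into the range 0..65535
        lngCode = AscW(Mid$(strText, i, 1)) And &HFFFF&
        If lngCode > 127 Then
            ' Pad the hex value to four digits, e.g. F6 becomes 00f6
            strOut = strOut & "\u" & LCase$(Right$("000" & Hex$(lngCode), 4))
        Else
            strOut = strOut & Mid$(strText, i, 1)
        End If
    Next i

    EscapeNonAscii = strOut
End Function
```

With this in place, `Print #fnum, EscapeNonAscii(str_temp)` writes only ASCII characters, so the ANSI encoding of the file no longer matters. In YAML, `\u` escapes are only interpreted inside double-quoted scalars, so the escaped value would need to be quoted there.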

┼── 2024-08-29 05:04:18

OK, I found some example code on how to remove the BOM. I would have thought it would be possible to do this more elegantly when actually writing the text in the first place. Never mind. The following code removes the BOM.

(This was originally posted by Simon Pedersen at http://www.imagemagick.org/discourse-server/viewtopic.php?f=8&t=12705)

' Removes the Byte Order Mark (BOM) from a text file with UTF-8 encoding.
' The BOM marks the file as having been stored with UTF-8 encoding.

Public Function RemoveBOM(filePath)

    ' Create a reader and a writer
    Dim writer, reader
    Set writer = CreateObject("Adodb.Stream")
    Set reader = CreateObject("Adodb.Stream")

    ' Load from the text file we just wrote
    reader.Open
    reader.LoadFromFile filePath

    ' Copy all data from reader to writer, except the BOM
    writer.Mode = 3     ' adModeReadWrite
    writer.Type = 1     ' adTypeBinary
    writer.Open
    reader.Position = 5
    reader.CopyTo writer, -1

    ' Overwrite file
    writer.SaveToFile filePath, 2   ' adSaveCreateOverWrite

    ' Return file name
    RemoveBOM = filePath

    ' Close the streams and release the objects
    writer.Close
    reader.Close
    Set writer = Nothing
    Set reader = Nothing
End Function

It might be useful for someone else.
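
For the "more elegant" route of never writing the BOM in the first place, a commonly used pattern is to encode the text in one ADODB.Stream and copy everything after the first three bytes into a second, binary stream. The sketch below assumes ADODB is available for late binding; `WriteUtf8NoBom` and its parameter names are made up for illustration.

```vba
' Hypothetical helper: writes strText to strPath as UTF-8 without a BOM.
Public Sub WriteUtf8NoBom(ByVal strPath As String, ByVal strText As String)
    Dim stmText As Object
    Dim stmBinary As Object

    ' Encode the text as UTF-8 in memory; the text stream starts with a BOM
    Set stmText = CreateObject("ADODB.Stream")
    stmText.Type = 2                ' adTypeText
    stmText.Charset = "utf-8"
    stmText.Open
    stmText.WriteText strText

    ' Skip the 3-byte BOM and copy the remaining bytes to a binary stream
    stmText.Position = 3
    Set stmBinary = CreateObject("ADODB.Stream")
    stmBinary.Type = 1              ' adTypeBinary
    stmBinary.Open
    stmText.CopyTo stmBinary

    stmBinary.SaveToFile strPath, 2 ' adSaveCreateOverWrite

    stmBinary.Close
    stmText.Close
End Sub
```

A call such as `WriteUtf8NoBom CurrentProject.Path & "\anothertest.json", str_temp` would then replace the `Open` / `Print #` / `Close` sequence from the question.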

鹤仙姿 2024-08-29 05:04:18

Late to the game here, but I can't be the only coder who got fed up with my SQL imports being broken by text files with a Byte Order Marker. There are very few Stack questions that touch on the problem, and this is one of the closest, so I'm posting an overlapping answer here.

I say 'overlapping' because the code below is solving a slightly different problem to yours - the primary purpose is writing a Schema file for a folder with a heterogeneous collection of files - but the BOM-handling segment is clearly marked.

The key functionality is that we iterate through all the '.csv' files in a folder and test each file with a quick nibble of the first few bytes: we only strip out the Byte Order Marker if we see one.
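
Taken on its own, that quick nibble can be sketched as a small function. `BomLength` is a hypothetical name used only for this illustration, and the sketch recognises just the UTF-8 and UTF-16 marks; a UTF-32 little-endian file would be reported as UTF-16.

```vba
' Hypothetical helper: returns the length in bytes of a UTF-8 or UTF-16
' byte order mark at the start of the file, or 0 if none is found.
Public Function BomLength(ByVal strPath As String) As Long
    Dim hndFile As Integer
    Dim arrHead(0 To 3) As Byte

    ' Read the first four bytes of the file
    hndFile = FreeFile
    Open strPath For Binary Access Read As #hndFile
    Get #hndFile, 1, arrHead
    Close #hndFile

    If arrHead(0) = &HEF And arrHead(1) = &HBB And arrHead(2) = &HBF Then
        BomLength = 3          ' UTF-8
    ElseIf (arrHead(0) = &HFF And arrHead(1) = &HFE) _
        Or (arrHead(0) = &HFE And arrHead(1) = &HFF) Then
        BomLength = 2          ' UTF-16, little- or big-endian
    End If
End Function
```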

After that, we're working in low-level file-handling code from the primordial C. We have to, all the way down to using byte arrays, because everything else that you do in VBA will deposit the Byte Order Markers embedded in the structure of a string variable.

So, without further adodb, here's the code:

BOM-Disposal code for text files in a schema.ini file:

Public Sub SetSchema(strFolder As String)
On Error Resume Next

' Write a Schema.ini file to the data folder.

' This is necessary if we do not have the registry privileges to set the
' correct 'ImportMixedTypes=Text' registry value, which overrides IMEX=1

' The code also checks for ANSI or UTF-8 and UTF-16 files, and applies a
' usable setting for CharacterSet ( UNICODE|ANSI ) with a horrible hack.

' OEM codepage-defined text is not supported: further coding is required

' ...And we strip out Byte Order Markers, if we see them - the OLEDB SQL
' provider for textfiles can't deal with a BOM in a UTF-16 or UTF-8 file

' Not implemented: handling tab-delimited files or other delimiters. The
' code assumes a header row with columns, specifies 'scan all rows', and
' imposes 'read the column as text' if the data types are mixed.

Dim strSchema   As String
Dim strFile     As String
Dim strBOM      As String
Dim hndFile     As Long
Dim arrFile()   As Byte
Dim arrBytes(0 To 4) As Byte

If Right(strFolder, 1) <> "\" Then strFolder = strFolder & "\"

' Dir() is an iterator function when you call it with a wildcard:

strFile = VBA.FileSystem.Dir(strFolder & "*.csv")

Do While Len(strFile) > 0

    Erase arrBytes      ' zero the buffer so a short file can't reuse old bytes
    hndFile = FreeFile
    Open strFolder & strFile For Binary As #hndFile
    Get #hndFile, , arrBytes
    Close #hndFile

    strSchema = strSchema & "[" & strFile & "]" & vbCrLf
    strSchema = strSchema & "Format=CSVDelimited" & vbCrLf
    strSchema = strSchema & "ImportMixedTypes=Text" & vbCrLf
    strSchema = strSchema & "MaxScanRows=0" & vbCrLf

    If arrBytes(2) = 0 Or arrBytes(3) = 0 Then   ' this is a hack
        strSchema = strSchema & "CharacterSet=UNICODE" & vbCrLf
    Else
        strSchema = strSchema & "CharacterSet=ANSI" & vbCrLf
    End If

    strSchema = strSchema & "ColNameHeader = True" & vbCrLf
    strSchema = strSchema & vbCrLf


    ' BOM disposal - Byte order marks confuse OLEDB text drivers:

    ' Build the BOM as a raw byte string: ChrB$ yields one byte per call,
    ' whereas arrBytes(0) & arrBytes(1) would concatenate decimal digits.
    strBOM = ""
    If (arrBytes(0) = &HFE And arrBytes(1) = &HFF) _
    Or (arrBytes(0) = &HFF And arrBytes(1) = &HFE) Then
        strBOM = ChrB$(arrBytes(0)) & ChrB$(arrBytes(1))
    ElseIf arrBytes(0) = &HEF And arrBytes(1) = &HBB And arrBytes(2) = &HBF Then
        strBOM = ChrB$(arrBytes(0)) & ChrB$(arrBytes(1)) & ChrB$(arrBytes(2))
    End If

    If LenB(strBOM) > 0 Then

        hndFile = FreeFile
        Open strFolder & strFile For Binary As #hndFile
        ReDim arrFile(0 To LOF(hndFile) - 1)
        Get #hndFile, , arrFile
        Close #hndFile

        BigReplace arrFile, strBOM, ""

        ' Binary mode never truncates, so empty the file before rewriting it
        hndFile = FreeFile
        Open strFolder & strFile For Output As #hndFile
        Close #hndFile

        hndFile = FreeFile
        Open strFolder & strFile For Binary As #hndFile
        Put #hndFile, , arrFile
        Close #hndFile
        Erase arrFile

    End If


    strFile = ""
    strFile = Dir

Loop

If Len(strSchema) > 0 Then

    strFile = strFolder & "Schema.ini"

    hndFile = FreeFile
    Open strFile For Binary As #hndFile
    Put #hndFile, , strSchema
    Close #hndFile

End If


End Sub


Public Sub BigReplace(ByRef arrBytes() As Byte, ByRef SearchFor As String, ByRef ReplaceWith As String)
On Error Resume Next

' Replaces SearchFor with ReplaceWith when it occurs at the very start of
' the byte array. Both strings are treated as raw bytes: the B-suffixed
' string functions count bytes, not characters.

Dim strData As String

strData = arrBytes      ' raw copy: the bytes are not converted

If InStrB(1, strData, SearchFor, vbBinaryCompare) = 1 Then
    arrBytes = ReplaceWith & MidB$(strData, LenB(SearchFor) + 1)
End If

End Sub

The code's easier to understand if you know that a Byte Array can be assigned to a VBA.String, and vice versa. The BigReplace() function is a hack that sidesteps some of VBA's inefficient string-handling, especially allocation: you'll find that large files cause serious memory and performance problems if you do it any other way.
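
The byte-array-to-string assignment mentioned above can be seen in a few lines. This is a stand-alone illustration with a made-up procedure name, not part of the schema code:

```vba
Public Sub ByteArrayStringDemo()
    Dim arrBytes() As Byte
    Dim strText As String

    ' String -> byte array: each character becomes two bytes (UTF-16LE),
    ' so "AB" yields the four bytes 65, 0, 66, 0
    arrBytes = "AB"
    Debug.Print UBound(arrBytes) - LBound(arrBytes) + 1
    Debug.Print arrBytes(0); arrBytes(1); arrBytes(2); arrBytes(3)

    ' Byte array -> string: the bytes are copied as-is, not converted
    strText = arrBytes
    Debug.Print strText
End Sub
```

Because the copy is raw in both directions, the bytes of a file read into a byte array pass through a String unchanged, which is what makes the byte-oriented functions (`InStrB`, `MidB$`, `LenB`) usable on file content.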
