How do I read lines from a text file when the line break is "\r" rather than "\n"?

Posted on 2024-11-03 10:23:48

I have a massive .txt file with a list of tens of thousands of adjectives. In the text file, each word is on its own line. I read it into a list (which I then put into an array using Array.of_list) with the following function:

let read_file filename =
  let lines = ref [] in
  let chan = open_in filename in
  try
    while true do
      lines := input_line chan :: !lines
    done;
    []
  with End_of_file ->
    close_in chan;
    List.rev !lines

But it's not working because the line breaks are represented with \r and not \n. I end up with a list with one element that basically looks like this: ["abacinate\rabandon\rabase\rabash\rabate\rabbreviate\rabdicate"]

What is the best way to change the line breaks from \r to \n? Or is there a way to read in the text file so that I can tell it to make a new element in the list when it gets to \r?

Comments (2)

霓裳挽歌倾城醉 2024-11-10 10:23:48

Well, you could certainly do some sort of substitution with a regex in OCaml: for example, you could read the whole file into a string and do the substitution there. However, if your text file doesn't change (and I'm guessing it doesn't in this case, as it's just a big list of adjectives), I would use my text editor's search-and-replace facilities to do the replacement in the text file itself, as opposed to trying to do it in your OCaml program.

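If you have dos2unix installed you could use that to do the translation; for a file whose line breaks are a bare \r, the relevant mode is its Mac conversion (the mac2unix command, or dos2unix -c mac). You could also use something like this: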

perl -pi -e 's/\r/\n/g' filename

...using this approach means you change the file once and you're done with it, as opposed to always doing the substitution in your program, which takes a little extra time on every run.
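
If you do want the in-program route mentioned at the start of this answer, here is a minimal sketch. It assumes OCaml 4.14 or later (for the In_channel module), and it splits on the \r character with String.split_on_char instead of a regex. The names split_cr_lines and read_cr_file are made up for illustration.

```ocaml
(* Split a string on CR characters. A trailing CR would otherwise leave one
   empty string at the end of the list, so that final element is dropped. *)
let split_cr_lines (contents : string) : string list =
  let parts = String.split_on_char '\r' contents in
  match List.rev parts with
  | "" :: rest -> List.rev rest
  | _ -> parts

(* Read the whole file in binary mode (no newline translation), then split.
   Assumes OCaml >= 4.14 for In_channel.with_open_bin and input_all. *)
let read_cr_file (filename : string) : string list =
  In_channel.with_open_bin filename (fun ic ->
    split_cr_lines (In_channel.input_all ic))
```

The result can then be passed to Array.of_list as in the question, for example Array.of_list (read_cr_file "adjectives.txt"), where the file name is only a placeholder.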

挽清梦 2024-11-10 10:23:48

Technically, if your file has \r-separated records and not \n-separated records, it's not a text file made of lines. It's a file in some other format, which happens to be the text format of some other platform. So converting the file to a text file is the obvious solution.

If you need your program to cope with other newline conventions, you'll have to write a replacement for input_line, because it has the native notion of a line built in (i.e. LF on Unix, CR on Mac OS before OS X, CR LF on DOS and Windows).
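
One possible shape for such a replacement is sketched below. It reads one character at a time and accepts CR, LF, or CR LF as a line terminator. The names input_line_any and read_lines_any are hypothetical, and the sketch assumes a seekable channel (a regular file opened with open_in_bin), because it steps back one byte after a lone CR.

```ocaml
(* Like input_line, but treats CR, LF, or CR LF as the end of a line.
   Raises End_of_file only when no characters are left at all. *)
let input_line_any (chan : in_channel) : string =
  let buf = Buffer.create 80 in
  let rec loop () =
    match input_char chan with
    | '\n' -> ()
    | '\r' ->
      (* Swallow the LF of a CR LF pair; otherwise un-read the character.
         Assumption: the channel supports seek_in (a regular file). *)
      (match input_char chan with
       | '\n' -> ()
       | _ -> seek_in chan (pos_in chan - 1)
       | exception End_of_file -> ())
    | c -> Buffer.add_char buf c; loop ()
    | exception End_of_file ->
      (* A final line without a terminator is still returned. *)
      if Buffer.length buf = 0 then raise End_of_file
  in
  loop ();
  Buffer.contents buf

(* Collect every line of a file, in order. *)
let read_lines_any (filename : string) : string list =
  let chan = open_in_bin filename in
  let rec collect acc =
    match input_line_any chan with
    | line -> collect (line :: acc)
    | exception End_of_file -> close_in chan; List.rev acc
  in
  collect []
```

Unlike reading the whole file into memory, this keeps only the current line in a buffer while reading, which is the main reason to prefer it for very large files.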

Since you're reading the whole file into memory anyway, you can read it all into a Buffer. Note that Buffer.add_channel won't work unless you know the file size in advance (and then you might as well read it into a string). Untested:

(* Requires the str library. *)
let input_until_eof (chan : in_channel) : string =
  let buf = Buffer.create 10000 and tmp = Bytes.create 4096 in
  (* input returns the number of bytes actually read, 0 at end of file. *)
  let n = ref (input chan tmp 0 (Bytes.length tmp)) in
  while !n <> 0 do
    Buffer.add_subbytes buf tmp 0 !n;
    n := input chan tmp 0 (Bytes.length tmp)
  done;
  Buffer.contents buf
(* Matches CR LF, a lone CR, or a lone LF. *)
let tolerant_newline_regexp = Str.regexp "\r\n?\\|\n"
let input_all_lines chan : string list =
  Str.split tolerant_newline_regexp (input_until_eof chan)

If you're going to do further parsing on the file contents, use the Stream module or ocamllex.
