Iterating over a directory and extracting only file names, without reading the payload

Posted on 2025-01-31 10:21:54 · 3190 characters · 5 views · 0 comments

I am using Mule 4.4 Community Edition on premises.
Thanks to earlier help, I have been able to read and process a large file without exhausting memory, which is all good (here).

Now building on this further - my use case is to read all .csv files from within a directory.
And then process them one by one:

\opt\out\
         students.csv
         teachers.csv
         collesges.csv
         ....

So my plan was to list the files in the directory:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <non-repeatable-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

Then I wanted to read only the file names from the directory, without reading the payload.

According to this early access article, we are advised to use <non-repeatable-iterable />. However, when I try to extract the attributes after the List operation, as the article describes:

<set-payload doc:name="Set Payload"  value="#[output application/json --- payload map $.attributes]"/>

No attributes are available. (My plan is to extract the file names, run a for loop over them, and use a choice condition on each name: if the file name contains student, use the student transformer; if it contains teacher, use the teacher transformer; and so on.)

However, because the attributes are not available, I cannot pass the file names to the for loop (yet to be written).

So I changed from <non-repeatable-iterable /> to <repeatable-in-memory-iterable />

Code below:

<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <repeatable-in-memory-iterable />
    <sftp:matcher filenamePattern="#['*.csv' ]"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

With this change, I can extract the attributes, including the file names.

I am confused about the following:

  1. The files to be processed in the above directory will be large (about 700 MB each). Will listing the directory with <repeatable-in-memory-iterable /> cause memory issues? (I do not want to read the file contents at this stage, only get the file names.)

Here is the complete flow so far (note: it does not yet contain the real for-loop logic to iterate over the files, which I will plug in):

<flow name="employee-process-flow">
    <http:listener doc:name="Listener"  config-ref="HTTP_Listener_config" path="/processFiles"/>
    <set-variable value='#[now() as String { format: "ddMMuu" }]' doc:name="Set todays date as ddmmyy" doc:id="c6a91a41-65b1-46df-a720-9c13fe360b6b" variableName="today"/>

    <sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
        <repeatable-in-memory-iterable />
        <sftp:matcher filenamePattern="#['*.csv' ]"
                      directories="EXCLUDE" symLinks="EXCLUDE" />
    </sftp:list>

    <set-payload doc:name="Set Payload" value="#[output application/json --- payload map $.attributes]"/>
    <foreach doc:name="For Each" >
        <logger level="INFO" doc:name="Logger"  message="we are here"/>
    </foreach>

</flow>


Comments (1)

掩饰不了的爱 2025-02-07 10:21:54


The List operation returns a list of messages, and each has a payload and attributes. The content of the files is returned as the payload, in a lazy way, meaning that the file's content is read only if you try to access that element's payload.

It makes sense that if you use <non-repeatable-iterable /> and don't access the payload of each item in the <foreach>, then you should not have any memory issues, because the contents are never read.
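
A minimal sketch of that idea, reusing the `SFTP_Config` and directory from the question. It assumes that `<foreach>` exposes each listed message's `attributes` (with a `fileName` field) the way it does for the other file connectors; I have not run it against an SFTP server, so treat it as a starting point rather than a verified fix:

```xml
<sftp:list doc:name="List" config-ref="SFTP_Config" directoryPath="/opt/out">
    <non-repeatable-iterable />
    <sftp:matcher filenamePattern="#['*.csv']"
                  directories="EXCLUDE" symLinks="EXCLUDE" />
</sftp:list>

<!-- Iterate the listed messages directly. Only attributes are referenced,
     so nothing in the loop asks for the file content. -->
<foreach doc:name="For Each">
    <logger level="INFO" doc:name="Logger" message="#[attributes.fileName]"/>
</foreach>
```

Note that a non-repeatable iterable can be consumed only once, so anything placed between the List operation and the `<foreach>` that walks the list would leave the loop with nothing to iterate.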

With in-memory repeatable streaming, it is possible that the entire payload is read into memory. Try it with a file a few gigabytes in size and see what happens.

I'm not sure what the problem is with the attributes. It should work the same in any streaming mode.

Note that if you plan on doing something with the attributes, other than printing them, then you should output to application/java instead of JSON, to avoid unneeded conversions to and from JSON. For example, in your flow the output is used as input for the <foreach>, so it would be better for it to be Java.

Example:
output application/java --- payload map $.attributes
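
Put together with the routing described in the question, this could look roughly like the sketch below. The flow names `student-transformer-flow` and `teacher-transformer-flow` are hypothetical placeholders for whatever does the actual processing, and `fileName` is assumed to be the attribute that holds the file's name:

```xml
<set-payload doc:name="Set Payload"
             value="#[output application/java --- payload map $.attributes]"/>

<!-- Each iteration's payload is one attributes object, not the file content. -->
<foreach doc:name="For Each">
    <choice doc:name="Route by file name">
        <when expression="#[payload.fileName contains 'student']">
            <!-- hypothetical flow name -->
            <flow-ref name="student-transformer-flow"/>
        </when>
        <when expression="#[payload.fileName contains 'teacher']">
            <!-- hypothetical flow name -->
            <flow-ref name="teacher-transformer-flow"/>
        </when>
        <otherwise>
            <logger level="WARN" doc:name="Logger"
                    message="#['No transformer for ' ++ payload.fileName]"/>
        </otherwise>
    </choice>
</foreach>
```

Keeping the list as Java objects means the `<foreach>` works on the attributes directly instead of parsing a JSON document first.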
