Node.js large file upload to MongoDB blocks the Event Loop and Worker Pool
So I want to upload large CSV files to a MongoDB cloud database with a Node.js server using Express, Mongoose, and Multer's GridFS storage engine, but once the file upload starts, my database becomes unable to handle any other API requests. For example, if a different client requests a user from the database while the file is being uploaded, the server will receive the request and try to fetch the user from the MongoDB cloud, but the request will get stuck because the large file upload eats up all the computational resources. As a result, the GET request performed by the client will not return the user until the file upload in progress is completed.
I understand that if a thread takes a long time to execute a callback (Event Loop) or a task (Worker), it is considered "blocked", and that Node.js runs JavaScript code in the Event Loop while offering a Worker Pool to handle expensive tasks like file I/O. I've read in this blog post on nodejs.org that in order to keep your Node.js server speedy, the work associated with each client at any given time must be "small", and that my goal should be to minimize the variation in Task times. The reasoning behind this is that if a Worker's current Task is much more expensive than other Tasks, it will be unavailable to work on other pending Tasks, effectively decreasing the size of the Worker Pool by one until the Task is completed.
In other words, the client performing the large file upload is executing an expensive Task that decreases the throughput of the Worker Pool, in turn decreasing the throughput of the server. According to the aforementioned blog post, when each sub-Task completes it should submit the next sub-Task, and when the final sub-Task is done it should notify the submitter. This way, between each sub-Task of the long Task (the large file upload), the Worker can work on a sub-Task from a shorter Task, solving the blocking problem.
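As far as I can tell, the partitioning pattern the blog post describes looks roughly like this in plain JavaScript (a sketch adapted from its averaging example, not related to my upload code):
// Compute an average in partitioned sub-Tasks so that other pending events
// can run on the Event Loop between each step.
function asyncAvg(n, avgCB) {
    let sum = 0;
    function help(i, cb) {
        sum += i;
        if (i === n) {
            cb(sum);
            return;
        }
        // Submit the next sub-Task instead of looping synchronously.
        setImmediate(help.bind(null, i + 1, cb));
    }
    help(1, (s) => avgCB(s / n));
}
asyncAvg(1000000, (avg) => console.log('avg of 1-1000000: ' + avg));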
However, I do not know how to apply this solution to my actual upload code. Are there any specific partitioned functions that can solve this problem? Do I have to use a specific upload architecture, or a Node package other than multer-gridfs-storage, to upload my files? Please help.
Here is my current file upload implementation using Multer's GridFS storage engine:
// Required packages.
const express = require('express');
const multer = require('multer');
const { GridFsStorage } = require('multer-gridfs-storage');

const router = express.Router();

// Adjust how files get stored.
const storage = new GridFsStorage({
    // The DB connection.
    db: globalConnection,
    // The file's storage configuration.
    file: (req, file) => {
        ...
        // Return the file's data to the file property.
        return fileData;
    }
});

// Configure a strategy for uploading files.
const datasetUpload = multer({
    // Set the storage strategy.
    storage: storage,
    // Limit uploaded files to 300 MB.
    limits: { fileSize: 1024 * 1024 * 300 },
    // Set the file filter.
    fileFilter: fileFilter,
});

// Upload a dataset file.
router.post('/add/dataset', (req, res) => {
    // Begin the file upload.
    datasetUpload.single('file')(req, res, function (err) {
        // Report upload errors instead of silently ignoring them.
        if (err) {
            return res.status(500).send(err.message);
        }
        // Get the parsed file from multer.
        const file = req.file;
        // Upload success.
        return res.status(200).send(file);
    });
});
Comments (4)
So after a couple of days of research, I found out that the root of the problem wasn't Node.js or my file upload implementation. The problem was that MongoDB Atlas couldn't handle the file upload workload at the same time as other operations, such as fetching users from my database. As I stated in the question post, Node.js was receiving API calls from other clients as it should, but they weren't returning any results. I now realize that was because they were getting stuck at the DB level. Once I switched to a local deployment of MongoDB, the problem was resolved.
According to this blog post about MongoDB best practices, the total number of active threads (i.e., concurrent operations) relative to the number of CPUs can impact performance and therefore the throughput of the Node.js server. However, I tried using dedicated MongoDB clusters with up to 8 vCPUs (the M50 cluster package), and MongoDB Atlas still could NOT upload the file while handling other client requests.
If someone made it work with a cloud solution I'd like to know more. Thank you.
I think this problem is caused by the buffer. Because the buffer has to receive all of the chunks before the entire buffer is sent to the consumer, buffering takes a long time. Streams solve this problem: they allow us to process the data as soon as it arrives from the source, which is not possible when the data is buffered and processed all at once. I found the storage.fromStream() method on the multer-gridfs-storage GitHub page and tested it by uploading a 122 MB file. It worked for me: thanks to Node.js streams, every chunk of data is consumed and saved to the cloud database as soon as it is received. The total upload time was less than a minute, and the server could easily respond to other requests during the upload.
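For reference, a rough sketch of the fromStream approach, based on the example in the multer-gridfs-storage README (the file path, metadata, and globalConnection below are placeholders; in a real route the readable stream would come from the parsed multipart request):
const fs = require('fs');
const { GridFsStorage } = require('multer-gridfs-storage');

const storage = new GridFsStorage({ db: globalConnection });

// Readable stream for the CSV (placeholder path, for illustration only).
const readStream = fs.createReadStream('/path/to/dataset.csv');
// Minimal request and file metadata objects passed through to the storage engine.
const request = {};
const fileInfo = { originalname: 'dataset.csv', mimetype: 'text/csv' };

// Each chunk is written to GridFS as it arrives instead of being buffered first.
storage.fromStream(readStream, request, fileInfo)
    .then((storedFile) => console.log('Stored file:', storedFile))
    .catch((err) => console.error('Upload failed:', err));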
I was having a similar issue, and what I did to solve it (to some extent) was to implement multiple connections for MongoDB.
That way the upload operation is handled by a new MongoDB connection, and during the upload you can still query the database through another connection.
https://thecodebarbarian.com/slow-trains-in-mongodb-and-nodejs
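A rough sketch of that idea with Mongoose (the connection URI and variable names here are placeholders):
const mongoose = require('mongoose');

// Connection used by regular API queries (users, etc.).
const apiConnection = mongoose.createConnection(process.env.MONGO_URI);

// Separate connection dedicated to the GridFS upload storage engine, so a
// long-running upload does not queue behind (or in front of) normal queries.
const uploadConnection = mongoose.createConnection(process.env.MONGO_URI);

// Models for normal requests are registered on the API connection, e.g.:
// const User = apiConnection.model('User', userSchema);
// The upload connection is the one handed to the storage engine instead, e.g.:
// const storage = new GridFsStorage({ db: uploadConnection });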
Can you manage the architecture/infrastructure? If so, this challenge would be best solved by a different approach. It is actually a perfect candidate for a serverless solution, i.e. Lambda.
Lambda does not run multiple requests on one machine in parallel. Lambda assigns one request to one machine, and until that request is finished the machine will not receive any other traffic. Therefore you will never hit the limits you are encountering now.