Node.js large file upload to MongoDB blocks the Event Loop and Worker Pool
So I want to upload large CSV files to a MongoDB cloud database with a Node.js server using Express, Mongoose, and Multer's GridFS storage engine, but once the file upload starts, my database becomes unable to handle any other API requests. For example, if a different client requests a user from the database while the file is being uploaded, the server will receive the request and try to fetch the user from the MongoDB cloud, but the request will get stuck because the large file upload eats up all the computational resources. As a result, the GET request performed by the client will not return the user until the file upload in progress is completed.
I understand that if a thread takes a long time to execute a callback (Event Loop) or a task (Worker), it is considered "blocked", and that Node.js runs JavaScript code in the Event Loop while offering a Worker Pool to handle expensive tasks like file I/O. I've read in this blog post on nodejs.org that in order to keep your Node.js server speedy, the work associated with each client at any given time must be "small", and that my goal should be to minimize the variation in Task times. The reasoning behind this is that if a Worker's current Task is much more expensive than other Tasks, it will be unavailable to work on other pending Tasks, effectively decreasing the size of the Worker Pool by one until the Task is completed.
In other words, the client performing the large file upload is executing an expensive Task that decreases the throughput of the Worker Pool, in turn decreasing the throughput of the server. According to the aforementioned blog post, when each sub-Task completes it should submit the next sub-Task, and when the final sub-Task is done it should notify the submitter. This way, between each sub-Task of the long Task (the large file upload), the Worker can work on a sub-Task from a shorter Task, solving the blocking problem.
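As far as I can tell, the partitioning pattern the blog post describes looks roughly like this in plain JavaScript (a sketch adapted from its averaging example, not related to my upload code):
// Compute an average in partitioned sub-Tasks so that other pending events
// can run on the Event Loop between each step.
function asyncAvg(n, avgCB) {
    let sum = 0;
    function help(i, cb) {
        sum += i;
        if (i === n) {
            cb(sum);
            return;
        }
        // Submit the next sub-Task instead of looping synchronously.
        setImmediate(help.bind(null, i + 1, cb));
    }
    help(1, (s) => avgCB(s / n));
}
asyncAvg(1000000, (avg) => console.log('avg of 1-1000000: ' + avg));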
However, I do not know how to apply this solution to my actual upload code. Are there any specific partitioned functions that can solve this problem? Do I have to use a specific upload architecture, or a Node package other than multer-gridfs-storage, to upload my files? Please help.
Here is my current file upload implementation using Multer's GridFS storage engine:
// Required packages.
const express = require('express');
const multer = require('multer');
const { GridFsStorage } = require('multer-gridfs-storage');

const router = express.Router();

// Adjust how files get stored.
const storage = new GridFsStorage({
    // The DB connection.
    db: globalConnection,
    // The file's storage configuration.
    file: (req, file) => {
        ...
        // Return the file's data to the file property.
        return fileData;
    }
});

// Configure a strategy for uploading files.
const datasetUpload = multer({
    // Set the storage strategy.
    storage: storage,
    // Limit uploaded files to 300 MB.
    limits: { fileSize: 1024 * 1024 * 300 },
    // Set the file filter.
    fileFilter: fileFilter,
});

// Upload a dataset file.
router.post('/add/dataset', (req, res) => {
    // Begin the file upload.
    datasetUpload.single('file')(req, res, function (err) {
        // Report upload errors instead of silently ignoring them.
        if (err) {
            return res.status(500).send(err.message);
        }
        // Get the parsed file from multer.
        const file = req.file;
        // Upload success.
        return res.status(200).send(file);
    });
});
Comments (4)
So after a couple of days of research, I found out that the root of the problem wasn't Node.js or my file upload implementation. The problem was that MongoDB Atlas couldn't handle the file upload workload at the same time as other operations, such as fetching users from my database. As I stated in the question post, Node.js was receiving API calls from other clients as it should, but they weren't returning any results. I now realize that was because they were getting stuck at the DB level. Once I switched to a local deployment of MongoDB, the problem was resolved.
According to this blog post about MongoDB best practices, the total number of active threads (i.e., concurrent operations) relative to the number of CPUs can impact performance and therefore the throughput of the Node.js server. However, I tried using dedicated MongoDB clusters with up to 8 vCPUs (the M50 cluster package), and MongoDB Atlas still could NOT upload the file while handling other client requests.
If someone made it work with a cloud solution I'd like to know more. Thank you.
I think this problem is caused by the buffer. Because the buffer has to receive all of the chunks before the entire buffer is sent to the consumer, buffering takes a long time. Streams solve this problem: they allow us to process the data as soon as it arrives from the source, which is not possible when the data is buffered and processed all at once. I found the storage.fromStream() method on the multer-gridfs-storage GitHub page and tested it by uploading a 122 MB file. It worked for me: thanks to Node.js streams, every chunk of data is consumed and saved to the cloud database as soon as it is received. The total upload time was less than a minute, and the server could easily respond to other requests during the upload.
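For reference, a rough sketch of the fromStream approach, based on the example in the multer-gridfs-storage README (the file path, metadata, and globalConnection below are placeholders; in a real route the readable stream would come from the parsed multipart request):
const fs = require('fs');
const { GridFsStorage } = require('multer-gridfs-storage');

const storage = new GridFsStorage({ db: globalConnection });

// Readable stream for the CSV (placeholder path, for illustration only).
const readStream = fs.createReadStream('/path/to/dataset.csv');
// Minimal request and file metadata objects passed through to the storage engine.
const request = {};
const fileInfo = { originalname: 'dataset.csv', mimetype: 'text/csv' };

// Each chunk is written to GridFS as it arrives instead of being buffered first.
storage.fromStream(readStream, request, fileInfo)
    .then((storedFile) => console.log('Stored file:', storedFile))
    .catch((err) => console.error('Upload failed:', err));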
I was having a similar issue, and what I did to solve it (to some extent) was to implement multiple connections for MongoDB.
That way the upload operation is handled by a new MongoDB connection, and during the upload you can still query the database through another connection.
https://thecodebarbarian.com/slow-trains-in-mongodb-and-nodejs
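A rough sketch of that idea with Mongoose (the connection URI and variable names here are placeholders):
const mongoose = require('mongoose');

// Connection used by regular API queries (users, etc.).
const apiConnection = mongoose.createConnection(process.env.MONGO_URI);

// Separate connection dedicated to the GridFS upload storage engine, so a
// long-running upload does not queue behind (or in front of) normal queries.
const uploadConnection = mongoose.createConnection(process.env.MONGO_URI);

// Models for normal requests are registered on the API connection, e.g.:
// const User = apiConnection.model('User', userSchema);
// The upload connection is the one handed to the storage engine instead, e.g.:
// const storage = new GridFsStorage({ db: uploadConnection });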
Can you manage the architecture/infrastructure? If so, this challenge would be best solved by a different approach. It is actually a perfect candidate for a serverless solution, i.e. Lambda.
Lambda does not run multiple requests on one machine in parallel. Lambda assigns one request to one machine, and until that request is finished the machine will not receive any other traffic. Therefore you will never hit the limits you are encountering now.