I am currently working on a project that involves extracting information from numerous websites using the Selenium node. With thousands of URLs to process, I am facing a challenge in terms of the efficiency of my workflow.
I’ve come across the Chunk Loop Start node, which seems promising for handling multiple rows simultaneously. However, when I connect it to the Webdriver and create a flow variable, it opens three webdrivers with the same URL.
I believe there might be a misconfiguration as I am new to using this method. Ultimately, I aim to process three URLs simultaneously, each with its own unique webdriver instance.
Here is an example of what I am trying to do.
You can also check the attached workflow.
Any guidance or suggestions on how to achieve this would be greatly appreciated. Thank you in advance for your assistance.
Optimization Selenium whith Chunk Loop.knwf (2.0 KB)
Hello (the topic sounds familiar - have we been in touch recently via email?),
Table Row To Variable should go into the loop body, so that it updates with each iteration.
A plain Chunk Loop Start will not parallelize anything - make sure to use the Parallel Chunk Start.
Once you have it working, you might want to look into pooling, which can save some time by keeping started browsers in a pool, so that they do not need to be restarted within each iteration:
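As a rough illustration of the pooling idea in plain Python (this is not how the Selenium nodes implement it, just a minimal sketch): a fixed number of browsers is started once and then handed out and returned for each URL. `FakeBrowser`, `make_pool`, and `visit` are hypothetical names, and `FakeBrowser` stands in for a real browser session so the sketch runs without Selenium installed.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeBrowser:
    """Hypothetical stand-in for a real browser session."""

    started = 0  # counts how many browsers were ever started

    def __init__(self):
        FakeBrowser.started += 1

    def load(self, url):
        return f"loaded {url}"


def make_pool(size):
    """Start `size` browsers up front and keep them in a queue."""
    pool = queue.Queue()
    for _ in range(size):
        pool.put(FakeBrowser())
    return pool


def visit(pool, url):
    """Borrow a browser, use it, and always hand it back."""
    browser = pool.get()  # blocks until a browser is free
    try:
        return browser.load(url)
    finally:
        pool.put(browser)


browser_pool = make_pool(3)
urls = [f"https://example.com/page{i}" for i in range(12)]

with ThreadPoolExecutor(max_workers=3) as executor:
    pages = list(executor.map(lambda u: visit(browser_pool, u), urls))

print(FakeBrowser.started, len(pages))
```

Twelve URLs are visited, but only three browsers are ever started, which is where the time saving comes from.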
You’re right, we exchanged emails and you recommended that I parallelize the process with the “Parallel Chunk Start” node.
I just modified my workflow as you explained by integrating the flow variable inside the loop. The URLs are updating correctly, the only issue is that I can’t configure the “Parallel Chunk Start” node because I want to process the URLs in batches of 5. When I specify 5 in the parameters, it processes the first 5 correctly but doesn’t continue with the rest.
If I let the node choose the number automatically, it opens 12 webdrivers, which corresponds to my number of rows.
What I would like to do is process the URLs in batches of 5: when the first 5 are processed, another 5 are processed, and so on. If I set the number of chunks to automatic, it might open a very large number of webdrivers, considering how much data I have.
Here is the modified workflow.
Thanks again for your help.
The first question for me would be: do those websites have an API?
You will need a nested structure, which will look somewhat like this - the outer loop handles the parallelization, the inner loop takes care of processing each parallel chunk row by row (to be honest, I wasn’t aware of that myself, as I haven’t touched the “parallel” nodes for ages):
This way you will process all URLs given in the input table.
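For readers who think in code, the same nested structure can be sketched in Python. This is only an analogy under assumptions, not what the KNIME nodes do internally: the contiguous chunk split, the helper names (`split_into_chunks`, `process_chunk`, `process_all`), and `fetch_title` (a hypothetical stand-in for the per-URL Selenium work) are all made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, chunk_count):
    """Split rows into chunk_count contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), chunk_count)
    chunks, start = [], 0
    for i in range(chunk_count):
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def fetch_title(url):
    # Hypothetical stand-in for opening a browser and scraping the page.
    return f"title of {url}"


def process_chunk(chunk):
    # Inner loop: handle the rows of one chunk one at a time.
    return [(url, fetch_title(url)) for url in chunk]


def process_all(urls, chunk_count=5):
    # Outer loop: run the chunks in parallel, one worker per chunk.
    chunks = split_into_chunks(urls, chunk_count)
    with ThreadPoolExecutor(max_workers=chunk_count) as pool:
        results_per_chunk = list(pool.map(process_chunk, chunks))
    # pool.map returns results in chunk order, so rows stay in input order.
    return [row for chunk in results_per_chunk for row in chunk]


urls = [f"https://example.com/page{i}" for i in range(12)]
results = process_all(urls, chunk_count=5)
print(len(results))
```

With 12 URLs and 5 chunks, at most 5 workers are active at once, and every URL is processed exactly once.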
Here’s the workflow:
Thank you very much for your help. After a few tries, I managed to get what I wanted, taking inspiration from your example. I replaced the Table Row to Variable node with the Table Row to Variable Loop Start node and added a Loop End just before the Parallel Chunk End node. It seems to be working: 5 webdrivers run and finish, then another 5 start, and so on.
I hope my data won’t get mixed up using this technique, but in any case, it’s going to save me a considerable amount of time. I used to think I had to leave my PC running for several days to process all the URLs.