I am running a Python script that uses Selenium to open a webpage, log in, and make some selections on the site.
I would then like to loop over URLs with the rest of my code without needing to open the browser and log in each time.
Is it possible to use one Python node to get Chrome open and logged in with Selenium, and then carry the current browser instance over to another Python node that contains the code I am building into a loop?
This is a web scraping project where each URL is opened, scraped, turned into a Pandas df and then output to CSV. I have all of the code and can run single instances at a time inside of KNIME. I just need to be able to do this in bulk.
I really appreciate any thoughts/assistance!
I have never used Selenium myself. How do you authenticate, and what do you have at the intermediate steps? Is there some Python object that represents a Chrome instance with logged-in credentials? If yes, do you want to simply pass that object around KNIME nodes in a loop?
Yes exactly! I am using Selenium in Python to authenticate and open chrome, and I want to pass that as an object so it stays logged in and authenticated and simply then gets the page I need.
I just don’t know how to turn that piece into an object that can be passed along, or whether that would actually work in KNIME’s Python nodes.
All my past experiences using Selenium involved maintaining a connection between my current Python session and the browser. I have not seen an example using Selenium to interact with the same browser session from two or more Python shells, which is effectively what you are asking how to do because each Python node (in your case, in a loop) will start a new Python shell. Reconnecting to an existing browser session through Selenium sounds like it might be outside the scope of what Selenium supports, but perhaps searching for this topic will turn something up for you.
As to the idea of passing a Python object between KNIME’s Python nodes, this will require that the object be serialized via pickle. Generally, stateful objects such as a socket connection are inherently not serializable (in Python or any other language).
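To illustrate the point about serialization, here is a small sketch using only the standard library. It uses a plain socket as a stand-in for a live browser connection (that substitution is my assumption; I have not tried this with an actual Selenium driver object), and `can_pickle` is just a helper name made up for this example.

```python
import pickle
import socket


def can_pickle(obj):
    """Return True if pickle can serialize obj, False if it refuses."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


# A socket holds live operating-system state, so pickle rejects it.
sock = socket.socket()
try:
    socket_ok = can_pickle(sock)
finally:
    sock.close()

# Plain data, such as a dict of strings, serializes without trouble.
data_ok = can_pickle({"url": "https://example.com"})

print("socket picklable:", socket_ok)
print("plain dict picklable:", data_ok)
```

Plain data such as a list of URLs or a DataFrame can travel between nodes; the live connection itself cannot.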
I do not know if this suggestion applies well to your situation, but would it be possible to break your task into two parts? The first Python node would use Selenium to connect + prompt for authentication + scrape the first page for urls which you then output as a pandas DataFrame. The second Python node would use Selenium to connect + prompt for authentication (only once more) + scrape all urls provided as input to the node. This would permit you to filter or otherwise refine which urls survive and make it as input to the second Python node. This means you would perform a loop inside the second Python node but you will not need to twist or otherwise convince Selenium to do anything more fancy than you already are; and there are plenty of examples of using Selenium to do this in a single Python session in just this way.
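A rough sketch of what the loop inside the second node could look like, assuming the driver has already been created and logged in earlier in the same script. The names `scrape_urls` and `extract_title` are hypothetical, and grabbing the page title is only a placeholder for your real scraping code. The only Selenium calls used are `driver.get(url)` and `driver.title`; the Selenium setup is left in comments because the login steps are specific to your site.

```python
def scrape_urls(driver, urls, extract):
    """Visit every URL with one already-authenticated driver.

    Returns one dict per URL. A failure on one page is recorded in the
    "error" field instead of stopping the whole loop.
    """
    rows = []
    for url in urls:
        try:
            driver.get(url)          # same browser session, so still logged in
            row = extract(driver)    # your scraping logic for a single page
            row["url"] = url
            row["error"] = ""
        except Exception as exc:
            row = {"url": url, "error": str(exc)}
        rows.append(row)
    return rows


def extract_title(driver):
    """Placeholder extractor: replace with the real scraping code."""
    return {"title": driver.title}


# Intended use inside the node (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   ... log in once ...
#   rows = scrape_urls(driver, list_of_urls, extract_title)
#   driver.quit()
#   df = pandas.DataFrame(rows)   # then write to CSV or pass on as output
```

Keeping the driver as an argument means the browser is opened and authenticated once, and the list of URLs is the only thing that has to come in from the upstream node.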
Thank you for the insights! I like your idea of the two nodes to handle the looping. I’ll give it a try and post what I’m able to figure out.
Maybe not feasible as you already have the Python code in place, but are you aware there are ready-to-use Selenium Nodes for KNIME available? @qqilihq, the node developer, also provides trial licenses for evaluating and testing. Maybe this trial is sufficient for your project.