@veniapputrii for the fun of it I asked ChatGPT (https://chat.openai.com/chat#) for a simple example, put it into a KNIME Python node, and output the results as a data frame / KNIME table. Maybe you can check that out and read the additional hints about constructing such a test. This might also be done with KNIME nodes.
What I think would be important is to define the metrics of such a test: what would constitute an improvement significant enough to choose B over A? Typically such tests involve a huge number of interactions. See also the comments the machine made.
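To make the question of "significant improvement" a bit more concrete, here is a minimal sketch of one common way to evaluate an A/B test: a one-sided two-proportion z-test on conversion rates. This is not the code ChatGPT produced, just an illustration. The function name `two_proportion_z_test`, the conversion counts and the 5% threshold are all assumptions I made up for the example; your own metric and threshold may well be different.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """One-sided test of H1: conversion rate of B > conversion rate of A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of seeing a z at least this large if there is no real difference
    p_value = 1 - NormalDist().cdf(z)
    return {"rate_a": rate_a, "rate_b": rate_b, "z": z, "p_value": p_value}


# Hypothetical numbers: 10,000 visitors per variant (made up for illustration)
result = two_proportion_z_test(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)

# Assumed decision rule: prefer B only if p < 0.05
significant = result["p_value"] < 0.05
print(result, significant)
```

The dictionary could be turned into a one-row data frame if you want to pass it on as a KNIME table. Note how the standard error shrinks with the number of visitors, which is why small differences only become detectable with a large number of interactions.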
Using ChatGPT is fun and instructive since you can start a conversation and get code examples and arguments - but please be careful. Right now the chat is impressive and seems to be optimised to give seemingly definitive answers and ‘polished’ summaries, which unsurprisingly is what it has been designed for. While playing with it I have also encountered nice-looking Python code that seemed like it should work but did not, or not in that version (it looked too good to be true, with the parameters filled in convincingly). With Python you can at least test whether the code runs at all, but you will also have to check whether the results make sense from a professional perspective. So it is a nice tool that lets you write code in no time - but you should not put blind trust in it.