knime problem statement for a project to be delivered urgently

mlauber71 · July 10, 2021, 10:41am

@vishalpat13 I agree with @takbb, who has done a great job pointing out the questions that come with your data.

I do not want to insist too much, but I still think you might benefit from my article and sample workflows about duplicates and how to deal with them. This is a very common problem in data engineering (ETL), and you will have to make decisions; it is best to make them deliberately and consciously.

The motivation for the article comes from Big Data environments, where duplicates are a constant worry: there are no primary keys to prevent them, and sometimes different systems just dump data into a data lake without further checks. So there might be some parallels to what you have with 'unregulated' Excel files. For practical purposes, if you want to use a window function with several conditions, you could do that with a local H2 database that lives in a single file on your computer.
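To give a rough idea of what such a window function could look like, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for a local H2 database (this is a swap for illustration only; it needs SQLite 3.25 or newer, and in KNIME you would send similar SQL through the DB nodes). The table `customers`, its columns, and the rule "keep the most recently updated row, and on a tie the most complete one" are all hypothetical assumptions, not taken from your data.

```python
import sqlite3

# Hypothetical sample data: (customer_id, name, updated_at, filled_fields)
ROWS = [
    ("A1", "Alice", "2021-07-01", 3),
    ("A1", "Alice Smith", "2021-07-05", 5),
    ("A1", "A. Smith", "2021-07-05", 2),
    ("B2", "Bob", "2021-06-30", 4),
]

# Rank the rows within each customer_id by several conditions,
# then keep only the top-ranked row per ID.
DEDUP_SQL = """
SELECT customer_id, name, updated_at, filled_fields
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY updated_at DESC, filled_fields DESC
           ) AS rn
    FROM customers AS t
) AS ranked
WHERE rn = 1
ORDER BY customer_id
"""


def deduplicate(rows):
    """Load rows into an in-memory table and return one row per customer_id."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE customers ("
            "customer_id TEXT, name TEXT, updated_at TEXT, filled_fields INTEGER)"
        )
        con.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
        return con.execute(DEDUP_SQL).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in deduplicate(ROWS):
        print(row)
```

The point is that the `ORDER BY` inside the window is where you write down your decision about which duplicate "wins", so the rule is explicit and can be discussed and changed.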

It might also become necessary to ask the person responsible for the data for definitions and decisions: what to do if an ID does not match, or if several match (do we throw them away, list them, or keep two entries?). Sometimes they are not that eager to answer, because that would force decisions about possibly complicated or missing business processes, and now it is the data engineer / data scientist who is supposed to fix that with some 'magic' (… can't you do something with AI or Deep Learning or so?).
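Before going to the data owner, it can help to show them how many IDs fall into each case. The following is only a sketch under assumptions: the function name `classify_ids` and the two ID lists are hypothetical, and what to do with each group remains a business decision.

```python
from collections import Counter


def classify_ids(source_ids, reference_ids):
    """Group each distinct source ID by how often it occurs in the reference list."""
    counts = Counter(reference_ids)
    report = {"no_match": [], "one_match": [], "several_matches": []}
    # dict.fromkeys removes repeated source IDs while keeping their order
    for id_ in dict.fromkeys(source_ids):
        n = counts.get(id_, 0)
        if n == 0:
            report["no_match"].append(id_)
        elif n == 1:
            report["one_match"].append(id_)
        else:
            report["several_matches"].append(id_)
    return report


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only
    print(classify_ids(["A1", "B2", "C3"], ["A1", "A1", "B2"]))
```

With the three groups listed out, the question to the data owner becomes concrete: these IDs have no partner, these have several, what should happen to each?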