Should I remove duplicate rows generated after pre-processing and transformation steps?

I realize that my training dataset has a lot of duplicated rows.

I already checked the Normalizer parameters and the columns created in the Rule Engine, to minimize information loss.

I ran a test and confirmed that duplicate rows change my model's results, so I'm unsure what to do.

Hi @ricardo_martins -

I would try removing the duplicate rows before you normalize or otherwise work on your data. You can most easily do this using the Duplicate Row Filter node.
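For readers outside of KNIME, the behavior of a duplicate-row filter can be sketched in plain Python. This is only an illustrative stand-in, not KNIME's implementation; the sample rows are made up, and it assumes the common default of keeping the first occurrence of each fully identical row.

```python
# Hypothetical rows after preprocessing: (age_group, state, wants_newsletter)
rows = [
    ("25-34", "SP", True),
    ("25-34", "SP", True),   # identical to the first row
    ("35-44", "RJ", False),
]

# Keep the first occurrence of each fully identical row,
# preserving the original row order.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(deduped)  # [('25-34', 'SP', True), ('35-44', 'RJ', False)]
```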


Thanks, @ScottF, but I explained it badly, sorry.

The duplicates arise after working on the data, as a result of the rules and normalizations I created.
For example, I created a column checking whether or not the person has a valid cell phone number to receive promotional newsletters.
After a few rules like this, single rows become duplicate rows.

@ricardo_martins this sounds very odd. I would recommend checking this, since simple rules should not result in duplicates. Are you using any joins in the process?


No, they are the result of generalizations. For example, two people end up in the same age group (age generalization) and in the same state (city generalization), etc.

I apply all these generalizations in my workflow, and after some of them I end up with a dataset containing identical rows.
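A minimal sketch of how this happens: two people who start out distinct become identical rows once their attributes are replaced by coarser categories. The names, age bins, and city-to-state mapping below are invented for illustration only.

```python
# Two distinct people before generalization (hypothetical data).
people = [
    {"age": 27, "city": "Campinas"},
    {"age": 31, "city": "Santos"},
]

def age_group(age):
    """Assumed binning rule: collapse exact age into a decade-style bucket."""
    return "25-34" if 25 <= age <= 34 else "other"

# Assumed city-to-state generalization table.
CITY_TO_STATE = {"Campinas": "SP", "Santos": "SP"}

generalized = [(age_group(p["age"]), CITY_TO_STATE[p["city"]]) for p in people]
print(generalized)  # [('25-34', 'SP'), ('25-34', 'SP')] -- now identical
```

Whether to drop such rows depends on the goal: if each row represents a distinct person, the duplicates still carry frequency information that some models use, so removing them is a modeling decision rather than a cleanup step.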