Should I remove duplicate rows generated after pre-processing and transformation steps?

I realize that my training dataset has a lot of duplicated rows.

I already checked the Normalizer parameters and the columns created in the Rule Engine, to minimize information loss.

I ran a test and confirmed that duplicate rows change my model's results, so I'm unsure what to do.

Hi @ricardo_martins -

I would try removing the duplicate rows before you normalize or otherwise work on your data. You can most easily do this using the Duplicate Row Filter node.
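For readers outside of KNIME, the behavior of a duplicate-row filter can be sketched in plain Python. This is only an illustrative stand-in, not KNIME's implementation; the sample rows are made up, and it assumes the common default of keeping the first occurrence of each fully identical row.

```python
# Hypothetical rows after preprocessing: (age_group, state, wants_newsletter)
rows = [
    ("25-34", "SP", True),
    ("25-34", "SP", True),   # identical to the first row
    ("35-44", "RJ", False),
]

# Keep the first occurrence of each fully identical row,
# preserving the original row order.
seen = set()
deduped = []
for row in rows:
    if row not in seen:
        seen.add(row)
        deduped.append(row)

print(deduped)  # [('25-34', 'SP', True), ('35-44', 'RJ', False)]
```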


Thanks, @ScottF, but I explained it badly, sorry.

The duplicates arise after working on the data, as a result of the rules and normalizations I created.
For example, I created a column checking whether or not the person has a valid cell phone number to receive promotional newsletters.
After a few rules like this, single rows become duplicate rows.

@ricardo_martins this sounds very odd. I would recommend checking this, since simple rules should not result in duplicates. Are you using any joins in the process?


No, they are the result of generalizations. For example, two people end up in the same age group (age generalization) and in the same state (city generalization), etc.

I apply all these generalizations in my workflow, and after some of them I end up with a dataset containing identical rows.
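A minimal sketch of how this happens: two people who start out distinct become identical rows once their attributes are replaced by coarser categories. The names, age bins, and city-to-state mapping below are invented for illustration only.

```python
# Two distinct people before generalization (hypothetical data).
people = [
    {"age": 27, "city": "Campinas"},
    {"age": 31, "city": "Santos"},
]

def age_group(age):
    """Assumed binning rule: collapse exact age into a decade-style bucket."""
    return "25-34" if 25 <= age <= 34 else "other"

# Assumed city-to-state generalization table.
CITY_TO_STATE = {"Campinas": "SP", "Santos": "SP"}

generalized = [(age_group(p["age"]), CITY_TO_STATE[p["city"]]) for p in people]
print(generalized)  # [('25-34', 'SP'), ('25-34', 'SP')] -- now identical
```

Whether to drop such rows depends on the goal: if each row represents a distinct person, the duplicates still carry frequency information that some models use, so removing them is a modeling decision rather than a cleanup step.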