Should I remove duplicate rows generated by preprocessing and transformation steps?

ricardo_martins · December 31, 2022, 2:22pm

I've realized that my training dataset has a lot of duplicate rows.

I already checked the parameters of the Normalizer node and the columns created in the Rule Engine node, to keep information loss to a minimum.

I ran a test and confirmed that the duplicate rows change my model's results, so I'm unsure what to do.

ScottF · January 3, 2023, 8:48pm

Hi @ricardo_martins -

I would try removing the duplicate rows before you normalize or otherwise work on your data. You can most easily do this using the Duplicate Row Filter node.
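
For anyone following along outside KNIME, the same order of operations (remove duplicates first, then normalize) can be sketched in Python with pandas. This is only a rough analogue and not the Duplicate Row Filter node itself; the column names and sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw data; column names and values are made up for this sketch.
raw = pd.DataFrame({
    "age": [25, 25, 40, 61],
    "income": [3000.0, 3000.0, 5200.0, 4100.0],
})

# Step 1: drop exact duplicate rows, keeping the first occurrence.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Step 2: min-max normalize each column to the range [0, 1].
normalized = (deduped - deduped.min()) / (deduped.max() - deduped.min())

print(normalized)
```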

ricardo_martins · January 4, 2023, 11:35pm

Thanks, @ScottF, but I explained it badly, sorry.

The duplicates arise after I work on the data, as a result of the rules and normalizations I created.
For example, I created a column that checks whether or not the person has a valid cell phone number to receive promotional newsletters.
After a few rules like this, rows that were originally distinct become identical.

mlauber71 · January 4, 2023, 11:41pm

@ricardo_martins this sounds very odd. I would recommend checking this, since simple rules should not result in duplicates. Are you using any joins in the process?

ricardo_martins · January 6, 2023, 11:57am

@mlauber71

No, they are the result of generalizations. For example, two people in the same age group (a generalization of age) who are also in the same state (a generalization of city), etc.

I make all these generalizations in my workflow, and after some of them I end up with a dataset that contains some identical rows.
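
To make the effect concrete, here is a minimal sketch in Python with pandas (not KNIME nodes) of how generalizing columns can turn distinct rows into identical ones. The bin edges, column names, sample people, and the city-to-state mapping are all assumptions made up for the example.

```python
import pandas as pd

# Hypothetical records: every row is distinct before generalization.
people = pd.DataFrame({
    "age": [23, 27, 45, 48],
    "city": ["Campinas", "Santos", "Niterói", "Petrópolis"],
})

# Hypothetical city -> state mapping used to generalize the city column.
city_to_state = {
    "Campinas": "SP",
    "Santos": "SP",
    "Niterói": "RJ",
    "Petrópolis": "RJ",
}

# Generalize age into groups and city into state.
generalized = pd.DataFrame({
    "age_group": pd.cut(
        people["age"],
        bins=[0, 30, 50, 120],
        labels=["<=30", "31-50", "51+"],
    ).astype(str),
    "state": people["city"].map(city_to_state),
})

# Count rows that repeat an earlier row, before and after generalization.
n_dupes_raw = int(people.duplicated().sum())
n_dupes_generalized = int(generalized.duplicated().sum())

# Count how many original records share each generalized combination.
counts = (
    generalized.groupby(["age_group", "state"])
    .size()
    .reset_index(name="n_rows")
)

print(n_dupes_raw, n_dupes_generalized)
print(counts)
```

In this toy data the identical rows come from different original records, not from a join or an accidental duplication, and the per-combination counts show how many records each generalized profile stands for.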
