I am curious if anyone is aware of a method to do the following:
- Analyze a column with rows filled with strings such as:
Row 1: 77XYZ21
Row 2: XYZ277
Row 3: 244XYZ
Row 4: 7889922
“Learn” anytime a sequence of characters within a string is repeated in more than one row. In this case it would be XYZ as seen in rows 1, 2, and 3.
Extract these sequences into a new column.
Then I would like to run an analysis of the prevalence of each extracted sequence in the original column. In the simple four row example I provided, XYZ would be prevalent in 75% (3/4) of records.
Finally I would then like to identify all rows containing strings where there is no other row containing that sequence (ie completely unique).