Random numbers

Hello,

I am taking my first steps in KNIME, so apologies for my beginner questions.

To anonymise some data, I want to delete an ID column from my database and replace it with a new column of randomly generated IDs.

I found some information about the Random Number Assigner node, but I can't find it in my KNIME node repository.

Can somebody give me a hint on how to solve this? How do I replace a column with another one containing random numbers?

Thanks

@algo_angel welcome to the KNIME forum. There are several random number nodes for KNIME:

You could sort your data by such a random number and then use a Counter node and generate a RowID from that to have a pseudo-ID.


Thank you @mlauber71
I tried the Random Number Assigner, but the output is decimal numbers.
Is there a way to get INTEGER values?

@algo_angel you could round the number and convert it to integer

kn_forum_45106_random_number_integer.knwf (12.7 KB)


The problem is that after rounding you will have duplicates. For example, the decimals 100.20 and 100.25 will both return 100. I need unique random integers.

@mlauber71 rounding produces duplicate numbers.

@algo_angel this is why I suggested the Counter, so you will have unique values that are sorted in a random way. It would work like this:

@algo_angel

Does the ordering of rows in the data table provide any identifiable information?

If not, then just add a sequential number to each row using the counter generator.

If the row ordering does provide information, then you will need to re-sort the table:

  • Generate a column with a random number.
  • Sort the table on the random number.
  • Add a column with sequential numbers to each row.
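For readers who want to see the three steps above outside KNIME, here is a minimal Python sketch. The sample rows and the column names (`name`, `anon_id`) are made up for illustration; in KNIME the same idea is built from a random number node, a sorter and a counter node.

```python
import random

def assign_shuffled_ids(rows, seed=None):
    """Return a copy of rows in random order, each with a sequential anon_id."""
    rng = random.Random(seed)
    # Step 1: generate a random number for every row.
    keyed = [(rng.random(), row) for row in rows]
    # Step 2: sort the table on the random number.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: add a sequential number to each row.
    return [dict(row, anon_id=i) for i, (_, row) in enumerate(keyed)]

# Hypothetical example table.
rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}, {"name": "d"}]
shuffled = assign_shuffled_ids(rows, seed=42)
```

Because the IDs come from a counter, they are unique by construction; the randomness only affects which row receives which ID.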

If the IDs occur in several rows (e.g. a customer or personal ID), use a group node to create a table with unique IDs. Use one of the above methods to generate an anonymised ID for each of them, then use a join to append the anonymised ID to the original table.
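The group-then-join idea can be sketched in Python as follows. This is only an illustration under assumptions: the column names `customer_id` and `anon_id` and the sample rows are hypothetical, and a dictionary lookup stands in for the join.

```python
import random

def build_id_mapping(ids, seed=None):
    """Map each distinct original ID to a unique sequential anonymised ID."""
    unique_ids = sorted(set(ids))              # "group" step: one entry per ID
    random.Random(seed).shuffle(unique_ids)    # remove any ordering information
    return {orig: new for new, orig in enumerate(unique_ids, start=1)}

def anonymise(rows, id_column="customer_id", seed=None):
    """Replace id_column with anon_id; repeated IDs keep the same anon_id."""
    mapping = build_id_mapping((row[id_column] for row in rows), seed)
    result = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != id_column}
        new_row["anon_id"] = mapping[row[id_column]]   # "join" step
        result.append(new_row)
    return result

# Hypothetical example: customer C7 appears twice.
rows = [
    {"customer_id": "C7", "amount": 10},
    {"customer_id": "C2", "amount": 5},
    {"customer_id": "C7", "amount": 3},
]
result = anonymise(rows, seed=1)
```

Rows that shared an original ID still share an anonymised ID, so grouping and counting per customer keeps working after the swap.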

Other options include creating a one-way hash of the ID (e.g. using UUIDs or an MD5 checksum) to create a pseudonymised ID.
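As a rough illustration of the hashing option, here is a Python sketch using the standard `hashlib` and `uuid` modules. The function names and the `salt` parameter are my own additions, not something from this thread. The salt is there because an unsalted hash of a short or guessable ID can be reversed by simply hashing every candidate value.

```python
import hashlib
import uuid

def md5_pseudonym(original_id, salt=""):
    """Deterministic pseudonym: the same ID (and salt) always gives the same hash."""
    data = (salt + str(original_id)).encode("utf-8")
    return hashlib.md5(data).hexdigest()

def random_uuid_pseudonym():
    """Random pseudonym: not derived from the ID, so a mapping must be stored
    if the same ID has to receive the same pseudonym again later."""
    return str(uuid.uuid4())

hashed = md5_pseudonym("12345", salt="keep-this-secret")
random_id = random_uuid_pseudonym()
```

The MD5 variant is repeatable, which helps when the same ID appears in several tables; the UUID variant carries no information about the original ID at all.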

Hope that helps


There is a node for that :slight_smile:


OK, but why is there no option to generate an INTEGER random number from the start? Why is decimal possible and integer not? Strange and frustrating.

Yes, the ordering of rows provides information that should be anonymised.

There is a node in the Vernalis extension that seems to offer the option for integers. The question is whether they would be unique.

Oh la la, this node generates a powerful code. Nobody can guess it :joy:
Maybe there is something less secret …

Yes
I already tried Vernalis, but I still don't know how to add this to my database.

@algo_angel

By long-standing programming convention, base random functions generate a random number as a decimal (or float) between 0 (inclusive) and 1 (exclusive).

If you want an integer random number, you multiply the random number by the integer spread, take the floor to convert it to an integer, and add the lowest integer. This gives a range of integer random numbers between a (inclusive) and b (exclusive).

integer_random_number = a + floor(random() * (b - a))

If you know the number of rows in your table, you can then use that formula in a math node with b = number of rows and a = 0 to generate an integer random number. Note that numbers generated this way are not guaranteed to be unique.
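Here is a small Python sketch of that formula, using the standard `random` module in place of a KNIME math node. The helper name `random_int`, the seed and the row count of 1000 are assumptions for the example. It also counts how many distinct values come out when one number is drawn per row.

```python
import math
import random

def random_int(a, b, rng=random):
    """Integer in [a, b): a + floor(random() * (b - a))."""
    return a + math.floor(rng.random() * (b - a))

rng = random.Random(0)      # fixed seed so the run is repeatable
n_rows = 1000               # hypothetical table size
values = [random_int(0, n_rows, rng) for _ in range(n_rows)]
n_distinct = len(set(values))   # fewer than n_rows when any value repeats
```

Drawing as many numbers as there are possible values almost always produces repeats, which is why the thread recommends sorting on the random number and then numbering the rows with a counter.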

We have conventions, based on hard-earned experience over many years, so that we make fewer mistakes and it is easier to share and understand each other's code. It may seem frustrating at first, but it becomes easier with time.

On your second point: just generating a random identifier will not anonymise your table. You will need to sort it as well. In that case, doing the sort first and then adding a sequential number will achieve what you want.


Thank you very much for such a professional answer. I think I am OK now.


Thank you very much. I will sort first and then use the counter generation.


@algo_angel I think sorting and then using the counter is the best option, but just for the fun of it I tried to do this with the Vernalis random number node :slight_smile:

If your existing ID field already contains unique values, then you might want to try the Anonymization node. The node runs an SHA-1 hash function on the reference field and generates a new column with a unique hexadecimal hash result for each record. One feature of potential interest is that it offers a lookup function you can keep separate from the published results, in case a justifiable need arises to be able to look up a record’s pedigree.

Example output column:
STUDENT_ID
b21465b509bf6d5b10477479376aec892a6bddf8
3cc9c7c399cf57eacb4c54ec8dc09d2818409a97
0127e961faa0186c9487b3840bd1bf2ce8bd24db
9293d91104e6c3b53925128a552b0366978ee449
802e012abf41accc14f8cda31abb85d470844a57
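To show the general idea of a hash column plus a separate lookup table, here is a Python sketch using `hashlib.sha1`. It does not claim to reproduce the Anonymization node's internals (for example, whether a salt is applied); the function name and the sample student IDs are made up.

```python
import hashlib

def sha1_anonymise(ids):
    """Return (hashed_ids, lookup), where lookup maps each hash back to its ID."""
    hashed_ids = []
    lookup = {}
    for original in ids:
        digest = hashlib.sha1(str(original).encode("utf-8")).hexdigest()
        hashed_ids.append(digest)
        lookup[digest] = original   # keep this table separate from published data
    return hashed_ids, lookup

# Hypothetical input IDs.
student_ids = ["S001", "S002", "S003"]
hashed_ids, lookup = sha1_anonymise(student_ids)
```

Each hash is a 40-character hexadecimal string like the ones in the example output above; only whoever holds `lookup` can trace a hash back to the original record.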

