Structural interaction fingerprints (SIFts) are fingerprints that describe the interactions of a molecule/drug with the individual amino acid residues of a protein. A typical SIFt may look like this once parsed into a CSV-like format (the first row is the header).
Each of the 8 bit positions has a particular meaning.
I am attempting to import such a file using the CSV Reader, File Reader, or XLS Reader node and then cluster the fingerprints. Note that one may either combine the 3 fingerprints or use them separately.
Issue #1: The bitstring with eight zeros is always imported as a single 0, since the column is considered numeric. Thus, when the numbers are converted to bitstrings using the bitstring generator, they are of unequal length, which may affect clustering results.
Issue #2: If the above bitstrings are modified to have quotes so as to declare them as strings, they do get imported properly, but then it becomes a hard task to merge them and form a bit vector. It needs several operations and is not very satisfactory.
Issue #3: The Excel cells were formatted as type Text from the Format menu, and the double quotes were removed. On reopening such a file, Excel/OpenOffice have no issue showing the values as strings/text, so the sequence of eight 0s is shown correctly. However, importing this Excel file into KNIME again turns them into single 0s. KNIME does not seem to consider the meta information about the cell type, nor does it have a dialog to modify individual column types while importing.
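To make Issue #1 concrete outside of KNIME, here is a minimal Python sketch of what goes wrong and one way to repair it: once a value such as 00000000 has been parsed as a number, the leading zeros are gone, but they can be restored by left-padding to the known width. The helper name `restore_bitstring` and the sample values are made up for illustration, and the width of 8 is taken from the fingerprints described above.

```python
def restore_bitstring(value, width=8):
    """Left-pad a numerically parsed fingerprint back to a fixed-width bitstring.

    `value` is what a numeric import leaves behind, e.g. 0 instead of 00000000.
    """
    text = str(value)
    # Only the characters 0 and 1 are valid in a bitstring.
    if set(text) - {"0", "1"}:
        raise ValueError(f"not a bitstring: {text!r}")
    if len(text) > width:
        raise ValueError(f"{text!r} is longer than {width} bits")
    return text.zfill(width)


# Hypothetical values as they would look after a numeric import.
parsed = [0, 1000000, 10010, 11000000]
restored = [restore_bitstring(v) for v in parsed]
print(restored)  # ['00000000', '01000000', '00010010', '11000000']
```

Padding only works because every fingerprint is known to be exactly 8 bits long. Importing the column as text in the first place avoids the problem entirely.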
I posted this here and not in cheminformatics as it's a general issue with bitvectors.
I think you will have the best luck importing your bitvectors as strings. Subsequently, you can create various combinations of bitvectors using the String Manipulation node and finally convert the desired strings into a bitvector type using the BitVectorGenerator node. It shouldn't take more than a small handful of nodes to do all of this, and it should be fairly fast. Do you have any suggestions for making this process more satisfying?
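The same import-as-string, concatenate, then convert idea can be sketched in plain Python. This is only an illustration of the logic, not of the KNIME nodes themselves; the column names (`res_a`, `res_b`, `res_c`), the ligand IDs, and the sample rows are invented, since the original table is not shown here.

```python
import csv
import io

# Invented sample data: one 8-bit fingerprint per residue column.
SAMPLE = """ligand,res_a,res_b,res_c
lig1,00000000,01000010,00000000
lig2,10000001,00000000,00100000
"""


def merge_fingerprints(row, columns):
    """Concatenate the chosen fingerprint columns into one bitstring."""
    return "".join(row[c] for c in columns)


def to_bits(bitstring):
    """Convert a string of 0/1 characters into a list of booleans."""
    return [ch == "1" for ch in bitstring]


# The csv module never guesses types, so 00000000 keeps its leading zeros.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
columns = ["res_a", "res_b", "res_c"]
merged = {row["ligand"]: merge_fingerprints(row, columns) for row in rows}

for ligand, bitstring in merged.items():
    print(ligand, bitstring, sum(to_bits(bitstring)))
```

Changing the `columns` list is the counterpart of choosing which fingerprints to combine, or of using a single one on its own.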
According to the node description, it does get the type from the Excel file: It reads in the data from the sheet and sets a type for all columns that is compatible with the data in that column (in the worst case "String" covers all).
I can't replicate this issue. I've done the following:
Column 1: The values are entered as '00000000 (the leading ' forces the cell to text)
Column 2: The values are put in quotes "00000000"
Column 3: The values are set as a text cell
Read into KNIME:
1) Import using the XLS Reader: all the columns were read correctly and nothing gets converted to an int column
2) Read using a file reader. I manually had to set the columns to string but they all read correctly.
The BitVector generator works on both column 1 and column 3 without issue.
I've uploaded the workflow and the files I used. My version of KNIME is 2.7.2. Hopefully they work for you as well. I had to zip the csv and xlsx as the website won't accept either of these file types.
Thanks guys. Yes, it does import the columns as strings if I put 00000000 in single or double quotes.
I would prefer that the single or double quotes themselves aren't kept after importing the columns. This retention of quotes in the columns/cells means one has to use the String Replacer node to remove them before merging the columns into one string and then converting this merged column to a bitvector.
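For reference, the quote-stripping step described here amounts to something like the following Python sketch. The helper name `strip_quotes` and the sample cells are hypothetical; it assumes a cell is either wrapped in a matching pair of quotes or carries a leading apostrophe of the kind used to force text in a spreadsheet.

```python
def strip_quotes(cell):
    """Remove retained quote characters from an imported fingerprint cell."""
    cell = cell.strip()
    # A matching pair of single or double quotes around the value.
    if len(cell) >= 2 and cell[0] == cell[-1] and cell[0] in "'\"":
        return cell[1:-1]
    # A lone leading apostrophe, as used to mark a cell as text.
    return cell.lstrip("'")


# Hypothetical cells as they might look after import with quotes retained.
cells = ['"00000000"', "'01000010'", "'00100000"]
cleaned = [strip_quotes(c) for c in cells]
merged = "".join(cleaned)
print(cleaned)
print(merged)
```

Once the quotes are gone, the cleaned strings can be concatenated directly, which is the merge step that precedes the bitvector conversion.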
I will use the attached workflows and test them on my version of KNIME. Thanks again.