Several new data types would be useful in KNIME. Based on my experience implementing a molecular spreadsheet while at Tripos, here are a few suggestions. Your feedback is requested!
Basic data types:
- A double vector, such as would be useful for a time series like mass spec data.
- A 2-D double array, such as a multiple-valued time series.
- A 3-D double matrix, such as GRID or other programs create to represent a molecular field.
- A "url" data type that retrieves information directly from the internet.
- A "model" data type, such as a pharmacophore description. This is complicated but useful, for example to manage multiple model attempts as parameters are varied.
Column information:
- Columns need the ability to write/read arbitrary text tags as key/value pairs. This enables adding information such as "what did I use to generate this fingerprint column".
- Columns need special tags that are basically subtypes, such as "explicit bitset", "RLE encoded bitset", "binary blob bitset" and so on. Is this better than having multiple data types that can be cast to a common type? (Maybe not...)
- Columns need a list of supported distance measures between column members (Euclidean, Tanimoto, cosine, etc.), tied to class info (this implies new classes for a DataType, btw).
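To make the distance-measure idea concrete, here is a rough sketch in Python (chosen only for brevity; this is not KNIME API). The type names "double_vector" and "bitset", the DISTANCE_MEASURES registry, and the helper functions are all hypothetical names invented for illustration. Bitsets are modelled as sets of on-bit indices, which is just one possible representation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; undefined for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm

def tanimoto_distance(a, b):
    """1 - Tanimoto similarity for bitsets given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty bitsets are treated as identical
    return 1.0 - len(a & b) / union

# Hypothetical registry: data type name -> measures that type supports.
DISTANCE_MEASURES = {
    "double_vector": {"euclidean": euclidean, "cosine": cosine_distance},
    "bitset": {"tanimoto": tanimoto_distance},
}

def supported_measures(type_name):
    """List the measure names a node could offer for a column of this type."""
    return sorted(DISTANCE_MEASURES.get(type_name, {}))

def distance(type_name, measure, a, b):
    """Look up the measure for the given type and apply it to two members."""
    try:
        fn = DISTANCE_MEASURES[type_name][measure]
    except KeyError:
        raise ValueError(f"{measure!r} is not supported for type {type_name!r}")
    return fn(a, b)
```

A clustering node could then call supported_measures("bitset") to populate its dialog, instead of each node hard-coding which measures make sense for which column.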
Other:
- Tags for tables to annotate them with date/time/version and more.
- Tags for individual cells, for example to flag marginally-valid data.
- A means to connect columns that belong together, such as values with associated error bars.
Perhaps the least obvious aspect of tags is that we will need a general ability to add, read, and write tags without understanding their contents, so that all relevant information is preserved for nodes that do know how to interpret them. Tags allow a level of generality that will prove valuable. How best to implement the creation of tags (think "Post-It notes") is not at all clear, however, so this note is intended more to provoke discussion than to assert a solution.
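As a starting point for that discussion, here is one possible shape for pass-through tags, sketched in Python (again, not KNIME API). The Column class, the scale_column node, and the tag keys "generator", "quality" and "derived_from" are all made up for this example. The point is only that a node copies every tag it receives, including ones it does not understand, and adds its own.

```python
class Column:
    """Hypothetical column: a name, its values, and free-form text tags."""

    def __init__(self, name, values, tags=None):
        self.name = name
        self.values = list(values)
        self.tags = dict(tags or {})  # arbitrary key/value text pairs

def scale_column(col, factor):
    """Hypothetical node that multiplies every value by a factor.

    It does not interpret any incoming tag. All tags are copied through
    unchanged so that a downstream node that does understand them still
    sees them; the node only adds a tag recording where the data came from.
    """
    new_tags = dict(col.tags)             # preserve everything, known or not
    new_tags["derived_from"] = col.name   # this node's own annotation
    return Column(
        name=col.name + "_scaled",
        values=[v * factor for v in col.values],
        tags=new_tags,
    )

# Usage: the "quality" tag survives a node that knows nothing about it.
source = Column("logP", [1.0, 2.0],
                {"generator": "some-tool 1.2", "quality": "marginal"})
result = scale_column(source, 2.0)
```

Whether blind copying is always right is an open question: a tag such as "units" may become wrong after a transformation, so some convention for invalidating tags would probably be needed as well.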
It seems important to create and share common DataTypes and other conventions such as distance methods soon, to avoid ending up with multiple blobs that are mutually unintelligible.