Column Expression: Dynamically Extract and Define Type

mwiegand · January 7, 2023, 1:21pm

Hi,

picking up my original topic:

I want to extract the column type to save this alongside i.e. a CSV file in order to once after reading CSV again, reinstate the original column type(s). Reason for that is that data might get shared / transmitted in human readable text files types. But, reinterpreting column types is time consuming and failure prone.

I am trying to accomplish this by looping over the table of data types derived from aforementioned workflow.

Binary Object	org.knime.core.data.blob.BinaryObjectDataCell
Bit vector	org.knime.core.data.vector.bitvector.DenseBitVectorCell
Bit vector (sparse)	org.knime.core.data.vector.bitvector.SparseBitVectorCell
Boolean Value	org.knime.core.data.def.BooleanCell
Byte vector	org.knime.core.data.vector.bytevector.DenseByteVectorCell
Duration	org.knime.core.data.time.duration.DurationCell
Image	org.knime.knip.base.data.img.ImgPlusCell
JSON	org.knime.core.data.json.JSONCell
Local Date	org.knime.core.data.time.localdate.LocalDateCell
Local Date Time	org.knime.core.data.time.localdatetime.LocalDateTimeCell
Local Time	org.knime.core.data.time.localtime.LocalTimeCell
Number (double)	org.knime.core.data.def.DoubleCell
Number (integer)	org.knime.core.data.def.IntCell
Number (long)	org.knime.core.data.def.LongCell
PMML	org.knime.core.data.def.LongCell
PNG Image	org.knime.core.data.image.png.PNGImageCell
Path	org.knime.filehandling.core.data.location.cell.SimpleFSLocationCell
Period	org.knime.core.data.time.period.PeriodCell
RDKit Molecule	org.rdkit.knime.types.RDKitMolCell2
SVG Image	org.knime.base.data.xml.SvgCell
String	org.knime.core.data.def.StringCell
URI	org.knime.core.data.uri.URIDataCell
XML	org.knime.core.data.xml.XMLCell
Zone Date Time	org.knime.core.data.time.zoneddatetime.ZonedDateTimeCell

Not really an elegant but more a brute force approach (due lack of ideas currently). However, it seems the type can not be managed through variables, can it?

Best
Mike

mlauber71 · January 7, 2023, 1:33pm

If you absolutely must have text files that are still readable you could try and use the ARFF format which is also supported by KNIME

mwiegand · January 7, 2023, 1:55pm

Nice node but it’s not quite cutting it:

WARN ARFF Writer 5:1708 Class List (Collection of: String) not supported.

There are some complex data types I am curious they can exist outside of Knime and still being readable. Adding more to my explantion / the reason why, not everyone will or can use Knime. Having a human readable but still interchangeable format would make adoption easier. Some only trust, understand and therefore support what they can actually see.

Anyways, sometimes it’s necessary to leverage meta information such as the column type or date time format in order to properly make systems / processes understand each other. Having to guess / try & error each time, is one of the root causes of potentially nice solutions being never adopted.

mlauber71 · January 7, 2023, 8:48pm

@mwiegand if we are into interesting (re-)engineerings concerning text files I can add the case of a MySQL Dump being imported into H2 …

I think the most transferable file format still is Parquet. Another one might be ORC though with KNIME not exactly as flexible. Also SQLite and H2 are quite common to exchange information. But admittedly they are not text-only. KNIME itself supports a very wider arrange of data formats that you will not see in most other applications.

mwiegand · January 8, 2023, 7:17am

Good morning @mlauber71,

that sounds like a nightmare but pretty much falls into the nice exercise of “will it blend”

The Parquet nodes I became aware about recently too and thanks for the reminder. It feels like these are one if not the only possible solution. Got to familiarize with Parquet more. Will check ORC as well.

If I find a solution to ensure column types are preserved as some sort of meta file, I will circle back.

Best
Mike

mwiegand · January 8, 2023, 11:15am

Unfortunately both nodes do not write a human readable format nor could they be leveraged to extract the meta information from each column type.

The ARFF Writer does (yet) not support paths (to point to storage location, column type untested) and the column type Collection. However, it looks more promising as it’s kind of human readable

Unfortunately, the writer node seems to miss proper quote treatment causing the ARFF Reader to break:
ERROR ARFF Reader 5:1726 Configure failed (TokenizerException): New line in quoted string (or closing quote missing). In line 9.

Raised this ticket

Best
Mike

system · April 8, 2023, 11:15am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.