Columnar Backend: Caching the next row to display failed. This is an implementation error.

Hi,

I stumbled across this error, which did not appear with the default row-based backend. Another error is thrown before it, which made no sense to me since the XPath configuration does not export anything of type int, only node cells. The XPath node also comes right after the node that threw the error in question, but I include it for the bigger picture.

ERROR XPath                3:1178     Execute failed: class org.knime.core.data.xml.XMLCell cannot be cast to class java.lang.Integer (org.knime.core.data.xml.XMLCell is in unnamed module of loader org.eclipse.osgi.internal.loader.EquinoxClassLoader @29532e91; java.lang.Integer is in module java.base of loader 'bootstrap')
ERROR XPath                3:1178     Execute failed: class org.knime.core.data.xml.XMLCell cannot be cast to class java.lang.Integer (org.knime.core.data.xml.XMLCell is in unnamed module of loader org.eclipse.osgi.internal.loader.EquinoxClassLoader @29532e91; java.lang.Integer is in module java.base of loader 'bootstrap')
ERROR String To XML        3:1177     Caching the next row to display failed. This is an implementation error.

Error with table preview

My attempts to reproduce the issue were not successful, but maybe someone from the KNIME team can draw some conclusions.

Likely unrelated XPath error just for completeness

Best
Mike

Thanks for the screenshots and documentation of the issue. I’ll see if I can get someone more well versed in the backend than I am to take a look.

Good morning @ScottF,

I ran into the issue again, and this time I cannot resolve it by resetting the failing nodes. However, after saving the table data and trying it in a separate workflow, I was not able to reproduce it.

I guess, though, that this is somehow related to the workflow in which the errors were thrown.

Note: This does not seem to be related to the columnar backend, and it is not exclusive to the XPath node either:

It rather seems to originate in the String to XML node. In the above screenshot I attempted to narrow down the cause, assuming something in the data is not correct. But no matter what I try, subsequent nodes keep failing.

Cheers
Mike

Hi,

there are two bugs in this ticket, and they are becoming quite a blocker.

Caching the next row to display failed. This is an implementation error.
This causes the cell content to not be parsed at all, resulting in erroneous data. It was reproducible while converting JSON to XML, too.


Interestingly, when I try to narrow down the cause to a specific data cell by looping over each cell individually, it works.

Proceeding with the extraction then works as well, which indicates the String / JSON to XML nodes fail to cope with large amounts of data.

The really odd part is that I am only partially able to reproduce it. Ignore the spaghetti wiring for a moment. In the workflow on the left it consistently fails; the one on the right, with the exact same data, does not.

I am starting to believe the bug must be thrown once in order to reproduce it, and even then it is only possible to do so in the workflow where it failed initially.

Here is the sample data. IMPORTANT: Remove the .txt extension to convert it back to .rar!
json-to-xml-java-lang-Integer-error.rar.txt (1.1 MB)

Here is the workflow. It might not be reproducible, but maybe you will spot something:

EDIT: Trying to use the fallback, I now ran into the issue on the Group Loop Start node as well, which didn’t happen before. It’s almost as if the error propagates from one node to other nodes.

Edit 2: Upon resetting and re-executing, I now get “ERROR Group Loop Start 3:1262 Execute failed: (“ClassCastException”): null”

Apologies for being frank, but what the heck is going on with KNIME since 5.1? This never happened before, and I have already worked with that particular data in previous versions.

Edit 3: Right-clicking the Loop Start node and executing just that node resolved the error :confused:


However, trying to reproduce it again resulted in constant failure :roll_eyes:

Edit 4: After letting some time pass and grabbing a coffee to assess my options, I came back, executed the failing Loop Start again and … et voilà, it worked again. I did absolutely nothing and it worked. Totally unpredictable :thinking:

Best
Mike

Hi Mike!

Phew, you found a nasty one there. Thank you for the investigation!

Some info to explain why you’re seeing these problems now: we have added full-fledged support for XML, PNG and JSON cells in the Columnar Backend. This means data is serialized to disk a little differently, and it enables access to these data types from Python code as well.

Tech detail 1: all of these data types are either stored in the table directly (if the data is small, here the JSONCell) or externalized into separate files. This decision happens on a per-cell basis. The way data is externalized to separate files has changed between the row-based and the Columnar Backend.

Tech detail 2: there are caches in place that make sure we can process the data in the next node without having to wait for the previous nodes’ tables to be saved to disk.

What you were seeing was most probably a bad combination of caches and a column containing both small and large XML or JSON cells. When iterating over the table, the cache didn’t notice that a row contained a “small” value rather than a large one, so we tried reading the data in the wrong format.
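To make that failure mode more concrete, here is a minimal Java sketch of the pattern. It is not KNIME’s actual implementation, and all names in it are made up: small values are kept inline while large values are replaced by an external reference, and a reader that assumes a single storage format for the whole column casts an inline value to the reference type, producing the same kind of ClassCastException as in the log above.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only (not KNIME code): per-cell storage plus a reader that
 *  wrongly assumes every cell of the column uses the same storage format. */
public class MixedStorageCacheDemo {

    // Hypothetical threshold: small payloads are stored inline,
    // large ones are replaced by a (fake) external reference.
    static final int INLINE_LIMIT = 16;

    static Object store(String payload) {
        if (payload.length() <= INLINE_LIMIT) {
            return payload;                          // inline: the value itself
        }
        return Integer.valueOf(payload.hashCode());  // externalized: a stand-in file offset
    }

    public static void main(String[] args) {
        Map<Integer, Object> columnCache = new HashMap<>();
        String[] rows = {"<a/>", "<xml>" + "x".repeat(100) + "</xml>"};
        for (int r = 0; r < rows.length; r++) {
            columnCache.put(r, store(rows[r]));
        }

        // Buggy reader: it assumes all cells are externalized (Integer offsets),
        // so the small inline cell in row 0 triggers a ClassCastException.
        for (int r = rows.length - 1; r >= 0; r--) {
            Integer offset = (Integer) columnCache.get(r);
            System.out.println("reading externalized cell at offset " + offset);
        }
    }
}
```

Because the storage decision is made per cell, a column mixing small and large values is exactly the case where such a stale per-column assumption breaks.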

At least, that is the scenario I have been able to reproduce and investigate, and I am preparing a fix. It should be available in the nightly build in a few days. If you could confirm that the problem is resolved for you once the fix becomes available, that would be awesome!

In the meantime, the row-based backend should not be affected by these issues.

Sorry for the inconvenience and thanks again for the detailed report!
Carsten


Hi @carstenhaubold,

I happen to have a hand for finding nasty things :rofl: Thanks for taking the time to investigate, and don't worry about the struggle. Sometimes these can be great motivators, though that one was exhausting for sure.

What you mentioned about bad caches reminds me of another issue I raised some time ago (or at least I believe I did): the preview erratically wasn’t updated. I don’t want to inflate the topic at hand, but in the bigger picture we might be able to “kill two birds with one stone” (though this idiom is more violent than its German counterpart).

I sent a screen recording of the issue, since it impacted different nodes, to Daniel. I’m not sure whether you have seen it, but it might reveal more details about the cause you managed to pin down.

About testing: unfortunately I am facing two issues here as well (unable to download, and unable to start because of a legacy Java version coming from nowhere), for which I have already raised a topic.

If I am able to resolve these, I’m happy to try to verify whether the issue got fixed.

Thanks again and have a nice „Feierabend“ :beers:
Mike


Ah right, the weird download issue. That one still eludes me :grimacing:

But regarding the screen recording (yes, Daniel showed me): this could have been caused by the same issue. Saving the file, as well as cache cleanups and/or garbage collector runs, could have dropped the values from the cache. Then the JSON cells need to be loaded from disk, and while reading we correctly detect whether the JSON content is stored in the table directly or externalized. At least I hope that this alleviates that problem, too :slight_smile:
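To tie that back to the sketch further up, the safe read path is the one that inspects what kind of value is actually stored before interpreting it, instead of assuming one format for the whole column. Again, this is purely illustrative and not KNIME code:

```java
/** Illustration only (not KNIME code): resolve a cached cell by its actual storage kind. */
public class SafeCellRead {

    static String readCell(Object cached) {
        if (cached instanceof String inline) {
            return inline;                      // small cell, stored directly in the table
        }
        if (cached instanceof Integer offset) {
            return loadExternal(offset);        // large cell, externalized to a separate file
        }
        throw new IllegalStateException("Unknown storage kind: " + cached);
    }

    static String loadExternal(int offset) {
        // Hypothetical stand-in for reading and deserializing the externalized data.
        return "<external offset=\"" + offset + "\"/>";
    }

    public static void main(String[] args) {
        System.out.println(readCell("<a/>"));   // inline value comes back as-is
        System.out.println(readCell(12345));    // reference is resolved from disk
    }
}
```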

Cheers,
Carsten


Two thoughts on this:

  1. Alongside the garbage collection button, wouldn’t it be a nice idea to add an option to regenerate all previews / caches, either globally or on a per-node basis? Some sort of soft reset, so to say. I also wonder whether it would be possible, e.g. via MD5 hashes, to determine cache validity automatically and thus automate the process, though that feels quite complex (see the sketch after this list). No complaint, but with this number of gears, the chance of failure rarely increases linearly; in my observation, it sometimes increases exponentially.
  2. In the case of large data sets, these background processes, like the save action, can take quite some time. Would it be possible to make them visible in the task manager as sub-processes? That would help debug the recently perceived “lagginess” or, if even possible, allow killing that sub-process.
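For point 1, here is a minimal sketch of what such a checksum-based validity check could look like. The class, fields and method names are all hypothetical and not an existing KNIME API; it only illustrates the idea of keying a cached preview to an MD5 digest of the data it was rendered from.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Illustration only: a cached preview that remembers an MD5 digest of the
 *  data it was built from, so a stale preview can be detected and regenerated. */
public class PreviewCacheEntry {

    private final String digest;   // MD5 of the serialized table data at caching time
    private final String preview;  // whatever was rendered for display

    PreviewCacheEntry(byte[] serializedData, String preview) {
        this.digest = md5(serializedData);
        this.preview = preview;
    }

    /** True if the cached preview still matches the current serialized data. */
    boolean isValidFor(byte[] currentSerializedData) {
        return digest.equals(md5(currentSerializedData));
    }

    static String md5(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("MD5").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "<row><cell>42</cell></row>".getBytes(StandardCharsets.UTF_8);
        PreviewCacheEntry entry = new PreviewCacheEntry(original, "42");
        System.out.println("cached preview: " + entry.preview);

        byte[] changed = "<row><cell>43</cell></row>".getBytes(StandardCharsets.UTF_8);
        System.out.println("still valid:    " + entry.isValidFor(original));  // true
        System.out.println("needs refresh:  " + !entry.isValidFor(changed));  // true -> regenerate
    }
}
```

Whether hashing the serialized data stays cheap enough for large tables is, of course, exactly the complexity concern raised above.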

Cheers
Mike