Accessing Full Columns in Java using Arrow?

stellarpower · November 17, 2023, 6:58pm

Hi,

I am looking to perform some sliding-window calculations over my data, with implementations of what I need to do in C++. I thought that KNIME was built on top of Apache Arrow for its backend (I stand corrected, as it seems this is an optional backing format that was added a few years ago), and from memory I thought I would be able to get access to the whole table in the Java Snippet node. So my plan was to unpack a pointer to the Arrow columns, plus one for the output, and provide these to an entry point in my C++ code through a JAR library. This could then use Arrow to vectorise a function efficiently over the data. I know this is in a way how the Python integration works, providing the script with data as a pandas DataFrame or similar.

It seems like this may not be possible out of the box, but is there a way I can achieve this beyond writing a custom extension? I’d like to be able to rework the Java side as easily as I can with the snippet node, except that rather than having the code execute inside the loop, it would be able to see the whole table - or a page is probably fine if that is more suitable. I know KNIME can process data larger than main memory, so some sort of pointer towards the current row that gives me more general access, while still letting KNIME handle the looping and respecting its setup for that, would be great.

Thanks!

carstenhaubold · November 28, 2023, 4:03pm

Hi @stellarpower,

Interesting that you want to use Arrow in a Java Snippet node, that’s the first time I’ve heard this request – most people who use Snippet nodes don’t want to mess with the low-level details.

Unfortunately we do not offer a direct API to access the underlying Arrow buffers. There is quite some machinery involved when exchanging data between Arrow and KNIME, e.g. translating KNIME types to Arrow, caching the data, etc.
Honestly I don’t think it is planned to provide low-level Arrow access to Java Snippet users, or even Java KNIME node developers.

If you’re willing to go to Python, there you can access KNIME tables as Arrow (in both the Script node and in pure-Python nodes). This does not give you readable content for all KNIME data types (because some of them are Java-native), nor does it do any caching. But it gives you the low-level access that you might want.

Hope that helps,
Carsten

stellarpower · December 1, 2023, 4:00pm

Thanks, yes, I have used Python as a stop-gap, although I am not a fan of it at all, and it does also mean reading in the full table at once. I didn’t think my workflow was particularly complicated, but during calculations my memory use went up to about 30G! I had not yet tried just using it to call into C++, though. That would be one way to do it.

I may have a look at the implementation then and see. I was wanting to perform sliding-window calculations. Whilst the snippet is a convenient interface, I agree, if it’s only possible to access one row at a time, this makes things harder unless I add a whole load of extra columns to contain the window’s values. I was also going to look at modifying it to have an entry and exit point, so I could perform initialisation and cleanup on the C++ side.

Is it any easier if I don’t access Arrow directly and instead use whatever types KNIME has natively? I’m not attached to one or the other; I just assumed Arrow might be the easier way in, as it has native support in other languages and I believe it can vectorise my functions for me so that they operate over the data without my having to loop explicitly. But if this can be done using KNIME’s own types and this is simpler, then that would probably be the way to go. Alternatively, maybe the “scheduling” aspect should be handled totally in Java, which just calls into C++ once per execution. This would then be the equivalent of a “C++ Snippet” node - which may have some limitations in what can be done, but might be easier to duplicate and would be a start. I think this is more or less where I was planning on starting. I know there is a Sliding Window node provided in the community, but from what I have seen this transposes the data to extra rows that waterfall one by one. Maybe just a Java Snippet that windows over the data would be useful for others anyway, and a good place to start. Then I can add init and cleanup as members to override in there.
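
To make that last idea concrete, here is a minimal sketch of what I have in mind, in plain Java only. Everything in it is an assumption for illustration: `WindowedProcessor`, `MovingAverage` and `run` are hypothetical names and not part of any KNIME API, the column is passed in as a plain `double[]` rather than a KNIME table or Arrow buffer, and the moving average simply stands in for the eventual call into C++.

```java
import java.util.Arrays;

/** Hypothetical base class for illustration; not part of the KNIME API. */
abstract class WindowedProcessor {
    /** Called once before the first window, e.g. to set up native resources. */
    protected void init() {}

    /** Called once per full window; values are ordered oldest to newest. */
    protected abstract double processWindow(double[] window);

    /** Called once after the last window, even if processing threw. */
    protected void cleanup() {}

    /**
     * Slides a window of the given size over the column.
     * Rows that do not yet have a full window behind them yield NaN.
     */
    final double[] run(double[] column, int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        double[] out = new double[column.length];
        Arrays.fill(out, Double.NaN);
        init();
        try {
            for (int end = windowSize; end <= column.length; end++) {
                // Copy the current window so the subclass cannot modify the input.
                double[] window = Arrays.copyOfRange(column, end - windowSize, end);
                out[end - 1] = processWindow(window);
            }
        } finally {
            cleanup();
        }
        return out;
    }
}

/** Example subclass: a plain moving average standing in for the C++ call. */
class MovingAverage extends WindowedProcessor {
    @Override
    protected double processWindow(double[] window) {
        double sum = 0.0;
        for (double v : window) {
            sum += v;
        }
        return sum / window.length;
    }
}
```

Usage would be something like `new MovingAverage().run(values, 3)`. Copying each window keeps the sketch simple at the cost of extra allocation; a real version would probably reuse a buffer or hand over an offset and length instead.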

I’ve been looking at other things recently, but if it ends up being useful again, I’ll happily publish an extra extension or open a Merge Request - or I guess something else, as it seems the GitHub repository is a mirror of the source. Is there a way to contribute to the core code?

Thanks!
