Zero-based vs. one-based row indexing

Dear Knimers,

  • The RowIDs of the “Table Creator” and “Empty Table Creator” nodes start with “Row0”, “Row1”, “Row2”
  • To take the first three rows with a Row Filter one has to start with “1” and end with “3”.
  • When taking the first three rows with a Row Filter by using flow variables to set the range, one has to use “0” and “2”!
  • When pointing at the third row in a node output window, the tooltip says “3”
  • The ROWINDEX in the Math, String Manipulation, and Java Snippet nodes starts with 0, 1, 2
  • The ROWINDEX in the Rule-based Row Filter starts with 1, 2, 3…
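The two Row Filter conventions in the list above differ only by a constant offset of one. As a minimal sketch in plain Python (not KNIME code; both function names are hypothetical and chosen only for illustration), the conversion between an inclusive one-based range and an inclusive zero-based range might look like this:

```python
def one_based_to_zero_based(first, last):
    """Convert an inclusive one-based range to an inclusive zero-based range."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return first - 1, last - 1


def zero_based_to_one_based(first, last):
    """Convert an inclusive zero-based range to an inclusive one-based range."""
    if first < 0 or last < first:
        raise ValueError("expected 0 <= first <= last")
    return first + 1, last + 1


# First three rows: the dialog range 1..3 corresponds to the
# flow-variable range 0..2 described in the list above.
print(one_based_to_zero_based(1, 3))  # (0, 2)
print(zero_based_to_one_based(0, 2))  # (1, 3)
```

Both ranges are inclusive at each end, so the number of selected rows (last - first + 1) is the same in either convention.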

KNIME is cool, but it must be Dijkstra’s worst nightmare :smile: or what exactly is the logic here?

Best,
Aswin

  • The Column Expressions node rowIndex() function starts with 1, 2, 3, … :crazy_face:

It was done intentionally to provide an endless source of interview questions, and to give knimers something to talk about at parties… :joy:

  • The “Row Filter (Labs)” uses 0, 1, 2…

It seems as if there is some type of trench war going on within the KNIME offices between the zero-based and the one-based crowds :smile:

It’s like the difference in floor numbering schemes between the UK and the US… which confuses tourists from both countries… :thinking::grin:

In the UK floor 1, or the 1st floor is the first floor above the ground (G) floor, whereas in the US, the ground floor, or main lobby, is also the first floor (floor 1), so the first floor above that is the second floor… And so on. Which scheme does the rest of the world use? I can’t recall…

Since KNIME HQ is in Zurich, maybe we can abolish row index numbers completely, and use the Swiss way of floor numbering: E (for Erdgeschoss, ground floor), F, G, H… problem solved :bulb:

Just one more thing…

Jokes aside, I think Knime should unify around one-based row indexing. The case for zero-based indexing, basically that it corresponds nicely to the “distance from the start of the list” and that it is convenient when working with pointers, carries little weight in Knime from a user’s perspective. Because of the row-oriented nature of Knime, it is hard or impossible to refer to rows other than the current one anyway. One-based indexing is the only option that makes sense in nodes such as the often-used Row Filter node, and other less-used nodes should follow the same convention to avoid confusion.

I don’t think it is a coincidence that a new “software engineering” language such as Go uses zero-based indexing, while a new “numerical analysis” language such as Julia uses one-based indexing.

Best
Aswin

I agree @Aswin. My first real experience of programming was on a ZX81 in the 80s where arrays were one-based. After that, just about every language I’ve used professionally had data arrays which have been zero (offset) based. The slight exception iirc was VB3 which let you choose which you wanted for your particular application!

So offset from start (with zero as the first element) is what I am used to as a programmer, and yet it still came as a surprise to me when I first used Knime, only to discover the label “Row0”! That just felt odd, since a row feels like a physical construct rather than an abstract memory construct, and so in the real world (and especially in spreadsheets!) I’m used to it starting at 1.

I hadn’t really thought about the inconsistency though until you raised it here.

Knime of course treads an interesting line between high level “no code” workflows and allowing some powerful techie “under the hood” scripting, and I think that “split personality” (or maybe I should more kindly refer to it as “dual-personality”) does show through a little at times.

The trouble now of course is that either we just accept that this idiosyncratic issue is what it is and live with it, or it gets changed.

But if it gets changed they (Knime) would have to ensure backward compatibility, and that would be an interesting challenge. (not one I’d like to be responsible for!)

Of course an environmental switch (like VB3) could allow the user to set the “base index” on a workflow-by-workflow basis (or maybe run in a “compatibility mode”). But I can see that causing issues and confusion, especially where workflows are so widely shared and there is such good collaboration as there is here. That small row number translation could also have a noticeable performance impact on large data sets.

I also dread to think how much existing code on the Knime AP and Knime Server software would have to be inspected and touched to enable such a change.

So yes, I’m supportive of the idea you raise, and have found it useful that you’ve raised it, but I’m not going to hold my breath! :wink:

Hi @aswin and @takbb -

Without getting too far into the weeds, I wanted to let you know that we do have tickets open to address some of the indexing inconsistencies identified in the original post (AP-13692, AP-11609). In the future these issues are likely to be addressed by both deprecating some existing nodes and promoting some other nodes out of Labs. But as you mentioned, we have to tread carefully to make sure we don’t break backwards compatibility, which is always a high priority for us. I don’t have an ETA on a fix but we are definitely aware of these items.

Thanks for the feedback and good discussion!

Just my 5 cents. Initially, programming languages (Algol, PL/1, and so on) indexed from 1. This is the way people think. But for better loop efficiency it was changed to 0, just around the time you started. Now we have a process of democratization of programming. That means the internal index will be 0 while the presentation level shows 1, at some loss of efficiency.

Dear Knimers,

I am reluctant to come back to the same subject again, and if there were an option not to bump this thread to the top of the forum, I would click it.

However, for completeness’ sake, I would like to mention that column numbering has the same problem:

  • The “Split Collection Column” node generates “Split Value 1”, “Split Value 2”, “Split Value 3” column names.
  • From the string column “col”, the “Cell Splitter” node generates “col_Arr[0]”, “col_Arr[1]”, “col_Arr[2]” column names.
  • The default column naming of the “Table Creator” is “column1”, “column2”, “column3”.
  • From the column “col”, the “Column to Grid” node generates “col (0)”, “col (1)”, “col (2)” column names.
  • The “Extract Column Header” node generates “Column 0”, “Column 1”, “Column 2” column names.
  • For the “Column Expressions” node, in contrast with the row indexing which is one-based, the column indexing is zero-based.

Also the formatting of the column name string is different each time…

Best
Aswin

I don’t think you should be sorry to bring it up again @Aswin … I think it’s actually useful to have a reference for where the inconsistencies are.

I only really noticed the Table Creator vs Extract Column Header numbering today, when I thought it should be a straightforward task to just rename my manually created columns on the fly… :crazy_face: Actually it was still relatively straightforward and just needed a String Manipulation, but still it would obviously be better if it all just clicked effortlessly into place.
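For reference, the renaming logic amounts to parsing the zero-based number and adding one. The sketch below is plain Python rather than the String Manipulation expression actually used, and the function name is hypothetical; it assumes the “Column 0”, “Column 1”, … and “column1”, “column2”, … patterns quoted earlier in this thread:

```python
import re


def to_table_creator_name(name):
    """Map 'Column N' (zero-based) to 'column<N+1>' (one-based).

    Names that do not match the 'Column N' pattern are returned unchanged.
    """
    match = re.fullmatch(r"Column (\d+)", name)
    if match is None:
        return name
    return f"column{int(match.group(1)) + 1}"


headers = ["Column 0", "Column 1", "Column 2"]
print([to_table_creator_name(h) for h in headers])
# ['column1', 'column2', 'column3']
```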

As for column splitting, I think the introduction of an option on such nodes where you can choose the numbering scheme (or at least the starting number) would make sense, just as was done with the Counter Generation node.

Meanwhile, back on the row indexing… I would imagine that for those nodes which have variables such as ROWINDEX that are zero-based in some places and one-based in others, it would be best to keep these to ensure backward compatibility, but to introduce new variables across the board such as ROWINDEX0 and ROWINDEX1 (or something like that), which would be zero-based and one-based indexes consistently available throughout.
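To illustrate the idea only: the sketch below is plain Python, not an existing KNIME feature, and the names `with_both_indexes`, `rowindex0` and `rowindex1` are hypothetical stand-ins for the proposed ROWINDEX0 and ROWINDEX1 variables. It exposes both indexes side by side so that neither convention has to be guessed:

```python
def with_both_indexes(rows):
    """Yield (rowindex0, rowindex1, row) for each row, in order."""
    for rowindex0, row in enumerate(rows):
        # rowindex1 is always exactly one more than rowindex0.
        yield rowindex0, rowindex0 + 1, row


for rowindex0, rowindex1, row in with_both_indexes(["a", "b", "c"]):
    print(rowindex0, rowindex1, row)
```

Because each name states its base explicitly, an expression using either one means the same thing regardless of which node it appears in.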
