Knime & GitHub - Agile ETL-Collaboration, Solid Backups, and Effortless Workspace Synchronisation

mwiegand · March 16, 2025, 5:32pm

Hi Knimers,

when I shared the backup workflow, I promised to create a “How to use GitHub with Knime” guide. Today I am proud to share my 5-minute setup guide for combining Knime with GitHub.

Afraid of overhauling your Knime workflow?
Just create a branch!

Agile collaboration at scale?
Just invite as many collaborators as you want!

Storage concerns?
AWS, Google Cloud, Microsoft Azure, or your own storage to the rescue!

You can find the 5-Minute Setup guide here:

Or check out my GitHub repository here:

Agile ETL-Collaboration, solid backups, and effortless workspace synchronisation - company-wide, affordable, and in just five minutes! I’d love to hear your thoughts! Or, let’s connect at the Knime Spring Summit next week?!?

Enjoy
Mike

ScottF · March 20, 2025, 8:07am

Thanks Mike! As always, super cool approach!

dirkschumacher · March 24, 2025, 2:07pm

Any experience solving merge conflicts when collaborating on a workflow?

mwiegand · March 24, 2025, 4:46pm

Hi @dirkschumacher,

I haven’t faced that situation before but I’d assume you have two options:

Decide which source takes precedence
Try a selective merge by comparing the two open workflows and going through them node folder by node folder, using the node ID as guidance.
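
To make the second option more concrete, here is a minimal Python sketch that lists which node folders differ between two copies of the same workflow. It rests on assumptions I have not verified for every Knime version: node folders are assumed to end in "(#<id>)" and to contain a settings.xml file, and the function names are made up for illustration. It only reports differences; it does not merge anything.

```python
import hashlib
import re
from pathlib import Path

# Assumption: node folders end in "(#<id>)" and hold a settings.xml file.
NODE_ID = re.compile(r"\(#(\d+)\)$")


def node_fingerprints(workflow_dir):
    """Map node id -> (folder name, SHA-256 of settings.xml or None)."""
    result = {}
    for child in Path(workflow_dir).iterdir():
        match = NODE_ID.search(child.name)
        if not child.is_dir() or match is None:
            continue
        settings = child / "settings.xml"
        digest = (
            hashlib.sha256(settings.read_bytes()).hexdigest()
            if settings.is_file()
            else None
        )
        result[int(match.group(1))] = (child.name, digest)
    return result


def compare_workflows(mine, theirs):
    """Return node ids found on one side only, or differing on both sides."""
    a, b = node_fingerprints(mine), node_fingerprints(theirs)
    return {
        "only_mine": sorted(a.keys() - b.keys()),
        "only_theirs": sorted(b.keys() - a.keys()),
        # A renamed folder or a changed settings.xml both count as "changed".
        "changed": sorted(i for i in a.keys() & b.keys() if a[i] != b[i]),
    }
```

Calling compare_workflows on the two checked-out copies would give the list of node IDs to inspect by hand. The sketch only looks at the top level of a workflow, so nodes nested inside metanodes or components would need a recursive variant.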

In the near future I plan to run some real collaboration tests, with one workflow open in parallel on two machines, to work out the pros, cons, dos and don’ts of the GitHub integration.

Are you facing some serious problems or are you just playing around?

Cheers
Mike

dirkschumacher · March 25, 2025, 7:33am

Hi @mwiegand,

We are currently discussing the KNIME-GitLab combination and I’m wondering whether this is really a suitable solution.
Pressing the save button in the AP creates a consistent set of files that is guaranteed to match a workflow the AP will accept later on. Merging could require manually editing files that are not meant to be edited by hand. Does it always result in a workflow the AP can interpret? I don’t know.
I would not call defining a “master source” collaboration. The idea of merging sessions in which all contributors share their AP and try to agree on a common configuration isn’t appealing either, at least if there are more than two of them.
At the moment I’d say that it’ll work for one person and might be acceptable for two. I’m looking forward to seeing the results of your collaboration tests.

Kind regards
Dirk

mwiegand · March 25, 2025, 8:26am

Hi Dirk,

Yes, I’d agree with that, with the addition that clear collaboration guidelines need to be established. In my repository I’ve got a script that auto-commits, via amend, and pushes the Knime lock files of each workflow.
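
For readers who want a feel for the idea without opening the repository, here is a rough Python sketch. It is not the script from my repository. It assumes the lock marker of an open workflow is a file named .knimeLock, and all function names are hypothetical. The git options used (add --force, commit --amend --no-edit, push --force-with-lease) are standard.

```python
import subprocess
from pathlib import Path

LOCK_NAME = ".knimeLock"  # assumption: marker file of an open workflow


def find_lock_files(workspace):
    """Return lock files below the workspace, relative to it, sorted."""
    root = Path(workspace)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(LOCK_NAME)
        if ".git" not in p.relative_to(root).parts
    )


def lock_sync_commands(lock_files):
    """Build the git calls that fold the lock files into the last commit."""
    if not lock_files:
        return []
    return [
        # --force stages the files even if an ignore rule matches them.
        ["git", "add", "--force", "--", *lock_files],
        ["git", "commit", "--amend", "--no-edit"],
        # Amending rewrites the last commit, so a plain push would be rejected.
        ["git", "push", "--force-with-lease"],
    ]


def sync_locks(workspace):
    """Run the commands inside the workspace, stopping at the first failure."""
    for command in lock_sync_commands(find_lock_files(workspace)):
        subprocess.run(command, cwd=workspace, check=True)
```

Amending keeps the history free of one commit per lock change, at the price of a forced push, which is one more reason to agree on guidelines first. The sketch does not handle lock files that disappear when a workflow is closed.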

That said, as you concluded, the current stage is mainly meant to enable flexible backup and synchronisation without user or space limits. I will tackle the collaboration aspects in the not-so-distant future.

Cheers
Mike

dirkschumacher · March 25, 2025, 8:48am

Hi,

even if it turns out not to be usable at large scale in the end, the option to use it as a single user is very helpful. Thanks for exploring that field!
By the way, you use Git LFS in your repository. Is that in case you want to store larger data files in the repo as well?

Kind regards
Dirk

mwiegand · March 25, 2025, 9:10am

You are most welcome. I use Git LFS for several reasons:

Some data, like results saved in binary format, I’d not push to git
Performance reasons
(Data) Safety
Not requiring any sort of versioning of non-volatile data
Using LFS just as a temporary storage with the ability to discard it / start anew
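
As an illustration of the first point, the sketch below renders the .gitattributes lines that route binary files to LFS, in the same form that git lfs track writes them. The file patterns are hypothetical examples, not the ones from my repository, and the helper name is made up.

```python
def lfs_attribute_lines(patterns):
    """Render .gitattributes lines in the form `git lfs track` writes."""
    return [f"{p} filter=lfs diff=lfs merge=lfs -text" for p in patterns]


# Hypothetical patterns for binary results; adjust to your own workspace.
BINARY_PATTERNS = ["*.table", "*.zip", "*.parquet"]

if __name__ == "__main__":
    print("\n".join(lfs_attribute_lines(BINARY_PATTERNS)))
```

The printed lines can be appended to the .gitattributes file at the repository root. Patterns that contain spaces need escaping and are not covered by this sketch.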

My current setup looks as follows:

Workspace on separate exFAT disk
Non-volatile data, like downloaded log files, saved in separate “Knime-Data” folder
Large non-volatile data, like shapefiles of several GB in size, saved on a separate drive managed by Mountain Duck, which syncs them to AWS S3 and keeps the data consistent across different workstations

My motivation behind that approach is to:

Keep the Workspace size as small as possible
Sync temporarily generated data via LFS
Be able to start, stop, sync and resume while on the go

User scenario example:

Local machine with plenty of resources
Mobile laptop with limited resources
Team members with whom I can easily share separate workspaces, with or without data
Everyone can pick up work at any point in time and commit, so that someone else can continue from there
Backup, restore, document, etc., e.g. to mitigate update issues where even Windows can corrupt your data or, as I have experienced myself, a Knime update can corrupt a workflow

Happy kniming

Best
Mike