Knime & GitHub - Agile ETL-Collaboration, Solid Backups, and Effortless Workspace Synchronisation

Hi Knimers,

After sharing the backup workflow, where I promised a “How to use GitHub with Knime” guide, I am proud to present my 5-minute setup guide for combining Knime with GitHub.

:question:Fear overhauling your Knime workflow?
:white_check_mark: Just create a branch!

:question:Agile collaboration at scale?
:white_check_mark: Just invite as many collaborators as you want!

:question:Storage concerns?
:white_check_mark: AWS, Google Cloud, Microsoft Azure, or your own storage to the rescue!

You can find the 5-Minute Setup guide here:

Or check out my GitHub repository here:

Agile ETL-Collaboration, solid backups, and effortless workspace synchronisation - company-wide, affordable, and in just five minutes! I’d love to hear your thoughts! Or, let’s connect at the Knime Spring Summit next week?!?

Enjoy
Mike

Thanks Mike! As always, super cool approach!

Any experience solving merge conflicts when collaborating on a workflow?

Hi @dirkschumacher,

I haven’t faced that situation before but I’d assume you have two options:

  1. Decide which source takes precedence
  2. Try a selective merge: open both versions of the workflow, compare them, and go through them node folder by node folder, using the node ID as guidance.
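
To make the second option a bit more concrete, here is a minimal sketch (not something from my repository) that lists which node folders differ between two copies of the same workflow. It assumes that node folders end in `(#<id>)`, e.g. `CSV Reader (#3)`; the function names `node_digests` and `compare_workflows` are made up for illustration.

```python
import hashlib
import re
from pathlib import Path

# Assumption: node folders end in "(#<id>)", e.g. "CSV Reader (#3)".
NODE_DIR = re.compile(r"\(#(\d+)\)$")


def node_digests(workflow_dir):
    """Map node id -> hash over all files inside that node's folder."""
    digests = {}
    for entry in sorted(Path(workflow_dir).iterdir()):
        match = NODE_DIR.search(entry.name)
        if not entry.is_dir() or match is None:
            continue
        digest = hashlib.sha256()
        # Hash relative paths and contents so a renamed folder alone
        # does not count as a change.
        for file in sorted(p for p in entry.rglob("*") if p.is_file()):
            digest.update(file.relative_to(entry).as_posix().encode())
            digest.update(file.read_bytes())
        digests[int(match.group(1))] = digest.hexdigest()
    return digests


def compare_workflows(mine, theirs):
    """Return node ids that differ, or that exist on one side only."""
    a, b = node_digests(mine), node_digests(theirs)
    return {
        "changed": sorted(i for i in a.keys() & b.keys() if a[i] != b[i]),
        "only_mine": sorted(a.keys() - b.keys()),
        "only_theirs": sorted(b.keys() - a.keys()),
    }
```

Calling `compare_workflows("path/to/my copy", "path/to/their copy")` gives a short list of node IDs to inspect by hand. It only tells you where to look; it does not merge anything.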

I plan to run some real collaboration tests in the near future, with one workflow open in parallel on two machines, to work out the pros, cons, dos and don’ts of the GitHub integration.

Are you facing some serious problems or are you just playing around?

Cheers
Mike

Hi @mwiegand,

We are currently discussing the KNIME-GitLab combination and I’m wondering whether this is really a suitable solution.
Pressing the save button in the AP creates a consistent set of files that is guaranteed to match a workflow the AP will accept later on. Merging could require manually editing files that are not meant to be edited by hand. Does that always result in a workflow the AP can interpret? I don’t know.
I would not call defining a “master source” collaboration. The idea of merging sessions, with all contributors sharing their AP and trying to agree on a common configuration, isn’t appealing either, at least if there are more than two of them.
At the moment I’d say that it’ll work for one person and might be acceptable for two. I’m looking forward to seeing the results of your collaboration tests.

Kind regards
Dirk

Hi Dirk,

Yes, I’d agree with that, with the addition that clear collaboration guidelines need to be established. In my repository I’ve got a script that auto-commits (via amend) and pushes the Knime lock files of each workflow.
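
Roughly, the idea can be sketched like this. This is a simplified illustration, not the actual script from my repository. It assumes the lock file is called `.knimeLock`, that the remote is `origin` and the branch is `main`; the function names are hypothetical.

```python
import subprocess
from pathlib import Path

LOCK_NAME = ".knimeLock"  # assumption: name of the lock file in an open workflow


def find_lock_files(workspace):
    """Return lock file paths relative to the workspace root."""
    root = Path(workspace)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob(LOCK_NAME))


def build_git_commands(lock_files, remote="origin", branch="main"):
    """Build the git calls that fold the lock files into the last commit."""
    if not lock_files:
        return []
    return [
        ["git", "add", "--force", "--", *lock_files],
        ["git", "commit", "--amend", "--no-edit"],
        # Amending rewrites the last commit, so a plain push would be rejected.
        ["git", "push", "--force-with-lease", remote, branch],
    ]


def publish_locks(workspace, dry_run=True):
    """Return the commands; run them inside the workspace unless dry_run."""
    commands = build_git_commands(find_lock_files(workspace))
    if not dry_run:
        for command in commands:
            subprocess.run(command, cwd=workspace, check=True)
    return commands
```

With `dry_run=True` the function only returns the commands, so you can check them first. Keep in mind that amending plus a forced push rewrites history, which is exactly why the collaboration guidelines matter.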

That said, as you concluded, the current stage is mainly meant to enable flexible backup and synchronisation without user or space limits. The collaboration aspects will be tackled in the not-so-distant future.

Cheers
Mike

Hi,

even if it turns out not to be usable at large scale in the end, the option to use it as a single user is very helpful. Thanks for exploring that field!
By the way, you use Git LFS in your repository. Is that just in case you store larger data files in the repo as well?

Kind regards
Dirk

You are most welcome. I use Git LFS for several reasons:

  1. Some data, like results saved in binary format, I’d not push to git
  2. Performance reasons
  3. (Data) Safety
  4. Not requiring any sort of versioning of non-volatile data
  5. Using LFS just as a temporary storage with the ability to discard it / start anew
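
For anyone who wants to try this, a small sketch of how the LFS tracking rules could be generated. The attribute string is what Git LFS writes to `.gitattributes` when you track a pattern; the file patterns themselves are hypothetical examples, not the ones from my repository, and the helper name is made up.

```python
def lfs_attribute_lines(patterns):
    """Render .gitattributes lines that route matching files through Git LFS."""
    return [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]


# Hypothetical patterns for binary results and archives (no spaces in patterns).
EXAMPLE_PATTERNS = ["*.table", "*.zip"]

if __name__ == "__main__":
    print("\n".join(lfs_attribute_lines(EXAMPLE_PATTERNS)))
```

The printed lines go into the `.gitattributes` file at the repository root, so only the small pointer files end up in the regular Git history.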

My current setup looks as follows:

  • Workspace on separate exFAT disk
  • Non-volatile data, like downloaded log files, saved in separate “Knime-Data” folder
  • Large non-volatile data, like shapefiles of several GB in size, saved on a separate drive managed by Mountain Duck, which syncs it to AWS S3 and keeps it in sync across different workstations

My motivation behind that approach is to:

  1. Keep the Workspace size as small as possible
  2. Sync temporarily generated data via LFS
  3. Be able to start, stop, sync and resume while on the go

User scenario example:

  1. Local machine with plenty of resources
  2. Mobile laptop with limited resources
  3. Team members with whom I can easily share separate workspaces, w/ or w/o data
  4. Everyone can pick up work at any point in time and commit, and someone else can continue from there
  5. Backup, restore, document etc., e.g. to mitigate update issues, where even Windows can corrupt your data or, as I have experienced myself, a Knime update can corrupt a workflow

Happy kniming :wink:

Best
Mike
