CSV Transformation with Inconsistent but Patterned Structure

Sample GL Extract.xlsx (9.1 KB)
CSV is Somewhat Space Delimitted
I have attached a CSV extract of General Ledger data, which is sampled to demonstrate the structure. The actual file has approximately 30,000 rows. It is space delimited, but I am struggling with lining up the data rows to the actual columns. The columns I am most interested in are transaction date (TrxDate), Journal Number (JrnlNo.), the $ amount (either Debit or Credit), and then everything else can go into a Comments or Notes field.

The first row has has a line that highlights the columns:
TrxDate JrnlNo. Orig.AuditTrailDistributionReference Orig.MasterNumber Orig.MasterName Debit Credit

The data rows then look like any one of these:
1/1/2021 201,543 CLRRX00020153 $2.32
1/1/2021 202,543 QMTRX9999369 Accounts Payable TELEPHONE EXP Smith, Bob $50.00
1/1/2021 201,545 DDDV00030999 OriginalCompany $2,022.00

Current Attempts
I’ve tried using Cell Splitter, but I’m still a monkey playing with sticks here.

Any help or advice from this fantastic community would be appreciated.

if the rows have a common pattern, you could try something like this read_csv_regex – KNIME Hub

1 Like

Didn’t work out of the box. Trying different configs on the regex, but perhaps I should do two regex’s - one on the first row, and then one on the rest.

Then, somehow, I should pull the data columns into the correct column.

@wisemanleo the current file is a Excel File. Would it be possible to share the actual CSV file to see the enconding and the separators as they would be in your task?

It seems that the forum doesn’t allow upload of CSV’s? In any case, the CSV is “1 column” and the data does look like this:

TrxDate JrnlNo. Orig.AuditTrailDistributionReference Orig.MasterNumber Orig.MasterName Debit Credit
1/1/2021 201,543 CLRRX00020153 $2.32
1/1/2021 202,543 QMTRX9999369 Accounts Payable TELEPHONE EXP Smith, Bob $50.00
1/1/2021 201,545 DDDV00030999 OriginalCompany $2,022.00

In fact, it was just a tip. I’ve overwritten it with a more complete (I hope) solution. It seems to work for the 12 rows of the Excel file (provided my interpretation of their structure is correct)

3 Likes

Genius - this seems to work, and even splits between debit credit. How did you manage that part?

2 Likes

The regular expression I use is ^([^ ]+) ([^ ]+) ([A-Z0-9]+)(.+?)\$([^ ]+ +)*([^ ]+)*$

^ means “start of the row”

([^ ]+) ([^ ]+) ([A-Z0-9]+): this part of the regex captures 3 distinct groups of characters (TrxDate, JrnlNo., Orig.AuditTrailDistributionReference) which don’t contain spaces (in fact, the expression for the third group is overspecified)

(.+?)\$: the fourth group captures every character preceding the “$”. I don’t know how to split this text.

([^ ]+ +)*([^ ]+)*: the “$” is followed either by a debit and a space (separating it from the missing “Credit” value) or by a credit, followed by nothing. If you specify both groups, each one followed by an asterisk (meaning “appears 0 or more times”), you will always capture a number (hopefully) assigned to the right column and a missing value.
[ In fact, ([\d\.,]+ +)*([\d\.,]+)* is less prone to error, so I’ve edited the Regex node and the amended expression is ^([^ ]+) ([^ ]+) ([A-Z0-9]+)(.+?)$([\d\.,]+ +)*([\d\.,]+)*$ ]

$ means “end of the row”

I hope you find this workflow useful. Needless to say, its correctness strictly depends on my assumptions

4 Likes

Apologies - rude of me not to respond back right away, but that was a legendary explanation. Will treat it as my welcome into the world of RegEx.

2 Likes

I’m glad I was able to help!

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.