Sample GL Extract.xlsx (9.1 KB) CSV is Somewhat Space Delimitted
I have attached a CSV extract of General Ledger data, which is sampled to demonstrate the structure. The actual file has approximately 30,000 rows. It is space delimited, but I am struggling with lining up the data rows to the actual columns. The columns I am most interested in are transaction date (TrxDate), Journal Number (JrnlNo.), the $ amount (either Debit or Credit), and then everything else can go into a Comments or Notes field.
The first row has has a line that highlights the columns:
TrxDate JrnlNo. Orig.AuditTrailDistributionReference Orig.MasterNumber Orig.MasterName Debit Credit
The data rows then look like any one of these:
1/1/2021 201,543 CLRRX00020153 $2.32
1/1/2021 202,543 QMTRX9999369 Accounts Payable TELEPHONE EXP Smith, Bob $50.00
1/1/2021 201,545 DDDV00030999 OriginalCompany $2,022.00
Current Attempts
I’ve tried using Cell Splitter, but I’m still a monkey playing with sticks here.
Any help or advice from this fantastic community would be appreciated.
Didn’t work out of the box. Trying different configs on the regex, but perhaps I should do two regex’s - one on the first row, and then one on the rest.
Then, somehow, I should pull the data columns into the correct column.
@wisemanleo the current file is a Excel File. Would it be possible to share the actual CSV file to see the enconding and the separators as they would be in your task?
In fact, it was just a tip. I’ve overwritten it with a more complete (I hope) solution. It seems to work for the 12 rows of the Excel file (provided my interpretation of their structure is correct)
The regular expression I use is ^([^ ]+) ([^ ]+) ([A-Z0-9]+)(.+?)\$([^ ]+ +)*([^ ]+)*$
^ means “start of the row”
([^ ]+) ([^ ]+) ([A-Z0-9]+): this part of the regex captures 3 distinct groups of characters (TrxDate, JrnlNo., Orig.AuditTrailDistributionReference) which don’t contain spaces (in fact, the expression for the third group is overspecified)
(.+?)\$: the fourth group captures every character preceding the “$”. I don’t know how to split this text.
([^ ]+ +)*([^ ]+)*: the “$” is followed either by a debit and a space (separating it from the missing “Credit” value) or by a credit, followed by nothing. If you specify both groups, each one followed by an asterisk (meaning “appears 0 or more times”), you will always capture a number (hopefully) assigned to the right column and a missing value.
[ In fact, ([\d\.,]+ +)*([\d\.,]+)* is less prone to error, so I’ve edited the Regex node and the amended expression is ^([^ ]+) ([^ ]+) ([A-Z0-9]+)(.+?)$([\d\.,]+ +)*([\d\.,]+)*$ ]
$ means “end of the row”
I hope you find this workflow useful. Needless to say, its correctness strictly depends on my assumptions