Parsing EDIFACT files

gentile · February 8, 2022, 10:25am

Hi everyone,

I have a few thousand EDIFACT files that I need to parse.
For those who don’t know what they look like: they consist of segments identified by 3-character tags (like DTM for date and time, NAD for name and address, etc.). Within a segment, data elements are separated by plus signs and their components by colons (the meaning of each position is defined by the segment), and each segment is terminated by a single quote mark.

The raw data looks like this:

UNB+UNOC:3+CUSTOMER+SUPPLIER+220206:0255+000000416'UNH+1+DELFOR:D:04A:UN:GAVB10'BGM+241::6:ANY+000000416+9'DTM+137:20220206:102'DTM+2'NAD+BY+BUYER::92'NAD+SE+SELLER CODE::92++SELLER NAME+SELLER STREET+SELLER CITY++POSTCODE+COUNTRY'...

After segmenting it for better readability:

[Image: grafik]
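For readers without the screenshot, the segmentation can be reproduced with a few lines of plain Python. This is only a minimal sketch: `split_segments` is a made-up helper name, the sample is the raw data from above cut off after the second NAD segment, and the naive split ignores EDIFACT's release character and any UNA header.

```python
# Raw data from the post, truncated after the second NAD segment.
raw = (
    "UNB+UNOC:3+CUSTOMER+SUPPLIER+220206:0255+000000416'"
    "UNH+1+DELFOR:D:04A:UN:GAVB10'"
    "BGM+241::6:ANY+000000416+9'"
    "DTM+137:20220206:102'"
    "DTM+2'"
    "NAD+BY+BUYER::92'"
    "NAD+SE+SELLER CODE::92++SELLER NAME+SELLER STREET+SELLER CITY++POSTCODE+COUNTRY'"
)


def split_segments(text):
    """Naive split: quote ends a segment, plus separates elements, colon separates components."""
    segments = []
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the last quote
        tag, *elements = chunk.split("+")
        segments.append((tag, [element.split(":") for element in elements]))
    return segments


segments = split_segments(raw)
for tag, elements in segments:
    print(tag, elements)
```

Each segment becomes a `(tag, elements)` pair, where every element is a list of its colon-separated components, so positions can be addressed by index.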

I would like to be able to extract a few segment names to column headers and extract part of the segment content as table rows. Like this:

BGM	DTM	NAD_BY	NAD_SE
000000416	20220206	BUYER CODE	SELLER CODE

I used the file reader to separate the segments into columns by using the quote as column delimiter, but I’m stuck now.

How do I get the segment names as headers and retrieve dedicated parts of the segments as rows?
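One possible way to get from segments to such a row, sketched in plain Python rather than KNIME nodes: walk over the segments, use the tag (plus the qualifier for DTM and NAD) as the column name, and pick one position as the value. The helper names and the chosen positions are assumptions based on the sample above, not a general EDIFACT rule.

```python
raw = (
    "BGM+241::6:ANY+000000416+9'"
    "DTM+137:20220206:102'"
    "NAD+BY+BUYER::92'"
    "NAD+SE+SELLER CODE::92++SELLER NAME'"
)


def split_segments(text):
    """Naive split into (tag, elements); ignores release characters."""
    segments = []
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if chunk:
            tag, *elements = chunk.split("+")
            segments.append((tag, [element.split(":") for element in elements]))
    return segments


def extract_row(segments):
    """Build one wide row: column name from tag/qualifier, value from a fixed position."""
    row = {}
    for tag, elements in segments:
        qualifier = elements[0][0] if elements else ""
        if tag == "BGM" and len(elements) > 1:
            row["BGM"] = elements[1][0]          # document number
        elif tag == "DTM" and qualifier == "137" and len(elements[0]) > 1:
            row["DTM"] = elements[0][1]          # document date
        elif tag == "NAD" and qualifier in ("BY", "SE") and len(elements) > 1:
            row["NAD_" + qualifier] = elements[1][0]  # party code
    return row


row = extract_row(split_segments(raw))
print("\t".join(row.keys()))
print("\t".join(row.values()))
```

The dictionary keys serve as headers and the values as the table row; one such dictionary per file or message would give one row each.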

Any help is highly appreciated. Thanks.

mlauber71 · February 8, 2022, 10:39am

@gentile you might try something along these lines:

Maybe you could provide us with a few files that would represent the full spectrum of your challenge and also a file showing the expected results.

goodvirus · February 8, 2022, 11:06am

Hi @gentile,

I don’t think KNIME has an EDIFACT parser, and it would be quite a task to write a real parser with just KNIME nodes, because EDIFACT has some nesting in it (I worked with it in the German utilities sector).
I would advise parsing it with an existing package and getting the output via a Python Script node.
For example, you could use https://github.com/nerdocs/pydifact (Python) or https://github.com/php-edifact/edifact (PHP)

Best Regards,

Paul
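One reason a plain split on quote, plus and colon is not a real parser: by default EDIFACT uses `?` as a release (escape) character, so `?'`, `?+` and `?:` are literal data. The sketch below shows an escape-aware split in plain Python. It is not the pydifact API; the function names and the sample data are made up for illustration, and a UNA header that redefines the separators is not handled.

```python
def split_escaped(text, separator, release="?"):
    """Split on separator unless it is preceded by the release character."""
    parts, current, i = [], [], 0
    while i < len(text):
        char = text[i]
        if char == release and i + 1 < len(text):
            current.append(char + text[i + 1])  # keep the escape pair for later levels
            i += 2
        elif char == separator:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(char)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text, release="?"):
    """Drop release characters, keeping the character each one protects."""
    out, i = [], 0
    while i < len(text):
        if text[i] == release and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse(text):
    segments = []
    for chunk in split_escaped(text, "'"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, *elements = split_escaped(chunk, "+")
        segments.append(
            (tag, [[unescape(c) for c in split_escaped(e, ":")] for e in elements])
        )
    return segments


# Made-up sample containing escaped separators.
sample = "NAD+SE+SELLER CODE::92++O?'BRIEN ?+ SONS'FTX+AAI+++10?:30 delivery'"
parsed = parse(sample)
print(parsed)
```

A library takes care of this kind of detail (and of the message structure), which is the main argument for not hand-rolling the parser.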

mlauber71 · February 8, 2022, 11:15am

@goodvirus that sounds like a very good idea

gentile · February 8, 2022, 11:39am

Thank you. I will take a look at those entries and see if they show me the light.
Are you still interested in seeing a few “real” files?

gentile · February 8, 2022, 11:46am

You are right: the nesting in EDIFACT can be a challenge. Maybe transforming EDIFACT to JSON as a first step could help with that.
Thanks for the Python hint: I had found those packages too, but I have hardly any experience with Python and don’t even know how to set them up or make them do anything.
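As a rough idea of what an EDIFACT-to-JSON first step could look like, here is a small sketch using only the Python standard library. `to_json` is a hypothetical helper, the sample is a shortened version of the raw data above, and the result is a flat list of segments, so it does not yet capture the nesting of segment groups.

```python
import json


def to_json(text):
    """Turn each segment into {"tag": ..., "elements": [[components], ...]}."""
    records = []
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if not chunk:
            continue
        tag, *elements = chunk.split("+")
        records.append(
            {"tag": tag, "elements": [element.split(":") for element in elements]}
        )
    return json.dumps(records, indent=2)


raw = "BGM+241::6:ANY+000000416+9'DTM+137:20220206:102'NAD+BY+BUYER::92'"
json_text = to_json(raw)
print(json_text)
```

The JSON text could then be handed to whatever JSON tooling is available for further processing.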

mlauber71 · February 8, 2022, 12:32pm

@gentile I could try and use the Python package on it and see if I can find a working example.

About KNIME and Python: if you add Python to your KNIME setup, it will greatly enhance your capabilities:

https://docs.knime.com/latest/python_installation_guide/index.html#_introduction

It might be a bit of a challenge at first, but once you have it set up it opens the world of Python for you.

gentile · February 8, 2022, 3:01pm

Sounds like the promised land
I guess I need to give Python a chance and see what happens.

Please find attached two calloff files: one of them containing one message (described by the UNH to UNT loop) and the other one containing multiple messages. Each message contains demand dates and quantities for one material number. Dates are specific days or a period.
Basically every message tells the vendor when they are required to ship a specific material to their customer.
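The UNH to UNT loop can be used to cut an interchange into its messages before looking at the content. A minimal sketch, assuming a naive quote/plus/colon split and a made-up two-message interchange (the helper names and the sample are illustrative, not taken from the attached files):

```python
def split_segments(text):
    """Naive split into (tag, elements); ignores release characters."""
    segments = []
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if chunk:
            tag, *elements = chunk.split("+")
            segments.append((tag, [element.split(":") for element in elements]))
    return segments


def split_messages(segments):
    """Collect everything from each UNH up to and including its UNT."""
    messages, current = [], None
    for tag, elements in segments:
        if tag == "UNH":
            current = [(tag, elements)]
        elif current is not None:
            current.append((tag, elements))
            if tag == "UNT":
                messages.append(current)
                current = None
    return messages


# Made-up interchange with two messages.
interchange = (
    "UNB+UNOC:3+CUSTOMER+SUPPLIER+220206:0255+000000416'"
    "UNH+1+DELFOR:D:04A:UN'BGM+241+1+9'LIN+++MAT-A:IN'UNT+4+1'"
    "UNH+2+DELFOR:D:04A:UN'BGM+241+2+9'LIN+++MAT-B:IN'UNT+4+2'"
    "UNZ+2+000000416'"
)

messages = split_messages(split_segments(interchange))
for number, message in enumerate(messages, start=1):
    print(number, [tag for tag, _ in message])
```

Segments outside a UNH to UNT pair (UNB, UNZ) are simply skipped here; each returned message is a list of its own segments.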

220207_025514.7fdd40cb-ffa4-4f87-92e7-c19b0093b9be_mod.txt (2.9 KB)
220207_025514.911ede23-62b9-4da1-8f59-f66382c3674c_mod.txt (9.1 KB)

The task is to list the dates and periods for a material to better understand changes that can occur from one calloff to another.

For better understanding I tried to describe the looping within the file here:

LIN contains the material number.
Actual demand is in QTY+113 with DTM+2 under SCC+24.
Forecast demand (described by a period) is in QTY+113 with DTM+64 (start) and DTM+63 (end), under SCC+4.

The output would be something like this:

NAD_SE	LIN	SCC_24_QTY	SCC_24_DTM	SCC_4_QTY	SCC_4_DTM
SELLER CODE	7915288-02	2304	20220207	1584	CW15
SELLER CODE	7915288-02	720	20220209	1728	CW16
SELLER CODE	7915288-02	432	20220211	2304	CW17
SELLER CODE	7915288-02	864	20220214	2016	CW18
SELLER CODE	7915288-02	720	20220216

Do you think this is feasible somehow?
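A sketch of how the looping described above might be walked in plain Python: remember the current NAD+SE, LIN and SCC, open a row at each QTY+113, and attach the following DTM segments to it. Everything here is an assumption for illustration: the helper names, the position of the material number in LIN, and the sample message, which is made up around a few values from the table above (the dates for CW15 are invented).

```python
from datetime import datetime


def split_segments(text):
    """Naive split into (tag, elements); ignores release characters."""
    segments = []
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if chunk:
            tag, *elements = chunk.split("+")
            segments.append((tag, [element.split(":") for element in elements]))
    return segments


def calendar_week(yyyymmdd):
    """ISO calendar week label such as CW15 for a YYYYMMDD date string."""
    return "CW%02d" % datetime.strptime(yyyymmdd, "%Y%m%d").isocalendar()[1]


def demand_rows(segments):
    rows, seller, material, scc, row = [], None, None, None, None
    for tag, el in segments:
        qualifier = el[0][0] if el else ""
        if tag == "NAD" and qualifier == "SE" and len(el) > 1:
            seller = el[1][0]
        elif tag == "LIN" and len(el) > 2:
            material, scc, row = el[2][0], None, None  # new material: reset state
        elif tag == "SCC":
            scc, row = qualifier, None
        elif tag == "QTY" and qualifier == "113" and scc in ("24", "4") and len(el[0]) > 1:
            row = {"NAD_SE": seller, "LIN": material, "SCC": scc,
                   "QTY": el[0][1], "DTM_FROM": None, "DTM_TO": None}
            rows.append(row)
        elif tag == "DTM" and row is not None and el and len(el[0]) > 1:
            if qualifier in ("2", "64"):    # delivery date or period start
                row["DTM_FROM"] = el[0][1]
            elif qualifier == "63":         # period end
                row["DTM_TO"] = el[0][1]
    return rows


# Made-up message fragment for illustration.
sample = (
    "NAD+SE+SELLER CODE::92'LIN+++7915288-02:IN'"
    "SCC+24'QTY+113:2304:PCE'DTM+2:20220207:102'"
    "QTY+113:720:PCE'DTM+2:20220209:102'"
    "SCC+4'QTY+113:1584:PCE'DTM+64:20220411:102'DTM+63:20220417:102'"
)

rows = demand_rows(split_segments(sample))
for r in rows:
    print(r)
```

This produces one row per quantity (a long format) instead of placing SCC+24 and SCC+4 side by side; the side-by-side layout from the table could be derived afterwards, for example by pivoting on the SCC column.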

mlauber71 · February 8, 2022, 4:28pm

@gentile I tried a first import and split up the messages with the Python library and imported them into KNIME (and Excel to show what they look like). Further work will have to be done to split the lines into the tables you want.

I am not sure whether the Python package also offers such options, but it might be worth a try (I am not familiar enough with the format to feed some sort of pattern into the code). Otherwise you will have to identify the blocks in the data and transform them to your needs.
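To give an idea of the kind of intermediate table meant here, the following sketch writes one row per segment (message number, position, tag, remaining data) as CSV, which a CSV reader in KNIME or Excel could pick up. It uses only the standard library and a naive split; `segments_to_csv` and the column names are made up for illustration and are not what the workflow itself uses.

```python
import csv
import io


def segments_to_csv(text):
    """One CSV row per segment: message_no, position, tag, data."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["message_no", "position", "tag", "data"])
    message_no = 0  # stays 0 for segments before the first UNH (e.g. UNB)
    position = 0
    for chunk in text.split("'"):
        chunk = chunk.strip()
        if not chunk:
            continue
        position += 1
        tag, _, data = chunk.partition("+")
        if tag == "UNH":
            message_no += 1  # each UNH starts a new message
        writer.writerow([message_no, position, tag, data])
    return buffer.getvalue()


raw = (
    "UNB+UNOC:3+CUSTOMER+SUPPLIER+220206:0255+000000416'"
    "UNH+1+DELFOR:D:04A:UN:GAVB10'"
    "BGM+241::6:ANY+000000416+9'"
    "DTM+137:20220206:102'"
)
csv_text = segments_to_csv(raw)
print(csv_text)
```

The message number makes it possible to group the lines per message later and split the `data` column further as needed.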

In the folder /script/ there is a Jupyter notebook to try the code ‘pure’: kn_forum_39612_python_edifact_parse.ipynb

gentile · February 8, 2022, 4:51pm

That looks amazing! Thank you so much @mlauber71 for your time and effort!
I should be able to pick it up now and take it further
