Creating custom tags

Hi, I want to parse a PDF document that I have read with the Tika Parser. I converted the Content column to the Document type and now want to extract different tags like these:
#RaTopic string
#RaTeamMember list of string
#RaStartDate date
#RaQFeedback bool
#RaLocation enum

Typically, the text contains tags in this format: #Tag: “Tag Content” if the tag is of type string, date, or enum; for a bool, only the tag itself appears.

How can I quickly parse that?

Hello @knimerin
As I understand it, you want to extract the string content that follows each tag in the text; in most cases the two are separated by a space character: [#Tag ‘Content’ …]. You then want to generate structured data from these tags. Is that right?

Does your use case involve a single PDF document, or a set of standardized PDF documents?

In either case, you don’t need to work on a ‘document’ type column. A more sensible approach is to extract the ‘Content’ fields from a text column with regex queries.

You can take a look at the following workflow (only the PDF digest section; the Analysis section isn’t relevant for your use case). The Column Expressions node is suitable for coding the regex extraction, as in the example.
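To illustrate the idea of pulling tag content out of plain text with a regex, here is a minimal sketch in Python rather than in Column Expressions. The tag names come from the question; the sample values, the comma-separated list format, and the helper name `extract_tags` are assumptions made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical sample text: tag names are from the question, values are invented.
TEXT = '''#RaTopic "Quarterly review"
#RaTeamMember "Alice, Bob"
#RaStartDate "01.02.2023"
#RaQFeedback
#RaLocation "Berlin"
'''

# Tag name, an optional colon, then optional quoted content on the same line.
# Both straight and curly double quotes are accepted.
TAG_PATTERN = re.compile(r'#(Ra\w+):?(?:[ \t]*["“]([^"”\n]*)["”])?')

def extract_tags(text):
    """Return {tag: content}; a tag without quoted content is treated as a bool and maps to True."""
    tags = {}
    for m in TAG_PATTERN.finditer(text):
        name, content = m.group(1), m.group(2)
        tags[name] = True if content is None else content
    return tags

tags = extract_tags(TEXT)

# Type conversion is a separate step once the raw strings are extracted.
start_date = datetime.strptime(tags["RaStartDate"], "%d.%m.%Y").date()
# Assumption: list values are comma-separated inside the quotes.
team_members = [name.strip() for name in tags["RaTeamMember"].split(",")]
```

Keeping extraction (one generic pattern) separate from type conversion (date, list, bool) means a new tag only needs a conversion rule, not a new regex.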

BR

Perfect, I tried that, but I am really struggling with the regex. I have the following text:

 #RaStartDate "DD.MM.YYYY"
 [Date – Change1 – if necessary]
 [Date – Change2 – if necessary]
 #RaEndDate "DD.MM.YYYY"

#RaTitle "[Topic]"

and I tried to extract, e.g., #RaTitle with:

regexReplace(
    column("Content")
    , "#RaTitle[\\s]{1,}\"(.*)\"[\\n\\S\\s]{1,}"
    , "$1"
    )

That didn’t work. I also can’t manage to extract the dates. Any hints?

Hello @knimerin
Thank you for your answer and the sample text attached.

I’ve tried to reproduce your use case in this workflow. Be aware that regexes are very sensitive to the exact characters and their positions in the text, so a longer text sample or real data may be needed to produce a more robust pattern.

regexReplace(
    column("Content")
    , "[\\n\\S\\s]{1,}(?:#RaTitle[\\s]+[\\\"]+(.*)[\\\"])([\\n\\S\\s]{1,})?"
    , "$1"
    )
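The difference between the two patterns can be checked outside KNIME. The Python sketch below assumes that regexReplace only substitutes when the pattern matches the whole cell, which would explain why the leading wildcard matters; the helper `whole_cell_replace` is a hypothetical stand-in written for this illustration, not a KNIME function. The date pattern at the end is an additional suggestion that has not been tested in KNIME.

```python
import re

# Sample text from the thread.
SAMPLE = (
    ' #RaStartDate "DD.MM.YYYY"\n'
    ' [Date – Change1 – if necessary]\n'
    ' [Date – Change2 – if necessary]\n'
    ' #RaEndDate "DD.MM.YYYY"\n'
    '\n'
    '#RaTitle "[Topic]"\n'
)

def whole_cell_replace(text, pattern, replacement):
    """Hypothetical stand-in: substitute only if the pattern matches the entire text."""
    m = re.fullmatch(pattern, text)
    return m.expand(replacement) if m else text

# First attempt: nothing in the pattern can consume the text before #RaTitle.
first_attempt = r'#RaTitle[\s]{1,}"(.*)"[\n\S\s]{1,}'
# Revised pattern: leading wildcard added, trailing part made optional.
revised = r'[\n\S\s]{1,}(?:#RaTitle[\s]+["]+(.*)["])([\n\S\s]{1,})?'
# Same idea for a date tag (assumption: content sits in double quotes on one line).
start_date_pattern = r'[\s\S]*#RaStartDate\s+"([^"\n]*)"[\s\S]*'

unchanged = whole_cell_replace(SAMPLE, first_attempt, r'\1')   # no full match
title = whole_cell_replace(SAMPLE, revised, r'\1')
start_date = whole_cell_replace(SAMPLE, start_date_pattern, r'\1')
```

Note that the revised pattern still requires at least one character before #RaTitle, so it would not match a cell that begins with the tag; using `*` instead of `{1,}` for the leading wildcard, as in the date pattern, avoids that.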

BR
