Practical use of N-Gram

I have a series of structured PDFs (filled-out application templates).
I created N-grams from them. Now my aim is to select some combinations of words and get back the documents containing them. Is anyone willing to help? Thanks.

Are you actually extracting N-grams from the pdfs’ text or are you manually creating an N-Gram like phrase which you want to use to search for occurrences in the individual pdfs? If the former, do you want to loop all of the N-grams through all of the pdfs?

I am extracting n-grams from the PDFs. I would then like to search for just 5 to 10 specific 2- or 3-grams.

This doesn’t really match your original post. Could you explain in some detail what exactly you are trying to do?

My bad, I'm quite new here.

- I fed around 80 PDFs to the Tika Parser node
- created a document column for each PDF
- erased punctuation
- created 3-grams
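
For readers who want to see the logic of the last two steps outside KNIME, here is a minimal Python sketch. It is not what the KNIME nodes do internally; `make_ngrams` is a hypothetical helper name, the sample sentence is made up, and the sketch assumes that lowercasing and dropping ASCII punctuation is an acceptable cleanup.

```python
import string

def make_ngrams(text, n=3):
    """Lowercase the text, drop ASCII punctuation, and return its word n-grams."""
    # Assumption: punctuation is deleted (not replaced by a space),
    # so "climate-related" becomes "climaterelated".
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    # Slide a window of n words over the text; shorter texts yield no n-grams.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up sample text standing in for the content of one parsed PDF
sample = "We will increase our ESG strategy, starting in 2024."
trigrams = make_ngrams(sample)
```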

Now I have 3 objectives (1 done, 2 still to do):

  1. Create a word cloud of terms related to ESG topics
  • filter only 3-grams containing buzz words (transitional, climate, etc.)
  • filter out 3-grams that come from the template structure (and therefore exist in all docs)
  • string to term
  • word cloud
    pic below
  2. (This is what I wanted to ask with my post.) Create a table with the 3-grams in rows (ordered by descending corpus frequency) and, in columns, the 3 frequencies plus the specific PDFs. That way I could see, e.g., that the 3-gram “Increase strategy ESG” has corpus frequency x, doc frequency y, sentence frequency z and can be found in docs a, b, c, e, g, etc.

  3. (New objective.) I would like to create a vocabulary of ESG-related buzz words and a table with the PDFs in rows and the number of buzz word hits in a column. The expectation is to rank the PDFs by how much ESG topics are described in them.
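
The table described in the second objective could be sketched in plain Python roughly as follows. This is only an illustration under assumptions, not a KNIME workflow: `ngram_table` and `sentence_ngrams` are hypothetical names, the sample documents are made up, sentences are split naively on `.`, `!` and `?`, and n-grams are not allowed to cross sentence boundaries.

```python
import re
from collections import Counter, defaultdict

def sentence_ngrams(sentence, n=3):
    """Return the word n-grams of one sentence (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def ngram_table(docs, n=3):
    """docs maps a document name to its text.

    Returns rows of (ngram, corpus_freq, doc_freq, sentence_freq, doc_names),
    sorted by descending corpus frequency.
    """
    corpus_freq = Counter()        # total occurrences over all documents
    sentence_freq = Counter()      # number of sentences containing the n-gram
    found_in = defaultdict(set)    # names of documents containing the n-gram
    for name, text in docs.items():
        for sentence in re.split(r"[.!?]+", text):  # naive sentence splitter
            grams = sentence_ngrams(sentence, n)
            corpus_freq.update(grams)
            sentence_freq.update(set(grams))
            for gram in grams:
                found_in[gram].add(name)
    rows = [
        (g, corpus_freq[g], len(found_in[g]), sentence_freq[g], sorted(found_in[g]))
        for g in corpus_freq
    ]
    rows.sort(key=lambda row: (-row[1], row[0]))  # ties broken alphabetically
    return rows

# Made-up documents standing in for the parsed PDFs
docs = {
    "a.pdf": "Increase strategy ESG. Increase strategy ESG now.",
    "b.pdf": "We increase strategy ESG.",
}
table = ngram_table(docs)
```

Each row then carries the three frequencies plus the list of documents, which is the layout asked for above.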

In general, as you might understand, my aim is to analyse and visualize how much ESG is “discussed/described” in the PDFs.
Hope it is clearer now, thanks.
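
The buzz word ranking from the third objective could look roughly like this in Python. Again this is a sketch under assumptions: `rank_by_buzzwords` is a hypothetical name, the documents and vocabulary are made up, and matching is on exact lowercase tokens with no stemming, so "transition" would not count as a hit for "transitional".

```python
import re
from collections import Counter

def rank_by_buzzwords(docs, buzzwords):
    """Count buzz word hits per document and rank documents by that count.

    docs maps a document name to its text; buzzwords is an iterable of single words.
    Returns a list of (doc_name, hits) sorted by descending hits.
    """
    vocab = {word.lower() for word in buzzwords}
    hits = {}
    for name, text in docs.items():
        token_counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
        # Exact token match only; no stemming or multi-word phrases
        hits[name] = sum(token_counts[word] for word in vocab)
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Made-up documents and vocabulary for illustration
docs = {
    "a.pdf": "Climate risk and climate transition.",
    "b.pdf": "Nothing relevant here.",
    "c.pdf": "Transitional climate plan",
}
ranking = rank_by_buzzwords(docs, ["climate", "transitional"])
```

Raw hit counts favour long documents, so dividing the hits by the number of tokens in each document might give a fairer ranking.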

It would be much easier to help you if you would share your current workflow. Make sure it includes the data or upload the data separately.


I would need to create test data. Maybe I will do that tomorrow.
But… what exactly is not clear?

A written description is useful for understanding what you want to do, but without data it's nearly impossible to figure out how to do it.
