How to use PubMed Central in the Document Grabber and extract full text?

sanatan · December 4, 2014, 5:19am

I have been working on PubMed mining for 3 months now, and I think the PMC full texts are a better source.

There are 2 ways to do that. The first is to download all of PMC (50 GB) and put it into a queryable form. The issue there is that every article is one file, and there are close to 8 million files.

The other way is to add PMC to the Document Grabber and use a loop to grab selected full texts.

Now, I have absolutely no clue how to do that.

Thanks

San

kilian.thiel · December 10, 2014, 11:12am

Hi San,

Sorry, I did not fully understand your question. To store and index the full PMC data set (50 GB) in a queryable manner you need a good database or search engine. Elasticsearch could definitely handle this (http://www.elasticsearch.org/), but it would take some effort.

Querying PMC documents via the Document Grabber seems easier at first glance, in my opinion. What is your exact question about this?
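To give an idea of the effort involved, here is a minimal sketch in Python (outside KNIME) of the preparation step: turning one PMC .nxml article into a flat record and serializing records in the newline-delimited format that the Elasticsearch bulk API accepts. The function names, the index name "pmc" and the choice of fields are assumptions for illustration, and the sketch assumes the usual JATS layout (an article-id element with pub-id-type="pmc", an article-title, and a body made of p elements). It does not talk to a server.

```python
import json
import xml.etree.ElementTree as ET


def nxml_to_doc(nxml_text):
    """Turn one JATS .nxml article into a flat dict (hypothetical field names)."""
    root = ET.fromstring(nxml_text)
    pmcid = root.findtext(".//article-id[@pub-id-type='pmc']", default="")
    title_el = root.find(".//article-title")
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""
    paragraphs = []
    body = root.find(".//body")
    if body is not None:
        for p in body.iter("p"):
            # Collapse internal whitespace and drop inline markup.
            text = " ".join("".join(p.itertext()).split())
            if text:
                paragraphs.append(text)
    return {"pmcid": pmcid, "title": title, "text": "\n".join(paragraphs)}


def to_bulk_lines(docs, index_name="pmc"):
    """Serialize docs as action/source line pairs for the bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["pmcid"]}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

With close to 8 million files you would stream them in batches rather than build one huge payload, but the per-file work stays the same.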
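If it helps to see the idea outside KNIME, here is a rough Python sketch of selective grabbing against the NCBI E-utilities (esearch to find PMC IDs for a set of query terms, then efetch to pull the XML for only those IDs). This is not how the Document Grabber node is implemented; the function names are made up for illustration, the terms "a1" and "a2" are placeholders for your primary list, and efetch only returns full text for articles whose license allows it.

```python
import json
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_query(terms):
    """OR together the quoted query terms (for example the primary list)."""
    return " OR ".join('"{}"'.format(t) for t in terms)


def esearch_url(terms, retmax=100):
    """URL that searches PMC and returns matching IDs as JSON."""
    params = {"db": "pmc", "term": build_query(terms),
              "retmax": retmax, "retmode": "json"}
    return EUTILS + "/esearch.fcgi?" + urllib.parse.urlencode(params)


def efetch_url(pmc_ids):
    """URL that fetches the XML records for the given PMC IDs."""
    params = {"db": "pmc", "id": ",".join(pmc_ids)}
    return EUTILS + "/efetch.fcgi?" + urllib.parse.urlencode(params)


def grab_full_text(terms, retmax=100):
    """Network call: return the XML of articles matching any of the terms."""
    with urllib.request.urlopen(esearch_url(terms, retmax)) as resp:
        ids = json.load(resp)["esearchresult"]["idlist"]
    if not ids:
        return ""
    with urllib.request.urlopen(efetch_url(ids)) as resp:
        return resp.read().decode("utf-8")


# Example (needs network access):
# xml_text = grab_full_text(["a1", "a2"], retmax=20)
```

For larger result sets you would page through the IDs and respect NCBI's request rate limits.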

Cheers, Kilian

sanatan · December 10, 2014, 4:22pm

Hi Kilian,

Thanks for the reply.

I was thinking of PMC via the grabber, as it removes the need to create the dictionary for the first list (PMC does that for you), and you don't have to parse all the nxml files from PMC, which takes at least 72 hours on an i7 with 12 GB of RAM.

Primary list: a₁, a₂, ..., aₙ

Association list 1: b₁, b₂, ..., bₙ

The aim is to identify, in all the articles containing a1, the correlation between list1 and list2.

This is what I am doing right now:

Parsing all PMC nxml files (at least 72 hours), mining the primary list (and its synonyms), then extracting sentences, and then identifying the correlation between list1 and list2.

What I am looking for is a way to work selectively on only the articles associated with the primary list.

Thanks

San

kilian.thiel · December 16, 2014, 7:16pm

Hi San,

Are the association lists b1, b2, b3, ... bn lists of terms or of articles? As far as I understood, these lists are lists of terms, and you want to compute pairwise correlations between two sets of documents, d1 and d2. All documents of both sets contain the terms a1, a2, ... an. This can be ensured by using these terms as query terms in the Document Grabber. The resulting documents are then split up into two sets: one set containing terms of association list 1, the second set containing terms of association list 2.

What exactly do you mean by correlation in this context? One possibility would be to compute the pairwise cosine distance between the documents of set 1 and set 2.
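The split itself could look roughly like this in Python. It is only a sketch of my reading of your setup: documents are plain strings, matching is case-insensitive on whole words, and the function names are hypothetical. Note that a document mentioning terms from both lists ends up in both sets.

```python
import re


def contains_any(text, terms):
    """True if any term occurs as a whole word or phrase (case-insensitive)."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )


def split_by_lists(docs, list1, list2):
    """Return (set1, set2): docs mentioning list1 terms, docs mentioning list2 terms."""
    set1 = [d for d in docs if contains_any(d, list1)]
    set2 = [d for d in docs if contains_any(d, list2)]
    return set1, set2
```

Synonyms can be handled by simply adding them to the respective term list before splitting.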
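As a small illustration of that option, here is a Python sketch that builds plain term-frequency vectors and computes the cosine distance (1 minus cosine similarity) for every pair of documents from set 1 and set 2. The tokenization and the raw-count weighting are assumptions on my side; in practice you might prefer TF-IDF weights or stop-word filtering.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term frequencies over lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(u, v):
    """1 - cosine similarity of two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an empty document shares nothing with any other
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(set1, set2):
    """Matrix d[i][j] = distance between set1[i] and set2[j]."""
    vecs1 = [term_vector(d) for d in set1]
    vecs2 = [term_vector(d) for d in set2]
    return [[cosine_distance(u, v) for v in vecs2] for u in vecs1]
```

A distance near 0 means two documents use almost the same vocabulary; a distance of 1 means they share no terms at all.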

Cheers, Kilian
