How to import tables from .docx documents via R Snippet

Hi, I need to analyze tables from hundreds of .docx files in KNIME.

Every file includes seven 4-column tables.

The R library “docxtractr” easily splits the tables, but I need to do this in KNIME, so I tried using the R Snippet. So far I haven’t succeeded.

I’m not a programmer and don’t understand how to transfer R code into the R Snippet. I also have a doubt: is the R library “docxtractr” supported by the R Snippet?

Thanks

jj

The R Snippet should be able to use nearly all R packages. The result is an R data.frame, so if you want to extract information from a file, the result must be a data.frame.

The easiest way would be if you could provide us with an example .docx file (could contain dummy data) and your R code so we might have a look.
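
To make the data.frame requirement concrete, here is a minimal sketch. The column names and values are dummy data invented for illustration; knime.out is the variable the KNIME R nodes read the output table from, and that line is left commented so the snippet also runs outside KNIME.

```r
# Dummy result, standing in for whatever the script extracts from a file
result <- data.frame(
  file     = c("prova01.docx", "prova01.docx"),
  table_no = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Inside a KNIME R Source (Table) or R Snippet node, hand the result back:
# knime.out <- result
```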

Thank you mlauber,

the R code is:

load("~/xxx/xxxx2019_06/0000_PROVA_SPLIT_TABELLE/r/.RData")
library(docxtractr)
real_world <- read_docx("C:/Users/xxxxx/Documents/xxx/xxxx_2019_06/0000_PROVA_SPLIT_TABELLE/r/prova01.docx")
docx_tbl_count(real_world)
tbls <- docx_extract_all_tbls(real_world)
docx_describe_tbls(real_world)
docx_extract_all_tbls(real_world, guess_header = TRUE, preserve = FALSE, trim = TRUE)
docx_extract_tbl(real_world, 1, header=TRUE)
docx_extract_tbl(real_world, 2, header=TRUE)
docx_extract_tbl(real_world, 3, header=TRUE)
docx_extract_tbl(real_world, 4, header=TRUE)
docx_extract_tbl(real_world, 5, header=TRUE)
docx_extract_tbl(real_world, 6, header=TRUE)
docx_extract_tbl(real_world, 7, header=TRUE)

and here is the dummy .docx:

prova01.docx (19.8 KB)
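
The seven docx_extract_tbl() calls above can be written as one loop. This is only a sketch: extract_tables is a hypothetical helper name, and it assumes docxtractr is installed in the R installation that runs it.

```r
# Hypothetical helper: return one data.frame per table in a .docx file
extract_tables <- function(docx_path) {
  doc <- docxtractr::read_docx(docx_path)
  n_tables <- docxtractr::docx_tbl_count(doc)
  # Same call as in the script above, once per table number
  lapply(seq_len(n_tables), function(i) {
    docxtractr::docx_extract_tbl(doc, i, header = TRUE)
  })
}

# Usage (adapt the path to your own file):
# tbls <- extract_tables("prova01.docx")
# tbls[[1]]   # first table as a data.frame
```

With hundreds of files, the same helper could be called once per file path.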

Based on your code I put together a rough solution with KNIME. You might adapt it further to your needs. If the tables in the docx all had the same structure, it would be possible to use one R Snippet and bind the results together with rbind.

I collect all the tables in a tibble in R, store that as an .RDS file, read it back into KNIME, and then extract all the tables and collect them. The varying structures are captured by a Loop End node. You might have to process the data further.
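
A minimal sketch of the store-and-reload step, using a plain list of dummy data frames instead of the tibble from the workflow. The file location in tempdir() is an assumption for the example; the workflow itself writes to the workflow folder.

```r
# Dummy tables with different structures, standing in for the docx tables
datalist <- list(
  data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE),
  data.frame(c = 3.5, d = "z", e = TRUE, stringsAsFactors = FALSE)
)

# Assumed file location for this example only
rds_file <- file.path(tempdir(), "datalist_r.rds")

saveRDS(datalist, rds_file)    # first R node: store the single object
restored <- readRDS(rds_file)  # second R node: read it back

# Each element can then be handed on as its own table
first_table <- restored[[1]]
```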

kn_example_r_docx_import_table.knwf (61.1 KB)

Thank you mlauber!

Your workflow is great! It made me understand a lot about how to set up this kind of job, and the output shown is a really big step towards my goal.

My problem now is implementing it, because my level is very basic (I have no mastery of flow variables, and I’m a newbie in R, but I’m trying to learn!)

So far I have been trying to set up the R Source (Table) node to execute the workflow with my real data, but it is now clear to me that I need detailed instructions on how to write the R script (“what should I write and where”), so I hope I’m not being a bore if I ask you to go into the details about this… Thanks anyway!

Glad I could be of help. Which parts of the R code would you like more explanation about? The good thing about combining R and KNIME is that you can replace some functions with R code and still enjoy the comfort of the KNIME software.

You will also find a lot of examples and code for all kinds of problems on the net. The task then is to adapt them to the R Snippet. One way to learn about KNIME might be to take a look at a few examples in my repository or elsewhere on hub.knime.com, starting with kn_example_r_…

Flow variables can be a little bit scary at the beginning, but they have very useful functions and can help a lot to structure the workflow and transport information. There are some good explanations:

https://www.knime.com/wiki/flow-variables
https://www.knime.com/knime-introductory-course/chapter7/section1

And then I would encourage you to just keep using KNIME. There are often several ways to do things in KNIME, and you can always learn as you go. There are a lot of (free) resources for learning KNIME, and there is always the forum and hub.knime.com.

Hello mlauber,

I must say that the added value of KNIME is the human factor: the possibility of coming into contact with experts like you, who help others learn tools whose potential I would otherwise not have been able to exploit. So, first of all, thanks mlauber and thanks KNIME community!

As I said, I am not a programmer, I am a beginner in R, and in KNIME I am almost self-taught. I am sure that some of my questions are “stupid”, because a better knowledge of R and the R Source node would be enough to answer them, but time is short and I have a deadline, so here are my problems:

  1. Set work directory: I can’t get it to work: what should I do?

workpath_r <- knime.flow.in[["context.workflow.absolute-path"]]
workspace_name <- paste0(workpath_r, "/workspace_r.RData")
dataobject_name <- paste0(workpath_r, "/datalist_r.rds")
setwd(workpath_r) # Set work directory

  2. Save and load the working environment, saveRDS: how do I implement these commands? Is it enough to remove the hashtags?

# save.image(workspace_name)
# load(workspace_name)

saveRDS(datalist, dataobject_name)

# big_data = do.call(rbind, datalist)

Till now, the console of the R Source node shows this message:

R cannot be initialized.
R Home is invalid.

PS: thank you for the iterative code in the middle, it’s just how I would have done it if I were a computer scientist! :wink:

The first thing would be to make sure R works at all. I wrote a lengthy piece about that; maybe you could check it out.

Then about .rds and .RData: with .rds you can store a single object from R and later reload it. That is what I used to store the tibble object (“datalist”) and reload it in the second R Snippet. .RData stores the whole active environment (save.image() and load()). I deactivated some lines that are not currently needed with hashtags (#); you can reactivate them by removing the hashtags.

With getwd() and setwd() you read and set the working directory of R. By default it is somewhere in the user’s space, and you can see it with getwd(). In this case I used the folder of the KNIME workflow, but you might change that.

The commented-out line big_data = do.call(rbind, datalist) would combine all the collected tables from the docx into a single table, but since their structures differ, that will not work in this case.
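
A small sketch of why do.call(rbind, ...) only works for tables with the same structure. All data frames here are dummy data invented for the example.

```r
# Two tables with identical columns: rbind stacks them
same <- list(
  data.frame(id = 1:2, value = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(id = 3:4, value = c("c", "d"), stringsAsFactors = FALSE)
)
big_data <- do.call(rbind, same)

# Two tables with different columns: rbind stops with an error
different <- list(
  data.frame(id = 1:2, value = c("a", "b"), stringsAsFactors = FALSE),
  data.frame(id = 3:4, other = c("c", "d"), note = c("e", "f"),
             stringsAsFactors = FALSE)
)
attempt <- try(do.call(rbind, different), silent = TRUE)
failed <- inherits(attempt, "try-error")
```

That is why the workflow keeps the tables separate and lets a Loop End node collect them.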

Hi Mlauber,

I read your links, so:

I first updated R (now 3.6.1 64 bit)
then RStudio (now 1.2.5019 64 bit)
then installed docxtractr (0.6.1) (both on R & RStudio) (downloading the package)
then installed Rserve 1.7-3.1 (both on R & RStudio) (downloading the package)
I can’t install Rserve_1.8-6: “Can not create symbolic link: the requested privilege does not belong to the client”. I tried with > install.packages("Rserve", repos = "http://rforge.net/", type = "source", INSTALL_opts = "--no-multiarch") but I got this:

C:/Program Files/R/R-3.6.1/Rtools/mingw_64/bin/gcc -I"C:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG -DRSERVE_PKG -DWin32 -I. -Iinclude -Iinclude/Win32 -O2 -Wall -std=gnu99 -mtune=generic -c RSserver.c -o RSserver.o
sh: C:/Program: No such file or directory
make: *** [C:/PROGRA~1/R/R-36~1.1/etc/x64/Makeconf:208: RSserver.o] Error 127
ERROR: compilation failed for package ‘Rserve’

  • removing ‘C:/Users/10487/Documents/R/win-library/3.6/Rserve’
  • restoring previous ‘C:/Users/10487/Documents/R/win-library/3.6/Rserve’
    Warning in install.packages :
    installation of package ‘Rserve’ had non-zero exit status

then installed Cairo 1.5-10 (both on R & RStudio) (downloading the package)

then uninstalled Knime Extensions

then restarted R, RStudio and Knime

but the R Source (Table) node does not complete the run.

Submitting the code line by line I get: “there is no package called ‘docxtractr’”, and all the other error messages are related to it (see the screenshot).


Moreover, it seems that root privileges (which unfortunately I don’t have) are needed to install Rtools, which is required to make docxtractr work… may I have your opinion?

Thank You!

I am not exactly sure what you mean by installed on both R and RStudio. Typically you should have only one R installation with one library. You could check your version’s library with this command:
.libPaths()

The error messages hint that you do not have the relevant package docxtractr installed.
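
One way to see in which library folder (if any) a package sits is sketched below. find_package is a hypothetical helper name, not part of any package; it only looks for a folder with the package's name inside each library path.

```r
# Hypothetical helper: report in which library folders a package is found
find_package <- function(pkg, libs = .libPaths()) {
  found <- vapply(
    libs,
    function(lib) dir.exists(file.path(lib, pkg)),
    logical(1)
  )
  unname(libs[found])
}

# An empty result means the package is not in any library this R session sees
find_package("docxtractr")
```

Running the same lines in RStudio and in the R that KNIME is configured to use shows whether both see the same libraries.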

And you will have to install Rserve in the latest version. One thing I had success with recently is described here:

After you have done that, you can tell RStudio to find the Rtools needed to compile new packages:

devtools::find_rtools()

To repeat that step by step:

  1. install the latest stable version of Rtools to the directory c:\Rtools (no fancy path please), and also install devtools
  2. set the environment path to RTools according to this entry
  3. set the PATH to RTools in the “.Renviron” file by typing:

usethis::edit_r_environ()

this should open the “.Renviron” file (yes with a dot at the beginning). There you should be able to enter the necessary PATH

PATH="C:\Rtools\bin;${PATH}"

  4. run the selection of the path

  5. tell RStudio to find RTools

devtools::find_rtools()

This should tell you that RTools is now active and that you can use it. Now you can try again to install Rserve in the latest version, like this:

install.packages("Rserve", repos = "http://rforge.net/", type = "source", INSTALL_opts = "--no-multiarch")

Or this:

install.packages("Rserve", repos = "http://rforge.net/", type = "source")

Or this (after you have downloaded the .tar file and put it in the path; you have to change the path according to your local machine):

install.packages("~/Downloads/Rserve_1.8-6.tar", repos = NULL, type = "source")

Good evening Mlauber,

These are the library paths in R and RStudio:

R

.libPaths()
[1] "C:/Users/10487/Documents/R/win-library/3.6" "C:/Program Files/R/R-3.6.0/library"

RStudio

.libPaths()
[1] "C:/Users/10487/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"

This is weird… I installed docxtractr, and it splits the tables in both R and RStudio… perhaps it is sufficient to call the library to make it work?

I have a problem with the Rtools step: my status is “poweruser”, not admin, so for now I cannot install anything in c:\ (a request is pending…)

PS: about Rtools I also get this warning:
package ‘Rtools’ is not available (for R version 3.6.1): do I have to downgrade the version?

Good evening, and thank You for your patience…!

Hmm, this mixture of libraries might cause problems. The question is which R version KNIME is using and whether the package is present there. It might be worth a try to completely remove all R versions and start fresh with just one library (I am not exactly sure how to force that).

Rtools can also just be installed somewhere on your computer, independently of R. Unfortunately RTools does not ‘like’ complicated path names (might be some bug). You might try:

C:/Users/10487/Documents/Rtools
C:/Users/10487/Rtools

and then try to show the path to R/RStudio. You can also try to force the package to a specific library of R and see if that makes any difference.
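
Forcing a package into one specific library could look like the sketch below. install_into is a hypothetical helper name; the lib argument of install.packages() and the lib.loc argument used when loading are standard R, but whether this fixes the KNIME side depends on which R installation KNIME is pointed at.

```r
# Hypothetical helper: install a package into one chosen library folder
# and report whether it can afterwards be loaded from exactly that folder
install_into <- function(pkg, lib = .libPaths()[1]) {
  install.packages(pkg, lib = lib)
  requireNamespace(pkg, lib.loc = lib, quietly = TRUE)
}

# Usage (needs internet access and write permission for the folder):
# install_into("docxtractr", "C:/Users/10487/Documents/R/win-library/3.6")
# library(docxtractr, lib.loc = "C:/Users/10487/Documents/R/win-library/3.6")
```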

Thank you Mlauber. First of all, I now have the same library for R and RStudio (it was sufficient to uninstall the previous version of R, 3.6.0, and keep the upgrade, 3.6.1):

R
.libPaths()
[1] "C:/Users/10487/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"

RStudio
.libPaths()
[1] "C:/Users/10487/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"

Then I uninstalled RTools from the previous directory (within the path of R) and reinstalled it in C:/Users/10487/Rtools: surprisingly, it now seems to be in C:/Rtools!
Is that possible?

I also downloaded and unzipped devtools in C:\Users\10487\devtools_2.2.1, but, even though I read the links, I’m not sure I understood points 2 & 3:

If after these measures RTools is there and the installation of Rserve works, everything should be fine. The remaining steps are only there to tell R where RTools is.

Enter the path into the environment file, which should be read when R starts. If that does not work, tell an active R session what the path is and try to find RTools, so that the active session ‘knows’ that RTools is there. I have no idea why there is such a fuss about this tool and why you can’t just enter the path somewhere and tell R what to use (or maybe there is a way and I have not found it yet).

Excuse me, Mlauber…

  1. I just can’t create the .Renviron file:
    PS C:\Users\10487> Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE_1=my_username"
    Add-Content : Could not find a part of the path 'C:\Users\10492\Documents\.Renviron'.
    At line:1 char:1
    + Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE …
        + CategoryInfo          : ObjectNotFound: (C:\Users\10492\Documents\.Renviron:String) [Add-Content], DirectoryNotFoundException
        + FullyQualifiedErrorId : GetContentWriterDirectoryNotFoundError,Microsoft.PowerShell.Commands.AddContentCommand
  2. but I tried installing docxtractr and got this message:
    package ‘docxtractr’ successfully unpacked and MD5 sums checked
    The downloaded binary packages are in
    C:\Users\10487\AppData\Local\Temp\Rtmpe66WJU\downloaded_packages
  3. after > library(docxtractr) I got this message:
    Error: package or namespace load failed for ‘docxtractr’:
    package ‘Rcpp’ does not have a namespace

I tried to install Rcpp but this was what I got:

Installing package into ‘C:/Users/10487/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)

There is a binary version available but the source version is later:
binary source needs_compilation
Rcpp 1.0.2 1.0.3 TRUE

installing the source package ‘Rcpp’

trying URL ‘https://cran.rstudio.com/src/contrib/Rcpp_1.0.3.tar.gz’
Content type ‘application/x-gzip’ length 2749025 bytes (2.6 MB)
downloaded 2.6 MB

ERROR: failed to lock directory ‘C:/Users/10487/Documents/R/win-library/3.6’ for modifying
Try removing ‘C:/Users/10487/Documents/R/win-library/3.6/00LOCK-Rcpp’
Warning in install.packages :
installation of package ‘Rcpp’ had non-zero exit status

The downloaded source packages are in
‘C:\Users\10487\AppData\Local\Temp\Rtmpe66WJU\downloaded_packages’

Is there a solution to this?
How can I create the .Renviron file? - or at least make docxtractr work for knime node?

… exhausted! :frowning:

If you want to use R, there is no alternative but to keep trying. If there are locks, that might indicate that another process is using the library. Close KNIME and maybe even restart the system.
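
If the lock folder is still there after everything has been closed, it can be removed by hand, as the error message itself suggests. A minimal sketch, where remove_stale_locks is a hypothetical helper name; it should only be run when no other R session or KNIME is using the library.

```r
# Hypothetical helper: delete leftover 00LOCK* folders from a library folder
remove_stale_locks <- function(lib = .libPaths()[1]) {
  locks <- list.files(lib, pattern = "^00LOCK", full.names = TRUE)
  unlink(locks, recursive = TRUE)
  invisible(basename(locks))  # names of the folders that were targeted
}

# Usage, with the library path from the error message:
# remove_stale_locks("C:/Users/10487/Documents/R/win-library/3.6")
# install.packages("Rcpp")
```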

To get .Renviron you need

usethis::edit_r_environ()

That should open the file and let you edit it. But you would have to make sure that you are allowed to write into your directory.
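
If usethis cannot be loaded, the file can also be written from plain R. This is a sketch under the assumption that R reads the user-level file from the home folder ("~/.Renviron"); add_renviron_line is a hypothetical helper name, and the folder must already exist and be writable.

```r
# Hypothetical helper: append one line to an .Renviron file,
# creating the file if it does not exist yet
add_renviron_line <- function(line, path = path.expand("~/.Renviron")) {
  cat(line, "\n", file = path, sep = "", append = TRUE)
  invisible(path)
}

# Usage, with the PATH entry from the steps above:
# add_renviron_line('PATH="C:\\Rtools\\bin;${PATH}"')
# Restart R afterwards so that the file is read.
```

path.expand("~") also shows which folder this R session treats as the home directory, which may help when the user name and the folder name do not match.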

I get this:

> usethis::edit_r_environ()
Error: package ‘Rcpp’ does not have a namespace

Rcpp installed at last… Trying next steps!

Never give up. Sometimes it is not so easy to install R and KNIME, but I think you have figured that out already …

Thank you for your precious support!

However, I fear I have to stop… Windows PowerShell says “no”…

PS C:\Users\10487> Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE_1=my_username"
Add-Content : Could not find a part of the path 'C:\Users\10492\Documents\.Renviron'.
At line:1 char:1
+ Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE …
    + CategoryInfo          : ObjectNotFound: (C:\Users\10492\Documents\.Renviron:String) [Add-Content], DirectoryNotFoundException
    + FullyQualifiedErrorId : GetContentWriterDirectoryNotFoundError,Microsoft.PowerShell.Commands.AddContentCommand


The problem is the ID number, I think.

I didn’t know about this mess until now…

My first ID was 10487 and my new ID is 10492, but for the system I am still 10487… except sometimes… and I can’t get access to user 10492…

probably this is the tombstone…