Hi, I need to analyze tables from hundreds of .docx files in KNIME.
Every file includes seven 4-column tables.
The R library “docxtractr” easily splits the tables, but I need to do it in KNIME, so I tried using the R Snippet node, but so far I haven’t succeeded.
I’m not a programmer and don’t understand how to transfer R code into an R Snippet. I also have a doubt: is the R library “docxtractr” supported by the R Snippet?
The R Snippet should be able to use nearly all R packages. The result is an R data.frame, so if you want to extract information from a file, the result must be a data.frame.
The easiest way would be if you could provide us with an example .docx file (could contain dummy data) and your R code so we might have a look.
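As a minimal sketch of the data.frame requirement: in an R Source (Table) node, whatever should leave the node is assigned to `knime.out`, and it has to be a data.frame. The column names and values below are made-up dummy data, not taken from any real file.

```r
# Dummy result standing in for data extracted from a file (made-up values).
result <- data.frame(
  file     = c("report_a.docx", "report_b.docx"),
  n_tables = c(7L, 7L),
  stringsAsFactors = FALSE
)

# KNIME reads the node output from the variable `knime.out`;
# it must be a data.frame, not a list or a plain vector.
knime.out <- result
print(knime.out)
```

Run outside KNIME, this simply prints the table; inside the node, the same data.frame becomes the output table.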
Based on your code I put together a rough solution with KNIME. You might further adapt that to your needs. If the structure of the tables from the docx were the same, it would be possible to use one R Snippet and bind the results together with rbind.
I collect all the tables in a tibble in R, store that as an .RDS file, read it back into KNIME, and then extract all the tables and collect them. The varying structures are captured by a Loop End node. You might have to process the data further.
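The collecting step could look roughly like the sketch below. It is not the code from the workflow: it uses a plain named list instead of a tibble, and `extract_with_docxtractr` and `collect_docx_tables` are hypothetical helper names. It assumes the docxtractr package is installed when the default extractor is used.

```r
# Default extractor: read one .docx and return all its tables as a list.
# Needs the docxtractr package; only called when this function is used.
extract_with_docxtractr <- function(path) {
  doc <- docxtractr::read_docx(path)
  docxtractr::docx_extract_all_tbls(doc, guess_header = TRUE)
}

# Collect the tables of many files into one named list ("file#table number").
# `extract_fun` can be swapped out, e.g. for testing without real files.
collect_docx_tables <- function(files, extract_fun = extract_with_docxtractr) {
  out <- list()
  for (f in files) {
    tbls <- extract_fun(f)
    for (i in seq_along(tbls)) {
      out[[paste0(basename(f), "#", i)]] <- as.data.frame(tbls[[i]])
    }
  }
  out
}

# Usage idea (paths are placeholders, not run here):
# files    <- list.files("my_docx_folder", pattern = "\\.docx$", full.names = TRUE)
# datalist <- collect_docx_tables(files)
# saveRDS(datalist, "datalist_r.rds")
```

Keeping the tables as separate list entries avoids forcing tables with different columns into one data.frame too early.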
your workflow is great! It made me understand a lot of things about how to set up this kind of jobs, and the output shown is a really big step towards my goal.
My problem now is to implement it, because my level is very basic (I have no mastery of flow variables, and I’m a newbie in R - but I’m trying to learn!)
Till now I’m trying to set up the R Source (Table) node in order to execute the workflow with my real data, but now it is clear to me that I need detailed instructions on how to compile the R script (“what should I write and where”) so I hope not to be boring if I ask you to go into the details about this… Thanks anyway!
Glad I could be of any help. What parts of the R code would you like to have more explanations about? The good thing is with R and KNIME you could substitute some functions and still use the comfort of the KNIME software.
And you find a lot of examples and code for a lot of problems on the net. The task then is to adapt that to the R Snippet. One way to learn about KNIME might be to take a look at a few examples in my repository or elsewhere on the hub.knime.com starting with kn_example_r_…
Flow variables can be a little bit scary at the beginning, but they have very useful functions and can help a lot to structure the workflow and transport information. There are some good explanations.
And then I would encourage you to just keep using KNIME. There might be several ways to do things in KNIME, and you can always learn on the go. There are a lot of (free) resources to learn about KNIME, and then there is always the forum and hub.knime.com.
I must say that the added value of knime is the human factor, the possibility of coming into contact with experts like you who help others to grow in the knowledge of tools that have the potential that otherwise in this case I would not have been able to exploit, so , first of all, thanks mlauber and thanks knime community!
As I said, I am not a programmer; I am a beginner in R and almost self-taught in KNIME. I am sure that some of my questions are “stupid”, because a better knowledge of R and the R Source node would be enough to get there by myself, but time is short and I have a deadline, so here are my problems:
Set work directory: I can’t get it to work: what should I do?
workpath_r <- knime.flow.in[["context.workflow.absolute-path"]]
workspace_name <- paste0(workpath_r, "/workspace_r.RData")
dataobject_name <- paste0(workpath_r, "/datalist_r.rds")
setwd(workpath_r) # Set work directory
save and load the working environment, saveRDS: how to implement these commands? Is it enough to remove the hashtags?
The first thing would be to make sure R works at all. I wrote a lengthy piece about that; maybe you could check it out.
Then about .rds and .RData. With .rds you can store a single object from R and later reload it (saveRDS() and readRDS()); that is what I used to store the tibble object (“datalist”) and reload it in the second R Snippet. .RData stores the whole active environment (save.image() and load()). I deactivated some lines that are not currently needed with hashtags (#); you can reactivate them by removing the hashtags.
With getwd() and setwd() you read and set the working directory of R. By default that is somewhere in the user’s space, and you can see it with getwd(). In this case I used the folder of the KNIME workflow, but you might change that.
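To make the difference concrete, here is a small sketch using only base R. The object `datalist` is filled with dummy data, and the files go to a temporary folder rather than the workflow folder (both are assumptions for the sake of a harmless example).

```r
# Dummy stand-in for the collected tables.
datalist <- list(t1 = data.frame(x = 1:3),
                 t2 = data.frame(y = c("a", "b")))

# .rds: one object per file; readRDS() returns it, so you choose the name.
rds_file <- file.path(tempdir(), "datalist_r.rds")
saveRDS(datalist, rds_file)
restored <- readRDS(rds_file)

# .RData: save.image() writes all objects of the global environment;
# load() brings them back under their original names.
rdata_file <- file.path(tempdir(), "workspace_r.RData")
save.image(rdata_file)
ws <- new.env()
load(rdata_file, envir = ws)  # loaded into a separate environment here
```

For handing one result from one R node to the next, the .rds route is usually the simpler of the two.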
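A hedged sketch of the working-directory part: inside a KNIME R node the flow variables are available as `knime.flow.in`, while in plain R or RStudio that object does not exist. The fallback to the current directory is an assumption added here so the same lines can be tried outside KNIME.

```r
# Use the workflow folder when running inside KNIME, else the current folder.
workpath_r <- if (exists("knime.flow.in")) {
  knime.flow.in[["context.workflow.absolute-path"]]
} else {
  getwd()  # assumption: fallback for testing outside KNIME
}

# file.path() builds the paths without worrying about slashes.
workspace_name  <- file.path(workpath_r, "workspace_r.RData")
dataobject_name <- file.path(workpath_r, "datalist_r.rds")

setwd(workpath_r)                   # set the working directory
cat("Working directory:", getwd(), "\n")
```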
This would transfer all the collected tables from the docx into a single table but since the structure is different that will not work in this case.
I first updated R (now 3.6.1 64 bit)
then RStudio (now 1.2.5019 64 bit)
then installed docxtractr (0.6.1) (both on R & RStudio) (downloading the package)
then installed Rserve 1.7-3.1 (both on R & RStudio) (downloading the package)
(I can’t install Rserve_1.8-6: “Can not create symbolic link: the requested privilege does not belong to the client”; I tried with > install.packages("Rserve", repos = "http://rforge.net/", type = "source", INSTALL_opts = "--no-multiarch") but I got this:
C:/Program Files/R/R-3.6.1/Rtools/mingw_64/bin/gcc -I"C:/PROGRA~1/R/R-36~1.1/include" -DNDEBUG -DRSERVE_PKG -DWin32 -I. -Iinclude -Iinclude/Win32 -O2 -Wall -std=gnu99 -mtune=generic -c RSserver.c -o RSserver.o
sh: C:/Program: No such file or directory
make: *** [C:/PROGRA~1/R/R-36~1.1/etc/x64/Makeconf:208: RSserver.o] Error 127
ERROR: compilation failed for package ‘Rserve’
restoring previous ‘C:/Users/10487/Documents/R/win-library/3.6/Rserve’
Warning in install.packages :
installation of package ‘Rserve’ had non-zero exit status
then installed Cairo 1.5-10 (both on R & RStudio) (downloading the package)
then uninstalled Knime Extensions
then restarted R, RStudio and Knime
but the R Source (Table) node does not complete the run.
Submitting the code line by line, I get: “there is no package called ‘docxtractr’”, and all the other error messages are related to it (see the screenshot).
Moreover, it seems that root privileges (which unfortunately I don’t have) are necessary to install Rtools in order to make docxtractr work… may I have your opinion?
Not exactly sure what you mean by installed on both R and RStudio. Typically you should only have one R installation with one library. You could check your version’s library with this command: .libPaths()
The error messages hint that you do not have the relevant package docxtractr installed.
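A quick way to check this from inside the R node (or from RStudio, to compare) might be the following base-R sketch. `is_installed` is a hypothetical helper name; the two package names are the ones discussed in this thread.

```r
# TRUE if the package can be found in any library on .libPaths().
is_installed <- function(pkg) nzchar(system.file(package = pkg))

cat("R home:    ", R.home(), "\n")
cat("R version: ", R.version.string, "\n")
cat("Libraries: ", paste(.libPaths(), collapse = "; "), "\n")

needed    <- c("docxtractr", "Rserve")
installed <- vapply(needed, is_installed, logical(1))
print(installed)
```

If the output differs between RStudio and the KNIME node, the two are not using the same R installation or library.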
And you will have to install Rserve in the latest version. One thing I had success with recently is described here:
After you have done that you could tell RStudio to find the Rtools necessary to compile new packages
devtools::find_rtools()
To repeat that step by step:
1. Install the latest stable version of Rtools to the directory c:\Rtools (no fancy path please), and install devtools.
2. Set the environment path to Rtools according to this entry
3. Set the PATH to Rtools in the “.Renviron” file by typing:
usethis::edit_r_environ()
This should open the “.Renviron” file (yes, with a dot at the beginning). There you should be able to enter the necessary PATH:
PATH="C:\Rtools\bin;${PATH}"
4. Run the selection of the path
5. Tell RStudio to find Rtools:
devtools::find_rtools()
This should tell you that Rtools is now active and that you can use it. Now you can try to install Rserve in the latest version again, like this:
install.packages("Rserve", repos = "http://rforge.net/", type = "source", INSTALL_opts = "--no-multiarch")
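Before retrying the installation, it may help to check with base R alone whether the Rtools folder really ended up on the PATH of the running session. This is only a sketch: it assumes Rtools sits in C:\Rtools as described above, and `norm` is a hypothetical helper that makes path comparison less sensitive to slashes and case.

```r
# Assumption: Rtools was installed to C:\Rtools.
rtools_bin <- "C:/Rtools/bin"

# Normalise a path: drop trailing separators, use forward slashes, lower case.
norm <- function(p) tolower(gsub("\\\\", "/", sub("[/\\\\]+$", "", p)))

path_entries   <- strsplit(Sys.getenv("PATH"), .Platform$path.sep, fixed = TRUE)[[1]]
rtools_on_path <- norm(rtools_bin) %in% norm(path_entries)

# Sys.which() returns "" when the program cannot be found on the PATH.
make_location <- unname(Sys.which("make"))

cat("Rtools bin on PATH:", rtools_on_path, "\n")
cat("make found at:",
    if (nzchar(make_location)) make_location else "(not found)", "\n")
```

If `make` is not found, compiling a source package such as Rserve will fail regardless of the install.packages() options.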
this is weird… I installed docxtractr, and it works splitting tables both in R & RStudio… perhaps it is sufficient to call the library to make it work?
I have a problem with this: my status is “poweruser”, not admin, so by now I cannot install anything in c:\ (a request is pending…)
PS: about Rtools I have also this warning:
package ‘Rtools’ is not available (for R version 3.6.1): do I have to downgrade the version?
hmm this mixture of libraries might cause problems. Question is what version KNIME is using and if the package is present there. It might be worth a try to completely remove all R versions and start fresh with just one library (not exactly sure how to force that).
Then Rtools might just be installed without R somewhere on your computer. Unfortunately RTools does not ‘like’ complicated path names (might be some bug). You might try:
Thank you Mlauber, first of all I now have the same library for R & RStudio (it was sufficient to uninstall the previous version of R, 3.6.0, and keep the upgrade, 3.6.1)
.libPaths()
[1] "C:/Users/10487/Documents/R/win-library/3.6"
[2] "C:/Program Files/R/R-3.6.1/library"
Then I uninstalled Rtools from the previous directory (within the path of R) and reinstalled it in C:/Users/10487/Rtools: surprisingly, now it seems to be in C:/Rtools!
Is it possible?
I also downloaded and unzipped DevTools in C:\Users\10487\devtools_2.2.1, but, even though I read the links, I’m not sure I understood points 2 & 3:
If after your measures RTools is there and the installation of RServe is working everything should be fine. The next steps are there to tell R where RTools is.
Enter the path into the environment file, which should be read when R starts. If this is not working, tell an active R session what the path is and try to find Rtools, so that the active session ‘knows’ that Rtools is there. I have no idea why there is such a fuss about this tool and why you can’t just enter the path somewhere and tell it what to use (or maybe there is a way and I have not found it yet).
I just can’t create .Renviron file
PS C:\Users\10487> Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE_1=my_username"
Add-Content : Could not find a part of the path ‘C:\Users\10492\Documents\.Renviron’.
At line:1 char:1
but I tried installing docxtractr and got this message:
package ‘docxtractr’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\10487\AppData\Local\Temp\Rtmpe66WJU\downloaded_packages
After > library(docxtractr) I got this message:
Error: package or namespace load failed for ‘docxtractr’:
package ‘Rcpp’ does not have a namespace
I tried to install Rcpp but this was what I got:
Installing package into ‘C:/Users/10487/Documents/R/win-library/3.6’
(as ‘lib’ is unspecified)
There is a binary version available but the source version is later:
     binary source needs_compilation
Rcpp  1.0.2  1.0.3              TRUE
ERROR: failed to lock directory ‘C:/Users/10487/Documents/R/win-library/3.6’ for modifying
Try removing ‘C:/Users/10487/Documents/R/win-library/3.6/00LOCK-Rcpp’
Warning in install.packages :
installation of package ‘Rcpp’ had non-zero exit status
The downloaded source packages are in
‘C:\Users\10487\AppData\Local\Temp\Rtmpe66WJU\downloaded_packages’
Is there a solution to this?
How can I create the .Renviron file? - or at least make docxtractr work for knime node?
If you want to use R there is no alternative but to keep trying. If there are locks that might be an indicator that another process is using the workspace. Close KNIME and maybe even restart the system.
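If a lock folder is still there after closing everything, the error message itself suggests removing it. A cautious sketch in base R follows; `remove_stale_locks` is a hypothetical helper name, and it should only be run when no other R session or KNIME is installing packages.

```r
# Remove leftover "00LOCK..." folders from a package library.
# Returns (invisibly) the names of the folders it removed.
remove_stale_locks <- function(lib = .libPaths()[1]) {
  locks <- list.files(lib, pattern = "^00LOCK", full.names = TRUE)
  unlink(locks, recursive = TRUE)
  invisible(basename(locks))
}

# Usage idea (not run here):
# remove_stale_locks("C:/Users/10487/Documents/R/win-library/3.6")
# install.packages("Rcpp", type = "binary")  # binary avoids compiling from source
```

Asking for the binary version sidesteps the need for Rtools for this particular package, at the price of getting 1.0.2 instead of 1.0.3.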
To get .Renviron you need
usethis::edit_r_environ()
That should open the file and let you edit it. But you would have to make sure that you are allowed to write into your directory.
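If usethis is not available, the file can also be written with base R alone. This is a sketch under the assumption that a user-level .Renviron in R's home directory (what `path.expand("~")` reports) is the one R reads at startup; `add_renviron_line` is a hypothetical helper name.

```r
# Append one "NAME=value" line to an .Renviron file (created if missing).
add_renviron_line <- function(line,
                              path = file.path(path.expand("~"), ".Renviron")) {
  cat(line, "\n", sep = "", file = path, append = TRUE)
  invisible(path)
}

# Where R would look for the user-level file:
cat("User .Renviron:", file.path(path.expand("~"), ".Renviron"), "\n")

# Usage idea (not run here):
# add_renviron_line('PATH="C:\\Rtools\\bin;${PATH}"')
# readRenviron("~/.Renviron")  # apply to the running session without a restart
```

The write still needs permission for that folder, so it will fail in the same way if the directory is not writable.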
However I fear to have to stop… windows powershell says “no”…
PS C:\Users\10487> Add-Content c:\Users\$env:USERNAME\Documents\.Renviron "TEST_VARIABLE_1=my_username"
Add-Content : Could not find a part of the path ‘C:\Users\10492\Documents\.Renviron’.
At line:1 char:1