Hello, Knimers,
I’m facing a problem that should be very easy to solve: I need to automatically download a series of (monthly) files from the same server, choosing files by inserting “start date” and “end date” of an arbitrary interval. These files are open and contain public data of the Brazilian national health system (from the servers of DataSUS, its IT Department). Such files are available from either one of two different sources: a) a “transfer page” from DataSUS, from where different (either individual or multiple) subsets of data can be selected using a series of six drop-down menus (at < Transferência de Arquivos – DATASUS >), but up to this moment, I only know how to do it manually; or b) an FTP server (where: Protocol: “ftp://”; Host: “ftp.datasus.gov.br”; Path: “/dissemin/publicos/SIASUS/200801_/Dados/”; Generic File name: “PARS2403” (where “PA” stands for the subset of data; “RS” is the particular Brazilian state (from a total of 27), “24” represents the year (2024), and “03” is the month (March)); File extension: “.dbc” (a proprietary format for compressing microdata files); and so, this full generic url becomes: ftp://ftp.datasus.gov.br/dissemin/publicos/SIASUS/200801_/Dados/PARS2403.dbc
Besides, Brazilian Health Ministry provides a free software (TabWin - that uses a graphic interface and is available at the same “trasfer page” cited above) to decompress DBC files (to DBF format) and to convert them into CSV format.
So, the (yet to be solved) task is: how could I automate the download of a series of X files using a workflow in Knime?
After trying (without success) several options (including “FTP connector + Transfer Files”; “GET Request + Binary Objects to Files”; etc.), I have been trying to use the following sequence of nodes:
Create Date&Time Range (for defining Start and End Dates, in the ‘Date&Time’ data type);
Cell Splitter By Position (to generate separate String columns for Year (YY), Month (MM) and Day (DD), all with 2 digits;
String Manipulation (to gather ‘data subset’, ‘state’, ‘year’, ‘mont’, and so to compose each file name in a row);
another String Manipulation (to gather ‘protocol’, ‘host’, ‘path’, ‘file_name’, and ‘file_extension’);
a Table Row to Variable Loop Start (to define ‘files’ and ‘urls’ as flow variables);
External Tool (to access TabWin from inside Knime - but I could not reach to a full configuration of this node with TabWin operating from inside it);
a Loop End node to run all the loops for donwloading all files.
Here is an image of how far could I reach:
As you can see, the functionality of my workflow stopped at the External Tool node…
Can someone help me? Any alternative sequence of nodes, any functional workflow fits equally well to my needs (except using “R” or “Python”, which are currently far beyond my skills and computational resources).
Thank you for all your support. I really appreciate it.
Looking forward to your esponse.
Rogério.
@rogerius1st it would be best if you could provide the workflow you have already done. Also from your description it is difficult to get a complete idea how the URLs are constructed. Is there a list of all the options that might occur in a selection if you want this to be flexible or to iterate over it.
Also what is the role of the “200801” in the URL. Is this fixed (like SIASUS)?
And again I am not sure if it will be feasible or easy to access TabWin with external tools or decompress/decode your files with a script. Also from what I saw on the site. Some information seems to be stored (just?) as tab delimited files. Might that be an alternative to the DBC files?
Hello, Markus. Thank you once again for trying to help me with my issues…
Detailing the answers:
a) Here is the workflow (exported, no reset) including all data. WF to MLauber - 20250530 with data.knwf (31.9 KB)
b) on the question " Is there a list of all the options that might occur in a selection if you want this to be flexible or to iterate over it.“, the answer is “Yes”. There are two options for selecting all files: b.1) manually doing the selection from the “transfer page” with a graphic interface (under the provided link - Transfer page). This one wasn’t my first intention; indeed, I asked for help to automate this search and selection and downloading through Knime → I mentioned a series of six drop-down menus in the “transfer page” (with the provided link). These sequential options are: 1st menu (“Fonte”): 11th option (“SIASUS …”); 2nd menu (“Modalidade”): 2nd option (“Dados”); 3rd menu: 10th option (“PA - Produção Ambulatorial …”); 4th to 6th menus: respectively the variable year, month and state (“UF”). BTW, the first three options are fixed and to be selected just once, the the last three are variables, according to the desired year (either individual or multiple selection). And you were right, this is the point I wanted it to be flexible. b.2) the second path for accessing the files is through an FTP server, via the (variable) url that I mentioned (and tried to apply in my workflow. This “second option” would be necessary just because I don’t know (yet) how to select (and download) one or more files through a graphic interface. If Knime offers this possibility, I would greatly appreciate this teachings (assuming it would not require line commands written in Pythor or R, for I could not manage to install (and make it functional) any of them in my computer. We’ve already been touching the same subject before, but I could not reach this installation). Instead, line commands written in JavaScript would be acceptable, for there is no need for any further instalation within Knime, using the “Java Snipped” node (if there is no other option”…).
c) in the mentioned url, the role of “200801” is the (fixed) directory name (standing for the year and month (YYYYMM) when the new storage system has begun, which means all data since January 2008), and “PARS2403” is the (variable) name for a file containing all data regarding March, 2024. All data files of my interest are between PARS1701 and PARS2503, gathering 99 monthly files. Indeed, until I finish my research, if the next file (with data related to April, 2025) is released, there will be a number of 100 monthly files.
d) as aforementioned, I could not apply your suggested workflow, due to my current limitations while trying to install and run Python/Conda packages. But I thank you for your kind efforts.
e) AFAIK, “.DBC” files are the only option for access to these data (from PA on DataSUS).
f) Besides, I only thought in the use of the “External Tool” node because I found no other functional option.
Thank you once again, Markus.
Best regards,
Rogério.
@rogerius1st first step of you want an systematic download of several files would be to gather all the technical parts of the URL in a at structured way. Maybe an Excel file.
So knime would be able to iterate over the elements and construct the URLs from there.
I will see if I can take a look at the workflow later.
Concerning the use of Java Script. This most likely will not ‘solve’ your problem with programming since if there actually is a function to decompress DBC files in java it would involve installing that.