I try to load Norwegian climate data into a PostGIS installation via a staging layer. I have the data in text format but run into problems with data width as can be seen from the title using CSV reader, and DB insert and DB loader respectively. I thought about several solutions:
- As the data is spars, I would try the CSV reader to replace the missing data string (-9999) by null hopefully reducing the size of a single record enough to avoid the error. Problem: I didn’t find any option to do so.
- Make the CSV reader to interpret the data as string type hoping to reduce the size of a single record hoping only the actual size of the string gets transferred instead of a “big” type. Problem: I didn’t find any option to do so.
- A combination of both with the same problems.
- Use a file reader and make the column type a string. Problem: Its 1195 columns I would have to change, and I am reluctant to fiddle the respective XML file in the workspace directory.
- Use a file reader and replace missing data value (-9999) by null. Problem: It does not do it. Maybe because of non-identical types, but I failed to make them identical.
- Use R as proposed in Force KNIME to Read CSV Columns As Strings however R is new to me, so I am reluctant to adopt it. If taking R, I might load NetCDF data in the first place just as well as insinuated in Importing NetCDF in Knime with an R snipped -> output multible tables and Merge two time series with different frequencies
- Split the data flow columnwise and load the streams separately. Shortcomings: It feels inelegant as it would involve insert the first half of the columns to be updated by the second half provided a key can be found.
Has somebody ideas to my problem(s)?
Load_ASC_Data.knwf (101.5 KB)
Data excerpt and config file first_seven.zip (451 Bytes)
First I would try to figure out the source of the error.
Using google I found this:
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page. For example, excluding the tuple header, a tuple made up of 1600
int columns would consume 6400 bytes and could be stored in a heap page, but a tuple of 1600
bigint columns would consume 12800 bytes and would therefore not fit inside a heap page.
So as far as I understand this, if you have too many numeric* columns that are not integers but something that takes more space, then your rows take up too much space for postgresql. * text columns can be stored outside of the row.
No, it’s just too many columns. If we suppose that the average width of
your numeric columns is 12 bytes, then you need 790*12 = 9480 bytes,
which doesn’t fit on an 8K page even without any accounting for row
header and page header overhead. TOAST can’t help by pushing values
out-of-line, because a TOAST pointer is 18 bytes so it’d actually make
things worse. (TOAST works well for individual large fields, but not
at all for this case.)
You can get away with 790 columns if they’re 4 or 8 bytes apiece, or if
many of them are NULL, but evidently these particular numeric values
average more than that.
In practice, though, there are a lot of other inefficiencies with
tables that have so many columns. So I counsel looking into arrays
or some other approach.
Given this you will need to check if you really need all these columns and potentially if the data can and needs to be normalized. At least you will need to split the columns over 2 tables or but multiple of them together in an array column.
tl;dr: This is a limitation of PostgreSQL and has nothing to do with KNIME and is simply due to having too many columns.
Thank you for your reply. I am very aware that my problem is not KNIME but the amount of data, I was trying to load into PostgreSQL with one single row. Reducing this amount is the purpose of my solution proposals one through seven. But I see you have come forward with an eighth. I really could split the stage table in two and join them in a later process. I am not sure whether I like this more than my seventh. I need to think about it. Well, probably yes. It is better with respect to performance as an update as proposed in solution seven is a costly operation. Your proposal needs more space in the database, at least temporarily.
depending on the actual datatype setting a couple of column to null won’t help and having that many columns tells me something is wrong with the design of the table/database anyway or maybe it’s a use case for something NoSQL. Do you need to join or aggregate the rows?
Well, flaw would be then to try to read climate text raster data I have got from senorge.no into PostGIS. However, but to the nature of raster data on a country like Norway as I understand it, not only a a couple of values actually are null but man of them. I want to connect the climate data to named places so I would need to join to OSM data. Aggregation is not a Topic for now.
You could try these extensions although they seem to be marked beta it could be worth a try.
Then how about you give R a chance since the code should not be that complicated and the example basically has it. And from a little google search it is very possible you will need R to load your netcdf data.
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.