Clustering geo data by user and by date

Hello,

I have a CSV file with 4 columns: user (string), date (integer, in fact it's multiple columns), latitude (numeric), longitude (numeric). Each row corresponds to a photo location, taken by a certain user on a certain day.

I want to reduce my data by only counting 1 photo per area per user per day. If a certain user took 100 photos in one place and 5 in another, I'd like to have 2 locations linked to that user and that day, so this user doesn't weigh too much in my data.

I guess it's possible to separate my data by user & day, then cluster within each subset, take the centroids and join everything back together, but I don't know how to do that. I have more than 10 years & thousands of users in this data.

I'm new to KNIME and to forums in general.
Thanks for the help

PS: If you have a way to weight each photo by (1 / number of photos taken by this user that day) and cluster with a weight, I'll take that too.

Hi @Anaworfis and welcome to the KNIME forum!

This sounds like an interesting problem and your description is crystal clear. However, would it be possible for you to share a bit of data here (enough for a meaningful clustering) so that we can work with it and suggest solutions? I'm not saying it is strictly necessary in this case, but it would greatly help us to quickly set up and share a solution with you. Looking forward to helping :slight_smile:

Kind regards,

Ael

Hi aworker,

thanks for your answer. Here are some data:

photo_id ; user ; day ; month ; year ; latitude ; longitude
1 ; 1 ; 5 ; 12 ; 2021 ; 45.752 ; 4.828
2 ; 1 ; 5 ; 12 ; 2021 ; 45.753 ; 4.828
3 ; 1 ; 5 ; 12 ; 2021 ; 45.752 ; 4.827
4 ; 1 ; 5 ; 12 ; 2021 ; 45.931 ; 4.531
5 ; 1 ; 5 ; 12 ; 2021 ; 45.930 ; 4.532
6 ; 2 ; 5 ; 12 ; 2021 ; 45.752 ; 4.828
7 ; 1 ; 6 ; 12 ; 2021 ; 45.752 ; 4.828
8 ; 1 ; 6 ; 12 ; 2021 ; 45.753 ; 4.828
9 ; 1 ; 6 ; 12 ; 2021 ; 45.930 ; 4.532
10 ; 2 ; 6 ; 12 ; 2021 ; 45.752 ; 4.828

What I want from these data:

user ; day ; month ; year ; latitude ; longitude
1 ; 5 ; 12 ; 2021 ; 45.753 ; 4.828 (center of a cluster of 3 photos taken in the same place, by the same user on the same day)
1 ; 5 ; 12 ; 2021 ; 45.931 ; 4.531 (2 photos)
2 ; 5 ; 12 ; 2021 ; 45.752 ; 4.828 (same place and day, but another user)
1 ; 6 ; 12 ; 2021 ; 45.753 ; 4.828 (2 photos, another day)
1 ; 6 ; 12 ; 2021 ; 45.930 ; 4.532
2 ; 6 ; 12 ; 2021 ; 45.752 ; 4.828

Kind regards,

Ana

Hi Ana,

Thanks for sharing a bit of data. Grouping your data based on "user ; day ; month ; year" would not be a problem and a -GroupBy- node should easily do the job, because these values are discrete.

However, this is not the case for GPS coordinates, which are continuous data. Thus the question is: do you have a rationale for deciding when two "different GPS coordinates" should be considered "different locations"? For that, one needs to define a distance threshold beyond which two "different GPS coordinates" count as "different locations".

This can be handled either using a predefined ad hoc threshold (set by yourself) or using a threshold estimated from data.

The former is the easiest way, the latter is a bit more involved :wink: Which one would you be aiming for? Would the first solution be "good enough" for you to start with? Which distance threshold would be reasonable in this case?
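To make the ad hoc threshold idea concrete, here is a minimal sketch in Python (assuming the geopy package is available; the 200 m cutoff is just an illustrative value, not something from the thread): two coordinates count as the same location when their great-circle distance is below the chosen cutoff.

```python
# Minimal sketch of the "ad hoc threshold" idea: two GPS coordinates are
# treated as the same location when their great-circle distance is below
# a user-chosen cutoff.
from geopy.distance import geodesic  # assumes geopy is installed

THRESHOLD_M = 200  # hypothetical cutoff in metres, to be tuned

def same_location(coord_a, coord_b, threshold_m=THRESHOLD_M):
    """coord_a / coord_b are (latitude, longitude) tuples in decimal degrees."""
    return geodesic(coord_a, coord_b).meters < threshold_m

# Example with two points from the sample data above:
print(same_location((45.752, 4.828), (45.753, 4.828)))  # True, roughly 111 m apart
print(same_location((45.752, 4.828), (45.931, 4.531)))  # False, roughly 30 km apart
```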

I hope I’m clear enough :slight_smile:

Best regards,

Ael

Hi,
I would go with the GroupBy solution as @aworker mentioned. Then use the mean for the coordinates? Maybe I misunderstood the question.
br

Hi @Daniel_Weikert

Just to make my questioning/reasoning a bit clearer: the same user could, on the same "day, month & year", be at two completely different locations in the world, which would make the "mean coordinates" meaningless. This is an extreme example, but how should the locations of two neighbouring monuments, photographed by the same "person, day, month & year", be treated? The photos could be completely different. That's why a distance threshold may need to be set :wink:

Best

Ael

Thanks for elaborating on this. You are right, the easy solution would not work then. I assumed the photographer takes the photos in the same spot (same location) for the day.
br

Thanks for your replies.

The problem Ael pointed out is exactly what I'm trying to solve.

I'll go for the first solution, but can you explain a bit more how to define a distance threshold and use it in a GroupBy? I saw that a few nodes like "Numerical distances" and "geo distances" could be useful, but I'm not sure about that.

Thanks a lot

Ana

Hi Ana

I'll explain and upload a possible solution here tomorrow when I'm back in the office :slight_smile:

Good evening
Ael

Alright !

I also found out about the Group Loop Start node. So what about: Group Loop Start (user and date) → k-means (lat and long) → Loop End? There are still some problems, but I also might be a bit tired.
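For reference, a rough Python analogue of that loop (a sketch only, not a KNIME workflow: pandas and scikit-learn assumed, column names and file name taken or guessed from the sample data above, and DBSCAN with a ~200 m radius swapped in for k-means, since the number of clusters per user and day is not known in advance):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000
THRESHOLD_M = 200  # hypothetical "same place" radius, to be tuned

df = pd.read_csv("photos.csv", sep=";")  # hypothetical file name

parts = []
for keys, group in df.groupby(["user", "day", "month", "year"]):
    # Cluster this user/day's photos by location; eps is in radians because
    # the haversine metric expects [lat, lon] in radians.
    coords_rad = np.radians(group[["latitude", "longitude"]].to_numpy())
    labels = DBSCAN(eps=THRESHOLD_M / EARTH_RADIUS_M, min_samples=1,
                    metric="haversine").fit_predict(coords_rad)
    # Keep one row per cluster: the centroid of its photos.
    centroids = (group.assign(cluster=labels)
                      .groupby("cluster")[["latitude", "longitude"]]
                      .mean()
                      .reset_index(drop=True))
    for col, val in zip(["user", "day", "month", "year"], keys):
        centroids[col] = val
    parts.append(centroids)

reduced = pd.concat(parts, ignore_index=True)
print(reduced)
```

With min_samples=1 every photo belongs to some cluster, so no k has to be chosen per group; the per-group means play the role of the cluster centers in the expected output above.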

Best
Ana

Hi,

Find attached a workflow as a starting point. Your location data has 3 decimal places, so the coordinates are only precise to about 111 m (Decimal degrees - Wikipedia).
The idea: sort the data and then calculate the distance between consecutive coordinates (the node uses the Haversine formula: Haversine formula - Wikipedia; the result is in km, so if you want it in m you have to change r from 6371.00 to 6371000.00). If the distance is less than 200 m (you can change this in the last Rule Engine node), the places are grouped together.
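For readers who want to try this outside KNIME, a minimal Python sketch of the same sort-then-consecutive-distance idea (a sketch only, not the attached workflow; pandas/numpy assumed, column names, the Earth radius and the 200 m cutoff taken from this thread):

```python
import numpy as np
import pandas as pd

R_KM = 6371.0  # Earth radius; use 6371000.0 instead to get distances in metres

df = pd.read_csv("photos.csv", sep=";")  # hypothetical file name
df = df.sort_values(["user", "year", "month", "day", "latitude", "longitude"])

# Haversine distance from each photo to the previous one of the same user/day
prev = df.groupby(["user", "year", "month", "day"])[["latitude", "longitude"]].shift()
lat1, lon1 = np.radians(prev["latitude"]), np.radians(prev["longitude"])
lat2, lon2 = np.radians(df["latitude"]), np.radians(df["longitude"])
a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
df["dist_km"] = 2 * R_KM * np.arcsin(np.sqrt(a))  # NaN for the first photo of each group

# Start a new place whenever the gap to the previous photo exceeds 200 m (0.2 km)
df["new_place"] = df["dist_km"].isna() | (df["dist_km"] > 0.2)
df["place_id"] = df["new_place"].cumsum()
```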

GroupByDistance.knwf (57.7 KB)

Hope it helps you in some way

Hi,

It works very well! Coupled with a Reference Row Filter, I managed to get what I wanted.

Thanks a lot to all of you !

Ana

I'm sorry to rain on your parade, but this does not work for locations that have latitudinal overlap. Due to the sorting, clusters with the same latitude will get their rows intermixed, resulting in faulty clustering. This is because the sorting behaves somewhat like a projection. I made a malicious sample consisting of 2 clusters, 5 rows each, that get sorted into 6 clusters. :frowning:

@Anaworfis Do you happen to have the time when the pictures were taken? We could sort by that instead to solve the issue. If not, I'd be curious what @aworker has in mind.

Note for future me: I have no experience with distance matrices, but we need to compare each coordinate with all other coordinates (within the same date and user). Either a recursive loop or some Java Snippet magic. It would be better if someone comes up with something less shady, like extensions that are built for such a task.
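One possible shape of that all-pairs comparison, sketched in Python rather than KNIME (scipy/pandas assumed, column names from the sample data; single-linkage hierarchical clustering with a distance cutoff plays the role of the pairwise comparison, and the cutoff value is only illustrative):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

EARTH_RADIUS_M = 6_371_000
THRESHOLD_M = 200  # hypothetical cutoff

def haversine_m(u, v):
    """Great-circle distance in metres; u and v are (lat, lon) in radians."""
    dphi, dlmb = v[0] - u[0], v[1] - u[1]
    a = np.sin(dphi / 2) ** 2 + np.cos(u[0]) * np.cos(v[0]) * np.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def label_places(group: pd.DataFrame) -> np.ndarray:
    coords = np.radians(group[["latitude", "longitude"]].to_numpy())
    if len(coords) == 1:
        return np.array([1])
    dists = pdist(coords, metric=haversine_m)   # all pairwise distances in the group
    tree = linkage(dists, method="single")      # single-linkage clustering
    return fcluster(tree, t=THRESHOLD_M, criterion="distance")

df = pd.read_csv("photos.csv", sep=";")         # hypothetical file name
# place ids restart at 1 within each user/day group, which is fine because
# they are always read together with the grouping columns
df["place_id"] = (df.groupby(["user", "year", "month", "day"], group_keys=False)
                    .apply(lambda g: pd.Series(label_places(g), index=g.index)))
```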


GroupByDistance.knwf (63.3 KB)

I'm also using the hour at which the photo was taken. And I used 3-digit precision for lat and long in my example because that's what KNIME was showing, but I knew it was more precise. In fact I have 6 digits after the decimal point. Do you think the problem will still happen?

Best
Ana

The precision of the coordinates does not matter for that problem. Assuming no photos are taken around midnight, the use of the date is fine. But also using hours means that photos from the same location shortly before and after the full hour will be put into two separate clusters. It mitigates the problem I found in andrejz’s solution, but does not remove it.

Putting the entire date&time information into a Date&Time column gives a nice continuous column that we could use to sort the photos chronologically. People cannot teleport (yet), so this should work. Using date + hours might be fine, but I’d like to include the minutes as well if you have them.
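A minimal pandas sketch of that step (the hour and minute column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("photos.csv", sep=";")  # hypothetical file name
# pd.to_datetime assembles a timestamp from year/month/day/hour/minute columns
df["timestamp"] = pd.to_datetime(df[["year", "month", "day", "hour", "minute"]])
df = df.sort_values(["user", "timestamp"])
# ...then compute the distance of each photo to the previous one of the same
# user in this chronological order, instead of the lat/long sort order.
```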

I do have the minutes, so I’ll do that ! Thanks

Hi,

Here is another version of the workflow you can work on

GroupByDistance II.knwf (67.8 KB)

Regards

Sorry to be the grinch again, but that solution does not work with the malicious sample. It produces 6 clusters with a total of 24 (!) photos. Another note for future people: Cross joins on big tables will crush your hardware, as they square the number of rows. n rows go in, n^2 rows come out. Not saying it’s bad, but we have millions of rows.

I also found two problems with my idea of sorting photos chronologically, both emerging from the assumption that users do not go back to a previous location.

  1. If a user takes photos on location A, then B, then A again, all within the same day, it will result in 3 separate clusters. This might be intended behaviour though.
  2. If a user takes photos on location A, then a day later again on A, it will result in a single cluster. This has been addressed by adding a time difference column in addition to the spatial distance. Photos taken more than 6 hours apart will be assigned to different clusters (a small sketch of this follows the list). WF below.
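A small sketch of that time-gap condition, continuing the chronological-sort sketch above (pandas assumed; it presumes the table is sorted by user and timestamp and already has a dist_km column holding the Haversine distance to the previous photo of the same user; the thresholds of 200 m and 6 hours are the ones discussed here):

```python
import pandas as pd

# time gap to the previous photo of the same user (NaT for the first photo)
time_gap = df.groupby("user")["timestamp"].diff()

df["new_place"] = (
    df["dist_km"].isna()
    | (df["dist_km"] > 0.2)                  # more than ~200 m away, or
    | (time_gap > pd.Timedelta(hours=6))     # more than 6 hours later
)
df["place_id"] = df["new_place"].cumsum()
```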

One more thing: using this method also orders the photos spatially. If a user takes photos while going down a street, ends up farther away than the spatial threshold, and then comes back to the start, the later photos will be assigned to a different cluster. This is probably not intended, but might be OK if the required resolution is not too fine-grained. @Anaworfis can you work with that? If not, I'll have a go at the more sophisticated solutions I mentioned earlier.
GroupByDistance3.knwf (192.6 KB)

Hi @Thyme

I agree with you that @andrejz's solution may not cover all the possible scenarios, but I also believe that it provides a convenient, generic and easy-to-understand solution. @Anaworfis's problem is not as easy to solve as it may seem, if one wants to tackle all the possible "border effects" given the nature of the problem and the data. The solution I started to implement is much too complex. I guess @andrejz's solution is good enough for @Anaworfis's requirements since she has validated it, but again, your comments are wise and make sense :slight_smile:

Best

Ael

Hi,

I know and agree that my solution is not ideal and does not work for all cases, but it can be a starting point towards a universal solution.

The workflow, run on the malicious sample that @Thyme prepared, can give the correct result. The problem is that some places are more than 200 m from each other: if you change the threshold in the Row Splitter node from 0.2 to 0.3, it gives the correct result (I do not know why it doesn't work well with 0.2 … I will work on it).

Regards
