Using Get Request

Can anyone explain how to use a GET request to obtain data from the web? I am trying to obtain local postal codes by street name. I have two large data sets (over a million rows each). One data set already has a postal code column; the other only has street names.

  1. How do I use a GET request to find the postal code for street names in a column?
  2. Is it smart to break up the data into smaller pieces through sampling or partitioning? Or can I get the information for the entire dataset, and how quickly can it be processed?
  3. Should I use any specific request/response headers?

Below is the workflow that I am currently working through.

Link to data:

hi
You enter the URL into the Get Request node and it returns the content of the page if the request was successful (status code 200).
For geo data you probably want to query an API (e.g. from Google Maps). You might want to have a look at that to get started.
br
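
For readers who want to see the same idea outside KNIME, here is a minimal Python sketch of "send a GET request and use the body only when the status code is 200". The function name `fetch_text` and its `opener` parameter are hypothetical, chosen for illustration; they are not part of KNIME or of any particular API.

```python
import urllib.request


def fetch_text(url, opener=urllib.request.urlopen):
    """Send a GET request and return the body as text if the status is 200.

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    function can be tried without network access.
    """
    with opener(url) as response:
        # urlopen already raises an error for 4xx/5xx responses; this check
        # additionally rejects any other status that is not exactly 200.
        if response.status != 200:
            raise RuntimeError(f"Request failed with status {response.status}")
        return response.read().decode("utf-8")
```

Calling `fetch_text` with the URL of an API endpoint would return the raw response text, which then still has to be parsed.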


Thanks Daniel,
I get a load of information in the description. It doesn't seem to have the information I am looking for.

Hi @MichaelEkine and welcome to the Knime Community.

Based on your questions, it is not clear if you fully understand what you are asking, and that makes it hard to answer them.

Take this question for example: “How do I use get request to find the postal code for street names in a column?”

That’s like asking “How do I use the street to get to the supermarket?”. Well, you can go on foot or in a vehicle.

So in your case, you need either an API or a website, and you enter its URL in the Get Request node, as @Daniel_Weikert mentioned.

As for whether to break up the data and how quickly it can be processed: it depends on how many customers the supermarket can take and how fast they can serve them.

In your case, it depends on how many requests the API/website allows you to send in a given period of time, and how fast it can process each request. Generally speaking, you should send requests in batches (Chunk Loop) with a delay (for example 5 seconds) between them (Wait node).
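
The batch-and-wait idea can be sketched outside KNIME as follows. This is only a rough Python illustration, not the Chunk Loop or Wait node itself: `process_in_batches`, `handle_batch`, and the default batch size of 100 are hypothetical, and the 5-second delay simply mirrors the figure mentioned above.

```python
import time


def process_in_batches(rows, handle_batch, batch_size=100, delay_seconds=5.0,
                       sleep=time.sleep):
    """Call handle_batch on consecutive slices of rows, pausing between them.

    `sleep` is a parameter only so the pause can be replaced when trying
    the function out.
    """
    results = []
    for start in range(0, len(rows), batch_size):
        if start > 0:
            sleep(delay_seconds)  # wait before every batch except the first
        results.extend(handle_batch(rows[start:start + batch_size]))
    return results
```

Here `handle_batch` stands for whatever sends the requests for one batch and returns one result per row.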

As for headers: this is specific to the supermarket. It’s like asking “can I wear sandals?” or “can I bring my pet?”

So in your case, it depends on what the API says. They would specify this in their documentation.


@bruno29a I really like your metaphor. Well explained!
@MichaelEkine Bruno mentioned a great tip at the end. The best way to figure out what you need is to go to the API documentation directly. The docs usually tell you exactly what you need (headers, params, …).
br and take care


Hi @bruno29a
I found your response a bit heavy-handed and not helpful. Here are the reasons why:


  1. As you noted, I am new to the KNIME community. I found out about KNIME through a school course. I have a project, and I am trying to build capacity and understand how to use KNIME to perform analysis. I have no technical background in coding or programming, though I do have some academic experience with statistics. So when I ask a question, it's because I have an idea of what I am trying to do.
    I tried @Daniel_Weikert's suggestion of using Google APIs, and I have spent time trying to learn and understand the technicalities of using APIs to get my answer. I have two sets of data:
    a) COVID data that has locations based on the “FSA” column (see below, and take note of the row size)
    image
    b) parking ticket data based on the “location2” column (see below, and again take note of the row size)
    image

So I needed to find a way to get postal codes for the locations, so that the two datasets have a common field for analysis.


  2. With regards to this, my initial attempts to use the API on the parking tags data seemed to take over 8 hours. I literally saw the progress bar not move for 8 hours (I started in the evening, and by the next morning there were errors). Hence my questions.


  3. I tried looking on YouTube for tips on using Get Request (this video, to be exact: Data Access with KNIME: Sending a GET Request to a REST service - YouTube), and I thought maybe my request or response headers might be the issue.

Below is my attempt at following the example in the youtube video
image

I limited the rows in the 2019 parking tags data to 5000
image

Added a constant value node
image

That created a new column using a generic address

used a replace feature
image

this was the result
image

used another one to replace the constant value with the values in the "FSA" column and get rid of the spaces
image
image

This is the error when i run the get request
image

image

This is the sample from the internet and the result I am trying to get
image

I am honestly trying to learn. I enjoy the learning, but I do have a deadline on the project and I want to do a good job.
I would appreciate any tips on getting a successful result.
Thanks


Hi @MichaelEkine , I am sorry if you took offence.

The questions that you asked were not specific to KNIME. Whether or not you are new to KNIME, they are more about web requests, and it sounded like you were not asking the right questions. That is why I tried to come up with an analogy that you could relate your questions to, thinking that it could help you understand.

I tried using the analogy, and also answer your questions at the same time, hence the “So in your case” parts.

I am sorry if they were not helpful, and that you felt they were heavy handed.

If you don’t mind one last piece of advice from me (the last one, because I don’t want conflicts): do not share your API key. It appears in the screenshots. Any request that uses that API key will be attached to your account, and if someone has bad intentions, they can abuse Google using that key, and YOU will get in trouble, not them.

While you did try to obfuscate the key, it’s visible in the Browser Result.

EDIT: I went back and read carefully all that you wrote. Your overall approach is very good actually.
Just some tips:

  1. Using generic address and Replace: Good approach, but instead of a generic address (3040 silverthorn…), you can use a placeholder, like “##ADDRESS##”, so your url template would be something like https://maps.googleapis.com/maps/api/geocode/json?address=##ADDRESS##&region=CO&key=<your_key>, and then use the placeholder for the replace:
    replace($FSA$, "##ADDRESS##", $location2$)

  2. Replacing space with % in the url: There is a urlEncode() function that should do the trick for you. You can even do it in the same replace statement from #1 above:
    replace($FSA$, "##ADDRESS##", urlEncode($location2$))

  3. Processing the result: The result that you are getting is in JSON format. You can look into JSON Path, which can help you extract specific values from JSON data.
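
The three tips above can be put together in a short Python sketch, for anyone who wants to see the logic outside KNIME. It is an illustration under assumptions, not the KNIME expressions themselves: `build_url` and `extract_postal_code` are hypothetical helper names, the address is made up, and the sample response is hand-written to resemble a geocoding reply with a `results` list whose entries carry `address_components` tagged by `types`. Check the real response shape against the API documentation.

```python
import json
from urllib.parse import quote_plus

# URL template with a placeholder, as suggested in tip 1.
URL_TEMPLATE = (
    "https://maps.googleapis.com/maps/api/geocode/json"
    "?address=##ADDRESS##&region=CO&key=<your_key>"
)


def build_url(template, address):
    """Replace the ##ADDRESS## placeholder with the URL-encoded address."""
    return template.replace("##ADDRESS##", quote_plus(address))


def extract_postal_code(response_text):
    """Return the first postal code in a geocoding-style JSON body, or None."""
    data = json.loads(response_text)
    for result in data.get("results", []):
        for component in result.get("address_components", []):
            if "postal_code" in component.get("types", []):
                return component.get("long_name")
    return None


# Hand-written sample (an assumption about the response shape, not real output).
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "Example St", "types": ["route"]},
            {"long_name": "A1A 1A1", "types": ["postal_code"]},
        ]
    }],
})

print(build_url(URL_TEMPLATE, "123 Example St"))
print(extract_postal_code(SAMPLE_RESPONSE))
```

The nested loop plays the role that a JSON Path query would play in KNIME: it walks the structure and picks out the one value of interest.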


Hi @bruno29a
I appreciate the apology. I didn't take offense. Thank you for helping out.

I was using the Get Request node and I didn't know if I was using it correctly.
I honestly tried to see it from your perspective, and hence I provided as much detail as I could. It made me review the YouTube video multiple times in an attempt to understand the node and provide details to the exam.

Thanks for the advice on the key. It's not my key; it's an example I saw online.
I will attempt your suggestions and let you know as soon as possible.

Thank you so much.
I really appreciate it :slight_smile:

All good @MichaelEkine .

FYI, the encoding for a space is actually %20, not just %, so you should replace spaces with “%20”.

urlEncode() would actually replace them with a “+”, which should also work. Both “+” and “%20” are acceptable for a space in the query string of a URL.
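
To see the two encodings side by side, here is a small Python sketch using the standard library (the address is a made-up example). It shows Python's own functions, not KNIME's urlEncode(), but the “%20” versus “+” behaviour is the same idea.

```python
from urllib.parse import parse_qs, quote, quote_plus

address = "123 Example St"  # hypothetical address

percent_encoded = quote(address)    # spaces become %20
plus_encoded = quote_plus(address)  # spaces become +

# Decoding either form as a query string gives back the same address.
decoded_percent = parse_qs("address=" + percent_encoded)["address"][0]
decoded_plus = parse_qs("address=" + plus_encoded)["address"][0]

print(percent_encoded)
print(plus_encoded)
print(decoded_percent == decoded_plus)
```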

One last thing that I forgot to include in my previous post is about “break up data to smaller size”. As I “explained”, this is more about how many requests the Google API allows you to send per “X” amount of seconds/minutes.

They set a limit because too many requests could overload them, and also to protect themselves against attacks that do this deliberately. They would usually specify how many requests you can send in a given period of time.

The good thing here is that Knime’s Get Request has the options to set that up, so you don’t have to manually implement that (though you still could, through a Chunk Loop).

That’s where the Delay and Concurrency come in:
image

You can specify how much delay you want between the requests. You specify how many milliseconds (1 second = 1000 milliseconds, so 5 seconds = 5000 milliseconds) you want to wait, and Concurrency is how many requests you want to send at a time.

So, let’s say Google says you can send 1000 requests at a time and you have to wait 3 seconds in between. You would set the options like this:
image
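
To get a feel for "how quickly can it be processed", here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model (send a wave of requests, pause, send the next wave) and reuses the made-up limits from the example above (1000 requests at a time, 3 seconds in between). These are not Google's actual quotas, the node's Delay setting may behave differently, and the time the requests themselves take is ignored. `minimum_wait_seconds` is a hypothetical helper name.

```python
import math


def minimum_wait_seconds(total_rows, requests_per_batch, delay_seconds):
    """Total time spent pausing between batches, ignoring request time itself."""
    batches = math.ceil(total_rows / requests_per_batch)
    return max(batches - 1, 0) * delay_seconds


print(minimum_wait_seconds(5000, 1000, 3))       # 5 batches, 4 pauses
print(minimum_wait_seconds(1_000_000, 1000, 3))  # 1000 batches, 999 pauses
```

Under these assumptions the pauses alone add up to 12 seconds for 5000 rows and 2997 seconds (about 50 minutes) for a million rows, so the real limits in the API documentation largely decide how long the full dataset takes.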


It worked :slight_smile: Thanks @bruno29a

Thank you so much. I will use JSON Path to extract the exact information I need.
I will reach out to the community if I need more help.


Nice @MichaelEkine , good stuff! :+1: