Loop Variables for API

tc020 · December 6, 2022, 2:41pm

Hi guys,

I want to get all results of an API. To get them, I need to loop over different “cursors” (you can read how it works exactly here: API - OpenAlex API documentation).

I tried to build a loop block with the nodes Generic Loop Start and Variable Condition Loop End. This works so far. But I can’t figure out how I could flexibly change the cursor in the GET Request node.

Do you have any ideas how I can solve this problem?

Best regards,
Tim

kabrah · December 8, 2022, 10:01am

Hi tc020,

I didn’t test this myself with the GET Request, but I think the concept that you will need to solve your issue is recursive loops. You can set the parameter “End loop with variable” in the Recursive Loop End node so that you can end the loop based on a variable rule (e.g. there is no next cursor value).

The screenshot below shows how I would structure it, but of course in your case you would need to replace the Math Formula node with a GET Request node. You’ll also have to convert the received content from the JSON/XML you receive.
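
For anyone who prefers to see the idea as code, here is a minimal Python sketch of the same recursive pattern. It is not taken from any workflow in this thread: `FAKE_PAGES` and `fake_get` are hypothetical stand-ins for the GET Request node, and the `*` start cursor and the `meta.next_cursor` field are assumptions based on the OpenAlex cursor paging documentation.

```python
from typing import Callable, List

# Hypothetical canned responses standing in for real API calls.
FAKE_PAGES = {
    "*": {"results": [1, 2], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [3, 4], "meta": {"next_cursor": "def"}},
    "def": {"results": [5], "meta": {"next_cursor": None}},
}


def fake_get(cursor: str) -> dict:
    """Stand-in for the GET Request node: returns the page for a cursor."""
    return FAKE_PAGES[cursor]


def fetch_all(get_page: Callable[[str], dict], start_cursor: str = "*") -> List[int]:
    """Collect results page by page until no next cursor is returned."""
    results: List[int] = []
    cursor = start_cursor
    while cursor is not None:  # same role as "End loop with variable"
        page = get_page(cursor)
        results.extend(page["results"])  # the first page is collected too
        cursor = page["meta"]["next_cursor"]  # fed back into the next iteration
    return results


all_results = fetch_all(fake_get)
```

The cursor returned by one request becomes the input of the next one, which is the part a plain counting loop cannot do and the reason a recursive loop fits here.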

Cheers,
Karen

tc020 · December 8, 2022, 1:52pm

Hi @kabrah

thank you for your proposed solution. I think this could work. I wanted to try it, but the Recursive Loop End node says that it can’t convert the %s (I think string?) to boolean.
Do you have any ideas how I can cast it to a boolean in the Rule Engine Variable node, if I want to check whether there is still a cursor?

Best wishes,
Tim

Daniel_Weikert · December 8, 2022, 6:22pm

The String Manipulation nodes offer functions like toBoolean. Maybe you could check those.
br

tc020 · December 8, 2022, 8:35pm

Hi, thanks for the idea, but I can’t check this way whether there is a string or not, because the string always converts to false. But your idea brought me to another one: I can check whether the string is null. If it is not, the result is false. If it is, the result is (I don’t know why) also null. But I can turn this boolean null into true with a Missing Value node.

It’s certainly not the most elegant way, but it works now.

Thanks to you two! Without you I would still be in the dark!

badger101 · December 9, 2022, 2:42am

@tc020 Can you share the workflow here without any data?

kabrah · December 12, 2022, 9:46am

If you want to have a boolean value, you could use the Variable Expressions node. Although actually in the screenshot, the used variable is also of type string. I’ve uploaded the workflow in case that helps!
recursionRuleEnd.knwf (16.3 KB)

kabrah · December 12, 2022, 9:48am

Really cool that you figured it out @tc020 ! This would be a great workflow to post on hub.knime.com if you feel like it - would be interesting for many I think.

tc020 · December 12, 2022, 2:39pm

Sorry for the late reply, I was very busy over the weekend. Of course I can share my solution with you guys. But I’m quite new to the whole KNIME topic and I don’t know whether this solution is efficient enough that other users should follow it. You are welcome to have a look at it yourself. And if you think it’s good enough, I can still upload it to hub.knime.com.

Task3-2.knwf (72.4 KB)

My example is now with the test string for workflows in the year 1810

If you need more explanation of the nodes, just ping me. I think the most important thing to know about this workflow is that you have to change the search string in 2 nodes (I marked them as ‘SEARCH STRING’).

Best wishes,
Tim

badger101 · January 7, 2023, 8:10am

@tc020 Thanks for the workflow. I have run the GET Request set up in the workflow, and will keep it aside for now and study it later tonight. I might find it useful to reverse engineer it for what I’m about to do. Will tag you if I find any issue.

tc020 · January 7, 2023, 4:58pm

Oh, sorry for not updating. I already found an issue @badger101
The loop skips the first result of the get request. I fixed this problem. Here is the revised version.

openAlex-API-revised.knwf (68.2 KB)

I would be very grateful if you could check this version for other issues.

badger101 · January 8, 2023, 11:19am

Hi @tc020 , I have studied the workflow you shared, and looked at other people’s work along these lines too. I am trying to make a loop for YouTube’s API, but the recursive loop is a little bit overwhelming considering this is my first time. I wonder if you could take a look at my workflow. Anyone with a Google account can apply for the API key, which only takes a few minutes to set up (there are a few YouTube videos covering how to get the key).

I would like to iterate over all pagination pages when scraping YouTube comments for a given video. This is similar to your case: your check on cursors corresponds to my check on next-page tokens, although the YouTube API behaves differently. So here’s what I managed to do so far:

I managed to get one row of JSON (each row contains a max of 100 comments), but I’m looking for a recursive loop method like yours to loop through all pages and get more JSON rows. The loop workaround should be inserted into the purple box as I annotated in the workflow.

Here’s the workflow for you or anyone else to have a look:
Pagination YouTube API.knwf (68.8 KB)

Thank you in advance!

Also tagging @iCFO , @ArjenEX if you guys can take a look at this. Would highly appreciate it.

ArjenEX · January 8, 2023, 5:49pm

@badger101

I’ll take the challenge. WIP:

Testing with maxResults=10 and a random video with 61 comments to force the looping process. The recursive loop automatically handles the pageToken and retrieves 42 comments, so 5 iterations have been performed (42 + 10 makes 52 comments in total, including the base GET request). It automatically exits the loop whenever pageToken = null.

So far so good! Going to look for some tweaks and optimizations.

badger101 · January 8, 2023, 7:49pm

Thanks for looking into this @ArjenEX ! I’m in no hurry, so please take as much time as you’d like! I’ll be checking from time to time

P.S. If there is going to be a lot of back and forth and you think I should make a new thread, let me know. Otherwise, I’m happy to just stay here considering the similarities of the two cases.

iCFO · January 8, 2023, 9:44pm

I am not about to get into an API call efficiency showdown against @ArjenEX !

ArjenEX · January 8, 2023, 10:24pm

@badger101

Went back and forth a few times between different setups but ended up with these two that stuck. They produce the same output.

General notes:

Recursive Loop is the way to go here. It’s tricky to master but it has to be used since the loop needs to “remember” what was just queried and what the URL of the next GET request should subsequently be.
You need to properly control the exit of the loop. In this case, whenever the nextPageToken is not retrievable, it means no further pagination is available. This can be checked with a pretty simple evaluation in a Column Expression node:

if (column("nextPageToken") == null) {
    true
} else {
    false
}

With this boolean you can make the Recursive Loop End exit variable controlled. It will override the maximum number of iterations and exit accordingly.

I would personally stay away from JSON to Table as much as possible. It’s quite a system breaker if you have a properly sized JSON. If you only need to retrieve the nextPageToken, then a JSON Path query works a lot faster (there is only one instance of it in the JSON). Retrieval can be achieved via $['nextPageToken']
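
For comparison, here is a rough Python sketch of what the JSON Path query plus the exit check amount to. It is only an illustration: the function names and both sample response bodies (including the token value) are made up, and the only thing taken from the workflow is that nextPageToken sits at the top level of the response.

```python
import json
from typing import Optional


def next_page_token(response_body: str) -> Optional[str]:
    """Rough equivalent of the JSON Path query $['nextPageToken'].

    Returns None when the key is absent, i.e. on the last page.
    """
    return json.loads(response_body).get("nextPageToken")


def should_exit(token: Optional[str]) -> bool:
    """Exit flag for the loop: True once there is no further page."""
    return token is None


# Made-up response bodies, for illustration only.
middle_page = '{"nextPageToken": "QURTSl9p", "items": []}'
last_page = '{"items": []}'

middle_token = next_page_token(middle_page)
last_token = next_page_token(last_page)
```

Reading a single top-level key like this avoids flattening the whole document, which is the same reason the JSON Path node is cheaper than JSON to Table here.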

Same applies for the actual content btw. I figured it’s not very likely that you need every single property. I would only get what you need. For example, I opted for the text and the date. Ensure that you have the List property enabled. Put an Ungroup node after it and it’s all done.
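
As a sketch of that "only get what you need" step in Python: the code below pulls just the text and the date of each comment and returns one row per comment, similar to a list-valued JSON Path followed by Ungroup. The nested field names (`items`, `snippet`, `topLevelComment`, `textDisplay`, `publishedAt`) are assumptions about the shape of the YouTube comment response, and the sample data is invented.

```python
import json
from typing import List, Tuple

# Invented sample body; the nesting is an assumed response layout.
sample_body = json.dumps({
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "First!",
            "publishedAt": "2023-01-08T10:00:00Z"}}}},
        {"snippet": {"topLevelComment": {"snippet": {
            "textDisplay": "Nice video",
            "publishedAt": "2023-01-08T11:30:00Z"}}}},
    ]
})


def extract_comments(body: str) -> List[Tuple[str, str]]:
    """Return one (text, date) row per comment, ignoring all other fields."""
    rows = []
    for item in json.loads(body).get("items", []):
        comment = item["snippet"]["topLevelComment"]["snippet"]
        rows.append((comment["textDisplay"], comment["publishedAt"]))
    return rows


rows = extract_comments(sample_body)
```

A body without any items simply gives an empty list, so a video with zero comments does not need special handling in this sketch.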

The slight difference between option A and B is mainly around how you would like to manage both streams (pagination and non-pagination). The “A-way” is to leave the If-switch uncontrolled and set its ports to Both. The top stream handles the pagination or is otherwise blocked by the empty table switch. The bottom stream always takes care of the first page or all comments. I didn’t touch the creation of the new URL etc.
The “B-way” separates this more strictly. It evaluates if the nextPageToken is present or not and assigns a single port accordingly. You can then control all flows (pagination - first page, pagination - other pages, non-pagination) individually.
I didn’t fully stick to your original setup, but these two methods require fewer nodes, for example by taking out the 3rd GET Request that you have.

Execution time:
Option A - 192 ms.
Option B - 227 ms.

Other than that I think you’ll find your way around.

WF: Pagination YouTube API version A_EX.knwf (123.6 KB)

Hope this helps!

PS: Apply your own token again first

badger101 · January 9, 2023, 6:31am

Great! I am in the process of studying it now. I have tested with multiple videos with varying numbers of comments, and I find that Option B doesn’t have any issues with videos that have zero comments or videos with comments restricted (not allowed/closed). Meanwhile, Option A gives an error for the latter.

I will therefore choose Option B and will modify it accordingly. Will update here later today.

badger101 · January 9, 2023, 8:02am

Update: Thank you @ArjenEX , I managed to use Option B to my liking, with some modifications! Many thanks for your help, I really appreciate it!

Interesting things I found out about YouTube API along the way:

Spam comments detected by YouTube were not included in the API data.
Comments manually hidden by the video owners were included in the API data, but there’s no marker (identifier metadata) for these.
