Iterating through big data data frame, calculating smaller data frames

Hello, this is my first post here and it’s based upon an issue I’ve created and tried to solve at work. I’ll try to precisely summarize my issue as I’m having trouble wrapping my head around a preferred solution. #3 is the real stumper for me.

  1. Grab a large data file based on a parquet - no problem

  2. Select 5 columns from the parquet and create a dataframe - no problem
    import pandas
    df = pd.read_parquet(’/Users/marmicha/Downloads/sample.parquet’,
    columns=[“ts”, “session_id”, “event”, “duration”, “tags__artifact”])

  3. But here is where it gets a bit tricky for me. One column(a key column) is called “session_id” . Many values are unique. Many duplicate values(of session_id) exist and have multiple associated entry rows of data. I wish to iterate through the master dataframe, create a unique dataframe per session_id. Each of these unique (sub) dataframes would have a calculation done that simply gets the SUM of the “duration” column per session_id. Again that SUM would be unique per unique session_id, so each sub dataframe would have it’s own SUM with a row added with that total listed along with the session_id I’m thinking there is a nested loop formula that will work for me but every effort has been a mess to date.

  4. Ultimately, I’d like to have a final dataframe that is a collection of these unique sub dataframes. I guess I’d need to define this final dataframe, and append it with each new sub dataframe as I iterate through the data. I should be able to do that simply

  5. Finally, write this final df to a new parquet file.     Should be simple enough so I won't need help with that.      
    

But that is my challenge in a nutshell. The main design I’d need help with is #3. I’ve played with interuples and iterows

Hi @marcus8

Sometimes, figuring out how to slice and dice a pandas.DataFrame in exactly the way you want can be challenging. Being able to tease apart the steps visually (hello KNIME!) can be a huge help.

I am not sure I caught all of what you explained in your 3rd step but I’ll offer this pointer: it really sounds like you want to check out the GroupBy node. If you pass this pandas.DataFrame out of a Python node, then connect to a GroupBy node, you can have it group together rows belonging to each and every unique “session_id” and then the “aggregation” you want to perform on each group would be the SUM operation on your “duration” column. Check out the output from that GroupBy node to decide what you want to do next. From what you describe, I’m not sure you need to do any extra looping but I might be misunderstanding something.

Hope this helps,

Davin

3 Likes