Hello, this is my first post here, and it's based on an issue I created and have tried to solve at work. I'll try to summarize it precisely, as I'm having trouble wrapping my head around a preferred solution. #3 is the real stumper for me.
1. Grab a large data file stored as parquet - no problem.
2. Select 5 columns from the parquet and create a dataframe - no problem:
import pandas as pd
df = pd.read_parquet('/Users/marmicha/Downloads/sample.parquet',
                     columns=["ts", "session_id", "event", "duration", "tags__artifact"])
3. But here is where it gets a bit tricky for me. One column (a key column) is called "session_id". Some of its values are unique, but many are duplicated, so a single session_id can have multiple rows of data. I wish to iterate through the master dataframe and create a separate dataframe per session_id. Each of these sub-dataframes would have a calculation done that simply gets the SUM of the "duration" column for that session_id. Each sub-dataframe would therefore have its own SUM, with a row added that lists that total along with the session_id. I'm thinking there is a nested-loop approach that will work for me, but every attempt has been a mess to date.
4. Ultimately, I'd like to have a final dataframe that is a collection of these sub-dataframes. I guess I'd need to define this final dataframe and append each new sub-dataframe to it as I iterate through the data. I should be able to do that simply.
5. Finally, write this final df to a new parquet file. That should be simple enough that I won't need help with it.
But that is my challenge in a nutshell. The main design I need help with is #3. I've played with itertuples and iterrows.
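
To make the goal in #3 concrete, here is a minimal sketch of the result I'm after, run on made-up data instead of the real file. It uses groupby rather than the nested loop I described, so treat it as one possible direction and not a confirmed solution. The sample values, the "TOTAL" marker in the "event" column, and the names `totals`, `pieces` and `final_df` are all hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the real parquet data (hypothetical values).
df = pd.DataFrame({
    "ts": [1, 2, 3, 4, 5],
    "session_id": ["a", "b", "a", "c", "b"],
    "event": ["start", "start", "stop", "start", "stop"],
    "duration": [10, 5, 20, 7, 15],
    "tags__artifact": ["x", "y", "x", "z", "y"],
})

# One total per session_id, with no explicit loop.
totals = (
    df.groupby("session_id", as_index=False)["duration"]
      .sum()
      .rename(columns={"duration": "total_duration"})
)

# If each session's rows should stay together with a summary row appended,
# build one sub-dataframe per session and concatenate them once at the end.
pieces = []
for session_id, sub_df in df.groupby("session_id", sort=False):
    total_row = pd.DataFrame({
        "session_id": [session_id],
        "event": ["TOTAL"],  # hypothetical marker for the summary row
        "duration": [sub_df["duration"].sum()],
    })
    pieces.append(pd.concat([sub_df, total_row], ignore_index=True))

final_df = pd.concat(pieces, ignore_index=True)
```

If only the per-session totals are needed, `totals` alone may be enough and the loop can be dropped. Either result could then be written out with `DataFrame.to_parquet` for step 5.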