Hello-
I'm trying to use psaw, the Pushshift.io Python API wrapper (GitHub - dmarx/psaw), to run searches on Reddit posts. My workflow reads a table of topics and uses a 'Table Row To Variable Loop' to pass each topic to the Python script. The script below works inconsistently: it always runs fine inside the Python node's configuration window, but when I run it within the workflow it fails with "Execute failed: ("NullPointerException"): null".
I have tried everything I can think of, such as converting fields to strings in case they are arrays and trying both Flatbuffers and Apache Arrow for the table transfer, but I can't figure it out.
My code is below; I'd love to hear if anybody can get it to work. I think this would make it easy for others to search Reddit for topics too.
Thanks!
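One thing I've been trying to rule out: if a topic returns no submissions, the DataFrame comes back completely empty (no columns at all), so every later column access fails, and I wonder if KNIME surfaces that as the opaque NullPointerException. A minimal guard I sketched for this (the fixed schema here is just illustrative, not the full Pushshift field list):

```python
import pandas as pd

# Stand-in for the real result list: [obj.d_ for obj in gen]
records = []

data = pd.DataFrame(records)
if data.empty:
    # Emit an empty table with a known schema instead of letting
    # data['title'] etc. raise on a column-less DataFrame.
    data = pd.DataFrame(columns=['title', 'subreddit', 'url'])
```
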
from psaw import PushshiftAPI
import pandas as pd

# The loop passes the topic in as a flow variable; it may arrive as a
# list, so join it into a plain string before querying.
topicz = flow_variables['Topic']
topic = ''.join(str(elem) for elem in topicz) if isinstance(topicz, (list, tuple)) else str(topicz)
#topic = 'alexa'  # hard-coded value I used for testing in the config window

api = PushshiftAPI()
gen = api.search_submissions(q=topic, limit=100)

# Build a table from the raw submission payloads.
data = pd.DataFrame([obj.d_ for obj in gen])
#print(data.dtypes)

# Drop nested/object columns that the table transfer can't handle.
data = data.drop(["media", "secure_media", "secure_media_embed", "allow_live_comments", "awarders", "all_awardings", "author_cakeday", "author_flair_background_color", "author_flair_css_class", "author_flair_richtext", "author_flair_template_id", "author_flair_text", "author_flair_text_color", "author_flair_type", "author_patreon_flair", "can_mod_post", "contest_mode", "gildings", "is_robot_indexable", "parent_whitelist_status", "pinned", "post_hint", "preview", "pwls", "retrieved_on", "media_embed", "suggested_sort", "thumbnail", "thumbnail_height", "thumbnail_width", "total_awards_received", "link_flair_background_color", "link_flair_css_class", "link_flair_richtext", "link_flair_template_id", "link_flair_text", "link_flair_text_color", "link_flair_type", "crosspost_parent_list", "crosspost_parent", "send_replies", "steward_reports", "media_metadata", "wls", "spoiler", "stickied", "edited", "content_categories", "events", "eventsOnRender", "third_party_trackers"], axis=1, errors='ignore')
# Cast the remaining mixed-type columns to plain strings so they
# transfer cleanly; skip any column missing from this result set.
str_cols = ['author', 'author_fullname', 'created', 'domain', 'full_link',
            'id', 'locked', 'media_only', 'no_follow', 'og_description',
            'og_title', 'permalink', 'score', 'selftext', 'title',
            'subreddit', 'subreddit_id', 'subreddit_type', 'url']
for col in str_cols:
    if col in data.columns:
        data[col] = data[col].astype(str)
output_table = data.copy()
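In case it helps, one generic alternative I also tried instead of listing columns one by one: convert every object-dtype column to strings in a single pass, so no nested lists or dicts from the Pushshift payload ever reach the table transfer (just a sketch of the idea):

```python
import pandas as pd

def stringify_object_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every object-dtype column holds plain strings.

    Nested values (lists/dicts in the Pushshift payload) become their
    str() representation, which sidesteps serialization issues when the
    table is handed back to the host application.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            out[col] = out[col].astype(str)
    return out
```
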