[SOLVED] Python Execute failed: ("NullPointerException"): null (Pushshift/PSAW Reddit search)

#1

Hello-
I’m trying to use the psaw Pushshift Python API Wrapper (https://github.com/dmarx/psaw) to run searches on Reddit posts. My workflow uses a table of topics, then uses a ‘table row to variable loop’ to pass the topics to the Python script. My script below works inconsistently. It always works within the Python config window, but when I run it within the workflow it fails with an “Execute failed: (“NullPointerException”): null” error.

I have tried converting fields to strings in case they are arrays, and I have tried both Flatbuffers and Apache Arrow for the transfer, but I can’t seem to figure it out.

My code is below, I’d love to hear if anybody is able to get it to work. I think this would be helpful for others to be able to easily search Reddit for topics.
Thanks!

from pandas import DataFrame

# Create empty table

from psaw import PushshiftAPI
import pandas as pd
import numpy as np

topicz = flow_variables['Topic']
#topic = ''.join([str(elem) for elem in topicz])
topic = 'alexa'

api = PushshiftAPI()
#gen = api.search_submissions(q=flow_variables['Topic'], limit=100)

gen = api.search_submissions(q=topic, limit=100)
#gen = api.search_submissions(q='fire', limit=100)

data = pd.DataFrame([obj.d_ for obj in gen])
#print(data.dtypes)
data = data.drop(["media", "secure_media", "secure_media_embed", "allow_live_comments", "awarders", "all_awardings", "author_cakeday", "author_flair_background_color", "author_flair_css_class", "author_flair_richtext", "author_flair_template_id", "author_flair_text", "author_flair_text_color", "author_flair_type", "author_patreon_flair", "can_mod_post", "contest_mode", "gildings", "is_robot_indexable", "parent_whitelist_status", "pinned", "post_hint", "preview", "pwls", "retrieved_on", "media_embed", "suggested_sort", "thumbnail", "thumbnail_height", "thumbnail_width", "total_awards_received", "link_flair_background_color", "link_flair_css_class", "link_flair_richtext", "link_flair_template_id", "link_flair_text", "link_flair_text_color", "link_flair_type", "crosspost_parent_list", "crosspost_parent", "send_replies", "steward_reports", "media_metadata", "wls", "spoiler", "stickied", "edited", "content_categories", "events", "eventsOnRender", "third_party_trackers"], axis=1, errors='ignore')

#data['allow_live_comments'] = data['allow_live_comments'].astype('str')
data['author'] = data['author'].astype('str')
data['author_fullname'] = data['author_fullname'].astype('str')
#data['awarders'] = data['awarders'].astype('str')
#data['content_categories'] = data['content_categories'].astype('str')
data['created'] = data['created'].astype('str')
#data['created_utc'] = data['created_utc'].astype('str')
#data['crosspost_parent'] = data['crosspost_parent'].astype('str')
#data['crosspost_parent_list'] = data['crosspost_parent_list'].astype('str')
data['domain'] = data['domain'].astype('str')
data['full_link'] = data['full_link'].astype('str')
data['id'] = data['id'].astype('str')
#data['is_crosspostable'] = data['is_crosspostable'].astype('str')
#data['is_meta'] = data['is_meta'].astype('str')
#data['is_original_content'] = data['is_original_content'].astype('str')
#data['is_reddit_media_domain'] = data['is_reddit_media_domain'].astype('str')
#data['is_self'] = data['is_self'].astype('str')
#data['is_video'] = data['is_video'].astype('str')
data['locked'] = data['locked'].astype('str')
#data['media'] = data['media'].astype('str')
#data['media_metadata'] = data['media_metadata'].astype('str')
data['media_only'] = data['media_only'].astype('str')
data['no_follow'] = data['no_follow'].astype('str')
data['og_description'] = data['og_description'].astype('str')
data['og_title'] = data['og_title'].astype('str')
#data['over_18'] = data['over_18'].astype('str')
data['permalink'] = data['permalink'].astype('str')
data['score'] = data['score'].astype('str')
#data['secure_media'] = data['secure_media'].astype('str')
#data['secure_media_embed'] = data['secure_media_embed'].astype('str')
data['selftext'] = data['selftext'].astype('str')
#data['send_replies'] = data['send_replies'].astype('str')
#data['spoiler'] = data['spoiler'].astype('str')
#data['steward_reports'] = data['steward_reports'].astype('str')
#data['stickied'] = data['stickied'].astype('str')
#data['whitelist_status'] = data['whitelist_status'].astype('str')
#data['wls'] = data['wls'].astype('str')
#data['content_categories'] = data['content_categories'].apply(np.ravel)
#data['content_categories'] = data['content_categories'].astype('str')
#data['content_categories'] = ' '.join([str(elem) for elem in data['content_categories']])
data['title'] = data['title'].astype('str')
data['subreddit'] = data['subreddit'].astype('str')
data['subreddit_id'] = data['subreddit_id'].astype('str')
#data['subreddit_subscribers'] = data['subreddit_subscribers'].astype('str')
data['subreddit_type'] = data['subreddit_type'].astype('str')
data['url'] = data['url'].astype('str')

output_table = data.copy()


#2

A brief update: I was able to get the following to work. However, it gives an error if the number requested is above 100 or so: “Execute failed: ‘float’ object is not iterable”

For reference, this script seems to work in Jupyter. My guess is still that the error arises when the data is transferred from the Python DataFrame back to KNIME.

from pandas import DataFrame

# Create empty table

from psaw import PushshiftAPI
import pandas as pd
import numpy as np

topicz = flow_variables['Topic']
topic = ''.join([str(elem) for elem in topicz])

api = PushshiftAPI()
#gen = api.search_submissions(q=flow_variables['Topic'], limit=100)

gen = api.search_submissions(q=topic, limit=100)
#gen = api.search_submissions(q='fire', limit=100)

data = pd.DataFrame([obj.d_ for obj in gen])
#print(data.dtypes)
output_table = data.drop(["media", "secure_media", "secure_media_embed", "allow_live_comments", "awarders", "all_awardings", "author_cakeday", "author_flair_background_color", "author_flair_css_class", "author_flair_richtext", "author_flair_template_id", "author_flair_text", "author_flair_text_color", "author_flair_type", "author_patreon_flair", "can_mod_post", "contest_mode", "gildings", "is_robot_indexable", "parent_whitelist_status", "pinned", "post_hint", "preview", "pwls", "retrieved_on", "media_embed", "subreddit", "subreddit_id", "subreddit_subscribers", "subreddit_type", "suggested_sort", "thumbnail", "thumbnail_height", "thumbnail_width", "title", "total_awards_received", "url", "link_flair_background_color", "link_flair_css_class", "link_flair_richtext", "link_flair_template_id", "link_flair_text", "link_flair_text_color", "link_flair_type", "crosspost_parent_list", "crosspost_parent", "send_replies", "steward_reports", "media_metadata"], axis=1, errors='ignore')

#output_table['allow_live_comments'] = output_table['allow_live_comments'].astype('str')
output_table['author'] = output_table['author'].astype('str')
output_table['author_fullname'] = output_table['author_fullname'].astype('str')
#output_table['awarders'] = output_table['awarders'].astype('str')
#output_table['content_categories'] = output_table['content_categories'].astype('str')
output_table['created'] = output_table['created'].astype('str')
output_table['created_utc'] = output_table['created_utc'].astype('str')
#output_table['crosspost_parent'] = output_table['crosspost_parent'].astype('str')
#output_table['crosspost_parent_list'] = output_table['crosspost_parent_list'].astype('str')
output_table['domain'] = output_table['domain'].astype('str')
output_table['full_link'] = output_table['full_link'].astype('str')
output_table['id'] = output_table['id'].astype('str')
output_table['is_crosspostable'] = output_table['is_crosspostable'].astype('str')
output_table['is_meta'] = output_table['is_meta'].astype('str')
output_table['is_original_content'] = output_table['is_original_content'].astype('str')
output_table['is_reddit_media_domain'] = output_table['is_reddit_media_domain'].astype('str')
output_table['is_self'] = output_table['is_self'].astype('str')
output_table['is_video'] = output_table['is_video'].astype('str')
output_table['locked'] = output_table['locked'].astype('str')
#output_table['media'] = output_table['media'].astype('str')
#output_table['media_metadata'] = output_table['media_metadata'].astype('str')
output_table['media_only'] = output_table['media_only'].astype('str')
output_table['no_follow'] = output_table['no_follow'].astype('str')
output_table['og_description'] = output_table['og_description'].astype('str')
output_table['og_title'] = output_table['og_title'].astype('str')
output_table['over_18'] = output_table['over_18'].astype('str')
output_table['permalink'] = output_table['permalink'].astype('str')
output_table['score'] = output_table['score'].astype('str')
#output_table['secure_media'] = output_table['secure_media'].astype('str')
#output_table['secure_media_embed'] = output_table['secure_media_embed'].astype('str')
output_table['selftext'] = output_table['selftext'].astype('str')
#output_table['send_replies'] = output_table['send_replies'].astype('str')
output_table['spoiler'] = output_table['spoiler'].astype('str')
#output_table['steward_reports'] = output_table['steward_reports'].astype('str')
output_table['stickied'] = output_table['stickied'].astype('str')
output_table['whitelist_status'] = output_table['whitelist_status'].astype('str')
output_table['wls'] = output_table['wls'].astype('str')
output_table['content_categories'] = output_table['content_categories'].astype('str')

print(output_table.dtypes)
#print(output_table.head())
#print(topicz)


#3

Hi bigjoedata,

Yes, I believe so, too. Could you please upload your knime.log file here? I’d hope to find some stack trace in there that helps to pin down the underlying cause. Thanks!

Marcel


#4

Log file attached. I did notice that it refers to “org.knime.python2.kernel.PythonKernel”. Not sure if that means it’s trying to use the py2 kernel instead of py3. I do have it configured as py3 though.

knime.log (9.1 KB)


#5

In your knime.log attachment, I see a traceback from your Python code suggesting it could not find the key 'og_description' in your pandas DataFrame:
KeyError: 'og_description'

Could it be possible that this column’s data is sometimes absent in the output from api.search_submissions(q=topic, limit=100) and hence does not always make it into the pandas DataFrame?
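
To illustrate how that could happen: pandas builds its columns from the union of keys found across the records it is given, so a key that no record in a particular batch carries simply never becomes a column. Below is a minimal sketch using made-up records (not real Pushshift output), with `reindex` as one possible way to guarantee a fixed set of columns.

```python
import pandas as pd

# Made-up records standing in for the dicts psaw returns; real results differ.
records = [
    {"id": "a1", "title": "first post", "score": 3},
    {"id": "b2", "title": "second post", "og_description": "preview text"},
]

# One record carries the key, so the column exists (NaN where it is absent).
with_key = pd.DataFrame(records)

# No record carries the key, so the column does not exist at all.
without_key = pd.DataFrame(records[:1])

wanted = ["id", "title", "score", "og_description"]
# reindex adds any missing column, filled with NaN, instead of raising KeyError.
fixed = without_key.reindex(columns=wanted)
```

After the `reindex`, a later `fixed['og_description'].astype('str')` would no longer raise a KeyError, although the missing values would then become the string 'nan'.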

The error message you describe in your post (“Execute failed: ‘float’ object is not iterable”) does not appear in the log you supplied. Working backwards through the code you shared, it might originate from the line data = pd.DataFrame([obj.d_ for obj in gen]) if gen turned out to be a float, but I suspect it more likely comes from the line topic = ''.join([str(elem) for elem in topicz]) if topicz were a float by mistake.
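
For what it is worth, that message is what Python raises whenever code tries to iterate over a float, and pandas represents a missing cell as NaN, which is a float. Here is a small sketch of a defensive conversion. `topic_to_query` is a hypothetical helper name, not part of psaw or KNIME, and joining several terms with a space is my assumption about what the search should receive.

```python
import math


def topic_to_query(value):
    """Turn a flow variable or table cell into a search string."""
    if value is None:
        return ""
    if isinstance(value, (int, float)):
        # NaN is how pandas marks a missing cell; iterating it raises TypeError.
        return "" if math.isnan(value) else str(value)
    if isinstance(value, str):
        return value.strip()
    # Anything else is assumed to be a collection of terms (list, Series, ...).
    return " ".join(str(elem) for elem in value)
```

An empty result could then be used to skip the search for that row rather than letting the node fail.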

Given the nature of psaw as a tool for reading data from a remote, changing source (Reddit), this is somewhat difficult to help diagnose. That’s perhaps reflected in the error message you describe differing from the error message seen in the log file you attached. At present, I see nothing to suggest that there is any issue transferring the pandas DataFrame from Python to KNIME.

A few suggestions for things to try to chase down what is actually going on:

  1. As a diagnostic step, try running this code in the Python node without attempting to return the desired DataFrame at all. Return some static data instead. Try running the node repeatedly – if errors persist, you have proven the transfer of data back to KNIME is not the problem.
  2. As a diagnostic step, try running this code in Jupyter repeatedly and don’t involve KNIME at all. Can you reproduce any of the misbehaviors you’ve witnessed so far?
  3. Try handling the KeyError situation reported in your knime.log by putting all of your column manipulation work with output_table inside a try block – something like this might help:
for column_name in ['secure_media', 'send_replies', ...all the rest...]:
    try:
        output_table[column_name] = output_table[column_name].astype('str')
    except KeyError:
        pass
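
An alternative to the try block, sketched under the assumption that you only want to cast the columns that actually arrived and would like to know which ones did not. `cast_present_columns` is an illustrative name only.

```python
import pandas as pd


def cast_present_columns(frame, column_names, dtype="str"):
    """Cast the listed columns that exist in frame; report the ones that do not."""
    present = [name for name in column_names if name in frame.columns]
    missing = [name for name in column_names if name not in frame.columns]
    # astype with a dict casts only the named columns and returns a new frame.
    result = frame.astype({name: dtype for name in present})
    return result, missing
```

Printing the returned list of missing names on each run would show which fields Pushshift left out that time.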

I hope the above is helpful. At this point, I am tempted to guess that the issues you are experiencing all arise from the variable behaviors of psaw itself, not that psaw is to be faulted but the nature of what it attempts to do, dynamically reading from Reddit, introduces such variations. If that guess is even partially right, it should be evident when running your code repeatedly whether KNIME is involved or not.
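
One way to make that variation visible, sketched here with made-up batches rather than live Pushshift calls, is to count how often each field shows up across repeated runs. `field_presence` is a hypothetical helper.

```python
from collections import Counter


def field_presence(batches):
    """Count, per field name, how many batches contain it in at least one record."""
    counts = Counter()
    for records in batches:
        seen = set()
        for record in records:
            seen.update(record.keys())
        counts.update(seen)
    return counts


# Made-up stand-ins for the results of three separate runs.
batches = [
    [{"id": "a1", "og_description": "x"}, {"id": "b2"}],
    [{"id": "c3"}],
    [{"id": "d4", "og_title": "y"}],
]
presence = field_presence(batches)
# Fields that were absent from at least one run.
unstable = sorted(name for name, n in presence.items() if n < len(batches))
```

With psaw, each batch would be the list [obj.d_ for obj in gen] collected from one run of the search.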


#6

Excellent, it is working now. Thank you so much @potts, your loop worked quite well and I appreciate your thorough response!

Here is the final code I used, in case anybody else would like to use it to easily pull from Reddit. I have tested it up to limit=10000 many times without issue, though I’ll probably continue to refine from here. Pushshift has a ton of potential! I am using this code within KNIME to loop through a table of topics.

Once again, thanks to @

# Copy input to output

from pandas import DataFrame

# Create empty table

output_table = DataFrame()

from psaw import PushshiftAPI
import pandas as pd
import numpy as np

topicz = input_table['Topic']
topic = ''.join([str(elem) for elem in topicz])

api = PushshiftAPI()

gen = api.search_submissions(q=topic, limit=100, over_18='False')

data = pd.DataFrame([obj.d_ for obj in gen])

# Numeric columns: copy as-is when present
for column_name in ['created', 'created_utc', 'score', 'num_comments', 'num_crossposts']:
    try:
        output_table[column_name] = data[column_name]
    except KeyError:
        pass

# Text-like columns: cast to str when present
for column_name in ['id', 'author', 'author_fullname', 'selftext', 'title', 'subreddit', 'subreddit_id', 'subreddit_type', 'url', 'permalink', 'domain', 'full_link', 'thumbnail', 'locked', 'media_only', 'no_follow', 'is_original_content', 'whitelist_status', 'crosspost_parent', 'crosspost_parent_list']:
    try:
        output_table[column_name] = data[column_name].astype('str')
    except KeyError:
        pass

output_table['topic'] = topic

