How to pull data from a website that does not have an API?

Hi all, I need to pull data from a website that does not provide an API: Site Map - Multpl

Could I get some help with this? I need to extract the table from the monthly data page behind each of the links in the above sitemap.

Example page: S&P 500 PE Ratio by Month - Multpl

Hi @tone_n_tune

Have a look at this response. Maybe it can help you.

Best regards


Here’s a workflow which may help get you started. My XPath skills are pretty limited, and I’ve been unable to get the values to parse correctly. Maybe someone with better XPath knowledge can help. The XML seems to be pretty poorly formed.
Simple Web Scraper.knwf (112.8 KB)
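For anyone who wants to try the same extraction outside KNIME, here is a minimal Python sketch. It uses the standard library's `html.parser`, which tolerates poorly formed markup (for example missing closing tags) better than a strict XML/XPath parser does. The markup in `SAMPLE_HTML` is a made-up stand-in, not the real page source, and the names `TableRows` and `parse_table` are hypothetical helpers invented for this illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._flush_row()  # a new <tr> implicitly closes an unclosed row
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # a new cell implicitly closes an unclosed cell
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag in ("tr", "table"):
            self._flush_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _flush_cell(self):
        if self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _flush_row(self):
        self._flush_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def parse_table(html):
    """Return all table rows found in `html` as lists of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    parser._flush_row()  # pick up a final row that was never closed
    return parser.rows


# Hypothetical sample markup; the second row is deliberately left unclosed.
SAMPLE_HTML = """
<table id="datatable">
  <tr><th>Date</th><th>Value</th></tr>
  <tr><td>Jul 18, 2024</td><td>28.50
  <tr><td>Jul 1, 2024</td><td>28.10</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

To run this against a live page you would first download the HTML, for instance with `urllib.request.urlopen`, and pass the decoded text to `parse_table`. Whether the real page's table matches this simple row/cell layout is an assumption you would need to check.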



Here’s a csv file from this website:

sp-500-pe-ratio-price-to-earnings-chart.csv (23.0 KB)



@tomljh provided a solution.
Simple Web Scraper 2.knwf (10.3 KB)


Incredible help!

Can the “value” column be converted to Type Double in the XPath node? I changed it in the XPath option but got an error.

The easiest way is to add a String to Number node after the String Cleaner. Please mark @tomljh’s post as the solution.

Identifying the problem and solving the main issue were your contributions. I only made a little improvement. :slightly_smiling_face:


Thanks, but it is still failing because some rows contain strings with a % sign, which cannot be converted to double. This happens for three pages in particular: S&P 500 Earnings Growth Rate by Quarter - Multpl, S&P 500 Sales Per Share Growth by Quarter - Multpl, and S&P 500 Earnings Yield by Month - Multpl.

Strip out the % signs with the String Manipulator node, then convert.
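The same "strip, then convert" idea can be sketched in Python for anyone post-processing the exported CSV outside KNIME. This is only an illustration: `to_number` is a hypothetical helper, the sample strings are invented, and removing thousands separators is an extra assumption that the thread does not mention.

```python
def to_number(text):
    """Convert a cell such as '28.50' or '-12.3%' to a float.

    The % sign is simply dropped (the value is NOT divided by 100),
    mirroring the strip-then-convert approach described above.
    Returns None for empty or non-numeric cells instead of raising.
    """
    # Assumption: commas are thousands separators and can be removed.
    cleaned = text.strip().replace("%", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


# Invented sample values, not taken from the site.
for raw in ["28.50", "-12.3%", " 1,234.5 ", "n/a", ""]:
    print(repr(raw), "->", to_number(raw))
```

Returning `None` rather than failing keeps one bad cell from stopping the whole conversion, which is roughly what a missing value would look like in a KNIME table.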

I actually used the setting in the “String Cleaner” node:
[image]

Better yet! It’s a relatively new node and I keep forgetting about it.



Hi @rfeigel, the Date output from this workflow differs from the typical YYYY-MM-DD format. I tried to use the String to Date&Time node with the options below selected:

But it did not work. Could you help me get the Date column into a format such as “2007-04-12”?

Hi @tone_n_tune, the String to Date&Time node requires that you specify the input format of the data as it appears in the string to be converted. (Dates themselves have no “format” as such, but when displayed in KNIME they are rendered in the yyyy-MM-dd format that you mentioned.)

In your screenshot, KNIME is telling us that the first cell contains the string “Jul 18, 2024”. Looking back at the earlier response from @rfeigel, this corresponds to the dates shown in the string “Date” column.

image

Therefore, the format that you need to specify is
MMM d, yyyy

A single d is used here because it accepts both one- and two-digit days, whereas dd would require every day value to have two digits, e.g. 01, 02, … 10.

This will then tell KNIME how to interpret the data, and it should then be able to convert it to a date.

If you want the date to be output in a specific format, other than the default, you would need to convert it back to a string using Date&Time to String, specifying the output format that you require.
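The parse-then-reformat idea is not specific to KNIME. As a rough Python analogue (a sketch, not the KNIME configuration itself), the pattern `MMM d, yyyy` corresponds to the `strptime` format `%b %d, %Y`. Two assumptions to note: `to_iso_date` is a hypothetical helper name, and `%b` matches English month abbreviations only under the default C/English locale.

```python
from datetime import datetime


def to_iso_date(text):
    """Parse a date string like 'Jul 18, 2024' and return it as 'YYYY-MM-DD'.

    Step 1 describes the INPUT format so the string can be interpreted;
    step 2 renders the resulting date in the desired OUTPUT format.
    """
    # %b = abbreviated month name, %d = day of month (strptime accepts
    # one or two digits here), %Y = four-digit year.
    parsed = datetime.strptime(text.strip(), "%b %d, %Y").date()
    return parsed.isoformat()


print(to_iso_date("Jul 18, 2024"))
print(to_iso_date("Jul 1, 2024"))
```

Unlike the KNIME pattern, Python's `%d` accepts single-digit days when parsing, so there is no separate d versus dd choice on the input side.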

  • If this resolves the new question, please leave @rfeigel’s response marked as the solution as that appears to solve the actual question posed on this thread, and only one post may be marked as the solution.

Ideally, once a question has been resolved, a new question that is not specifically part of the original one is better asked in a new topic. In this case the original question was only concerned with how to pull data without an API, and didn’t say anything about requiring a particular date format.

This is for several reasons:

  • People don’t always look at “solved” questions, so you may not get an answer as quickly.
  • There may be people who could quickly answer the new question (date conversions) who would not know anything about “how to pull data from a website that does not have an API” and so you reduce the potential responders.
  • Somebody else in future with a question about date formatting probably won’t be able to find this answer so easily.

Thanks! I have also found this example workflow helpful: KNIME_Workflow With flexible DATE format mask – KNIME Community Hub
