Using IP address ranges for selection/exclusion of IP addresses

kludikovsky · September 30, 2022, 12:02pm

I am looking for a solution that can select (or exclude) IP addresses against IP ranges, subnet masks, or CIDR notation (a.b.c.d/x).
The solution may be a node, a component, an R or Python routine, or anything else.

I’d like to sort out individual IP’s against IP ranges (as well as IP’s).
Typical Log-File and Firewall processing.
Preferably for IPv4 and IPv6 but not necessary.

I have so far not found anything relating to that.

Any solution, hint or guidance is welcome.
Thank you.

bruno29a · September 30, 2022, 12:13pm

Hi @kludikovsky , can you please confirm what the input data would be?

If you want to compare some IPs against a list of IPs, you can use the Joiner node. An Inner join will give you only the IPs that exist in both lists, and a Left join (or Right join, depending on which list is first or second) will give you both the set of IPs that exist in the other list and those that do not. You can then split or filter accordingly.

If you want some manual selection, you can use the Table View node for interaction.

Without understanding your data and your use case fully, I can’t put a demo together.

kludikovsky · September 30, 2022, 1:08pm

IP subnet masks (also known as CIDR) look like this (as an example):

10.233.99.77/26

and should be compared against an IP (eg.)
10.233.99.67
to be included, while
10.233.99.61 and 10.233.99.128 not to be included.

The netmask 10.233.99.77/26
includes all IP’s from 10.233.99.64 to 10.233.99.127

This is just an example. The /x can be any number from 0 to 32 (for IPv4). For IPv6 there is a similar scheme.
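For readers who want to check the example above, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `in_subnet` is made up for illustration; `strict=False` is needed because 10.233.99.77 is a host address inside the subnet, not the network address itself.

```python
import ipaddress

# strict=False accepts a CIDR string whose host bits are set
# (10.233.99.77 is not the first address of the /26 block).
network = ipaddress.ip_network("10.233.99.77/26", strict=False)

first_address = network.network_address    # start of the range
last_address = network.broadcast_address   # end of the range


def in_subnet(ip: str, cidr: str) -> bool:
    """Hypothetical helper: True if `ip` lies inside the subnet `cidr`."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr, strict=False)


for candidate in ("10.233.99.67", "10.233.99.61", "10.233.99.128"):
    print(candidate, in_subnet(candidate, "10.233.99.77/26"))
```

The same calls accept IPv6 strings such as `2001:200::/37`, so the sketch is not limited to IPv4.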

bruno29a · September 30, 2022, 1:16pm

Hi @kludikovsky , not coming from the world you are talking about, I have a few questions:

1:

Is the range of 10.233.99.64 to 10.233.99.127 given, or does the workflow have to determine it from the netmask? If so, how is it determined? And if it’s given, do we get all the IPs within the range, or only the limits of the range (start and end)? (That’s why it would be best if you could show what the input data looks like.)

Can you explain what is the impact of the /x? Would whatever needs to be done be different if let’s say we had /32 instead of the /26?

kludikovsky · September 30, 2022, 1:37pm

Please have a look at the link given above

there is everything described.
As a little playground try

One example of the data (forget that this is JSON - i shall be able to convert this and there are other forms like csv, etc.) can be found at
https://developers.google.com/static/search/apis/ipranges/googlebot.json

bruno29a · September 30, 2022, 1:38pm

So… what’s your input data?

kludikovsky · September 30, 2022, 1:51pm

Data in the form of (as strings)

for IPv4

0.0.0.0/24
1.0.4.0/22
1.0.64.0/18
1.0.128.0/17
1.1.1.0/24
1.1.64.0/19
1.1.96.0/20
1.1.112.0/21
1.1.120.0/22
1.1.124.0/23
1.1.126.0/24
1.1.128.0/17
1.2.2.0/24
1.2.4.0/24
1.2.128.0/17
1.4.128.0/17
1.5.0.0/16
1.6.0.0/17
1.6.128.0/21
1.6.136.0/24
1.6.137.0/24
1.6.138.0/23
1.6.140.0/22
1.6.144.0/20
1.6.160.0/19
1.6.192.0/20
1.6.208.0/21
1.6.216.0/23
1.6.218.0/24
1.6.219.0/24

for IPv6

2001:200::/37
2001:200:800::/40
2001:200:900::/40
2001:200:a00::/39
2001:200:c00::/39
2001:200:e00::/40
2001:200:f00::/40
2001:200:1000::/36
2001:200:2000::/35
2001:200:4000::/34
2001:200:8000::/34
2001:200:c000::/35
2001:200:e000::/35
2001:218::/35
2001:218:2000::/39
2001:218:2200::/40
2001:218:2300::/40
2001:218:2400::/38
2001:218:2800::/37
2001:218:3000::/46
2001:218:3004::/48
2001:218:3005::/48
2001:218:3006::/47
2001:218:3008::/45
2001:218:3010::/44
2001:218:3020::/43
2001:218:3040::/42
2001:218:3080::/41
2001:218:3100::/40
2001:218:3200::/39
2001:218:3400::/38
2001:218:3800::/37
2001:218:4000::/34
2001:218:8000::/33
2001:240::/32
2001:250::/47
2001:250:2::/48
2001:250:3::/48
2001:250:4::/48
2001:250:5::/48
2001:250:6::/47
2001:250:8::/45
2001:250:10::/44

But you need to understand the CIDR/subnet mask to handle this.

My question was whether something is already available for IP subnet mask matching.

bruno29a · September 30, 2022, 6:17pm

Thanks for the additional info @kludikovsky .

I don’t think there’s anything available for subnetmask matching. And indeed, it would require some understanding of the CIDR/subnet mask to build something.

takbb · September 30, 2022, 10:18pm

Hi @kludikovsky , I don’t know if this will be of assistance, but a while back, in response to another forum question, I put a component on the hub which will generate all IP addresses within a given range. It only handles IPv4 addresses. I wouldn’t recommend it for huge ranges, but if you are trying to determine whether a particular address is within a given subnet, it might be adaptable for that, e.g. by generating a range and then seeing if an address can be joined to any of the generated values.

See this forum thread for details of

kludikovsky · October 1, 2022, 12:59pm

@bruno29a thanks for your help. I was asking because I haven’t found anything on the hub or in the documentation, and as I am new to KNIME, I thought maybe there are other sources.

And yes, it needs some background in subnet masking, especially as in my current challenge I need to mask more than 60.000 records. I have already found sources for the process, which I will most probably implement as an R node (there I don’t need loops for most of the operations).

kludikovsky · October 1, 2022, 1:02pm

@takbb thank you for your hint. I had found this component before. But as I will need to run tens of subnets against more than 60.000 records, the number of generated addresses would blow up. So, as I already mentioned to @bruno29a, I will try to write an R node to solve the issue.

badger101 · October 1, 2022, 1:40pm

If you’re willing to have a go at Go, you can type in “golang subnet” on Stackoverflow and you’ll see many recommendations for the task.

Best wishes on your project. Adios!

takbb · October 1, 2022, 11:12pm

Hi @kludikovsky,

I’m not sure exactly what form your data flow is likely to take and couldn’t decide if you had a list of individual ip addresses to compare against a subnet mask, or the other way round.

I thought I’d have a play though just to see how such a tool could be written with java, and as a proof of concept this is a small java snippet within a component. It might not be exactly what you need right now, but maybe this can be adapted. If you could give an example of how you envisage the data to be supplied (list of subnets vs list of addresses) - then I (or somebody else) can maybe rework it

I tried to upload the component to the hub but hit a server error, so for now I’m just including it here in a demo workflow. Give it a try and see if the basic calculation works (it uses a very simple piece of java with an apache library for doing the subnet stuff). Then if you want it extended to work in a slightly different way that may be possible.

Test ip address in subnet.knwf (3.4 MB)

I have yet to embark on IP v6. One step at a time…

kludikovsky · October 2, 2022, 9:04am

@takbb Applause. This is going in the right direction.

Here is the background of what I want to do:
I want to filter from log files certain IP addresses that fall within address ranges of typical crawlers, known IPs, etc.
Those IPs and subnets (the Google link above lists the Googlebot ranges and can be downloaded; others can be downloaded as well or included manually) will be loaded into tables.
The log IPs shall then be matched against those subnets/IPs and marked, so that in a next step they can be included or excluded (depending on the purpose) for further processing.

With my small log set of about 60.000 records and tens or possibly hundreds of subnets, this will already be quite a task for the CPU’s (doesn’t matter for the beginning).

What I have found somewhere (as usual, I can’t find it again) is a conversion of the subnet start address to decimal to get the beginning of the range, and then adding the size of the range to it to get the end address. The process is then reduced to a comparison of (pseudo code)
if subnet_start <= IP <= subnet_end then "its in" else "its out"

By sorting the data-sets upfront this can be performed in a one pass, instead of every check going through all possibilities.
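The range comparison and the sorted single pass described above could look roughly like the following Python sketch. It is only one possible reading of the idea, not a tested solution for real log data: the function names `subnet_range`, `merge_ranges` and `mark_ips` are invented for illustration, overlapping subnets are merged first so that one forward sweep is enough, and the sketch assumes IPv4 only (IPv4 and IPv6 integers must not be mixed in one list).

```python
import ipaddress


def subnet_range(cidr):
    """Return (start, end) of a subnet as integers."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)


def merge_ranges(ranges):
    """Sort ranges and merge overlapping or adjacent ones."""
    merged = []
    for low, high in sorted(ranges):
        if merged and low <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], high)
        else:
            merged.append([low, high])
    return merged


def mark_ips(ips, cidrs):
    """Map each IP string to True ("it's in") or False ("it's out")."""
    ranges = merge_ranges(subnet_range(c) for c in cidrs)
    ordered = sorted(ips, key=lambda ip: int(ipaddress.ip_address(ip)))
    result = {}
    i = 0  # index of the first range that may still contain the current IP
    for ip in ordered:
        value = int(ipaddress.ip_address(ip))
        # Skip ranges that end before this IP; they cannot match later IPs either.
        while i < len(ranges) and ranges[i][1] < value:
            i += 1
        result[ip] = i < len(ranges) and ranges[i][0] <= value
    return result


log_ips = ["10.233.99.67", "10.233.99.61", "10.233.99.128", "1.1.1.5"]
subnets = ["10.233.99.77/26", "1.1.1.0/24"]
print(mark_ips(log_ips, subnets))
```

After the two sorts, each IP and each range is visited once, so the cost is dominated by sorting rather than by comparing every IP with every subnet.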

In the first step IPv4 would be ok, upgradeable to IPv6 would be nice.

takbb · October 2, 2022, 11:49am

Hi @kludikovsky

Thanks for the feedback. I haven’t tried to download and convert the list of IP subnets but instead have just created small tables of sample data. One is a list of subnets and the other a list of IP addresses.

I have taken ideas from your last post and combined them with my component idea to produce a new component, “IP Address to Range”. The idea is that, given a single IP address or subnet, it generates the Low and High values of the range. (If it is given just a single IP address, the generated low and high values are the same.) It returns both low and high as both dot-separated and Long values.
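As a rough illustration of what such a conversion produces, here is a Python sketch. It is not the component's implementation (which is described above as a Java snippet using an Apache library); the function name `address_to_range` and the exact output keys are assumptions, modelled on column names such as "sub_Low Address Long" used later in this post.

```python
import ipaddress


def address_to_range(value, prefix=""):
    """Hypothetical stand-in: low/high of an address or subnet, as text and as integers."""
    # A bare address such as "10.233.99.67" is treated as a one-address network.
    net = ipaddress.ip_network(value, strict=False)
    low, high = net.network_address, net.broadcast_address
    return {
        prefix + "Low Address": str(low),
        prefix + "High Address": str(high),
        prefix + "Low Address Long": int(low),
        prefix + "High Address Long": int(high),
    }


print(address_to_range("10.233.99.67", prefix="ip_"))
print(address_to_range("10.233.99.77/26", prefix="sub_"))
```

For a single address the low and high values come out identical, which is what lets addresses and subnets be compared with the same "between" condition.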

For single IP addresses:

For subnets:

The configuration for the IP Address to Range component, is this:

So you specify the column containing the address/subnet to be converted, and also give it a column-name prefix for the generated columns. This prefix is useful for identifying the columns to compare later.

In the attached workflow, I’ve then compared the two outputs as follows, using a Cross Joiner which will compare every row with every row. This is used because the standard Joiner node doesn’t allow us to perform “between” joins. However this can work surprisingly well even for (relatively) large data sets but obviously there will be an upper limit on the size of lists you’d want to compare this way, as you do get the product of all the rows from the two inputs.

To keep it small, you could potentially put the comparison into a “chunked loop” so that it only compares a small number of ip addresses with the full list of subnets at any one time, to keep memory usage down.

An alternative to the cross joiner, if you have Python 3 installed with your KNIME environment, is to use my “PandaSQL join” component. This takes the two data sets and uses SQL with a between statement to join them. You do need to know some basic sql. You join T1 (top input) with T2 (bottom input)

You will also need the following python packages installed:
numpy,
pandas,
pandasql
I think KNIME requires numpy and pandas anyway, so probably pandasql is the only additional package needed if you already have python.

With my sample data set, the end result is as follows:

If you modified the SQL in the PandaSQL component to:

select t1."IP Address", t2."Subnet Address"
from t1
left join t2 on t1."ip_Low Address Long" 
between t2."sub_Low Address Long" and t2."sub_High Address Long"

you could have the following output…
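To try the same left join with BETWEEN outside KNIME, the following sketch uses Python's built-in `sqlite3` module instead of the PandaSQL component. The table and column names follow the SQL above; the sample addresses are made up, and the sketch assumes IPv4 only, because SQLite integers are 64-bit and cannot hold 128-bit IPv6 values.

```python
import ipaddress
import sqlite3


def to_long(ip):
    """Integer value of an IPv4 address string."""
    return int(ipaddress.ip_address(ip))


conn = sqlite3.connect(":memory:")
conn.execute('create table t1 ("IP Address" text, "ip_Low Address Long" integer)')
conn.execute(
    'create table t2 ("Subnet Address" text, '
    '"sub_Low Address Long" integer, "sub_High Address Long" integer)'
)

# Made-up sample data: three log addresses and two subnets.
for ip in ("10.233.99.67", "10.233.99.128", "1.1.1.5"):
    conn.execute("insert into t1 values (?, ?)", (ip, to_long(ip)))
for cidr in ("10.233.99.77/26", "1.1.1.0/24"):
    net = ipaddress.ip_network(cidr, strict=False)
    conn.execute(
        "insert into t2 values (?, ?, ?)",
        (cidr, int(net.network_address), int(net.broadcast_address)),
    )

rows = conn.execute(
    '''select t1."IP Address", t2."Subnet Address"
       from t1
       left join t2 on t1."ip_Low Address Long"
       between t2."sub_Low Address Long" and t2."sub_High Address Long"'''
).fetchall()

for ip, subnet in rows:
    print(ip, subnet)  # subnet is None when no range matched
```

Because it is a left join, addresses that match no subnet are kept, with an empty subnet column, which makes them easy to filter in a later step.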

Here is the workflow:
Test ip address in subnet.knwf (3.4 MB)

Would something like that work for you?

The components I have used are:

kludikovsky · October 2, 2022, 1:12pm

@takbb this sounds excellent.
I have Python installed and I am SQL aware.
I will try it and I think this will solve my challenge.
Thank you very much!
