I am wondering what influence the fingerprint length has. By default, you set the number of bits to 1024, and I guess there is a reason. However, I ran some tests comparing 1024-bit and 512-bit fingerprints, and I do not see a big difference (similarity tests and QSAR). I do machine learning, and reducing the number of bits makes my models faster.
I am no computational modeller, and one of them would be best placed to answer this, but from my understanding each bit helps encode a part of the molecule: atom environments along a path with Morgan fingerprints, for instance, or the number and type of functional groups with MACCS fingerprints.
I believe that if you reduce the number of bits, you risk two different atom environments being encoded by the same bit. Two molecules could then be scored as more similar than they actually are.
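To make the collision effect concrete, here is a minimal sketch in plain Python. It assumes that a fingerprint is built by folding integer pattern identifiers into a fixed number of bits with a modulo; the identifiers and the helper names `fold` and `tanimoto` are made up for illustration and are not taken from any toolkit.

```python
def fold(pattern_ids, n_bits):
    """Map each integer pattern identifier to a bit position (assumed: simple modulo folding)."""
    return {pid % n_bits for pid in pattern_ids}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit positions."""
    union = bits_a | bits_b
    if not union:
        return 0.0
    return len(bits_a & bits_b) / len(union)


# Hypothetical pattern identifiers for two molecules; only pattern 3 is truly shared.
mol_a = {3, 700, 1500}
mol_b = {3, 188, 2000}

sim_1024 = tanimoto(fold(mol_a, 1024), fold(mol_b, 1024))
sim_512 = tanimoto(fold(mol_a, 512), fold(mol_b, 512))

print(sim_1024)  # 0.2: one shared bit out of five
print(sim_512)   # 0.5: 700 folds onto 188, so a second bit looks shared
```

With 512 bits, pattern 700 of the first molecule lands on the same bit as pattern 188 of the second, so the two look more similar than their patterns justify. Whether this matters in practice depends on how often such collisions occur in your data.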
As for the optimal and most efficient number of bits, I have no idea. I would guess 1024, since that seems to be the most commonly used.
For small molecules, 512-1024 bits is usually enough. Which one to use depends on your specific case... You would almost never go below 512, though.
You can test it by taking one of the larger molecules in your set and checking how many bits are set. If you use 512 bits and 240 bits are set, then 1024 won't do more. But if you used, say, only a 128-bit FP, then it wouldn't be enough.
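A rough sketch of that check in plain Python, assuming again that patterns are folded into bits with a modulo. The 60 pattern identifiers are random stand-ins for the patterns of a large molecule, and `fold_report` is a hypothetical helper, not a toolkit function.

```python
import random


def fold_report(pattern_ids, n_bits):
    """Count set bits, bit density and patterns lost to collisions for one length."""
    distinct = set(pattern_ids)
    on_bits = {pid % n_bits for pid in distinct}
    return {
        "n_bits": n_bits,
        "bits_set": len(on_bits),
        "density": len(on_bits) / n_bits,
        "collisions": len(distinct) - len(on_bits),
    }


# Stand-in for the patterns found in one of the larger molecules (assumption).
rng = random.Random(0)
largest_molecule_patterns = rng.sample(range(2**32), 60)

reports = [fold_report(largest_molecule_patterns, n) for n in (128, 512, 1024)]
for report in reports:
    print(report)
```

If the number of set bits stops growing when you double the length, the shorter fingerprint is already separating the patterns of that molecule. With a real toolkit you would count the on-bits of the actual fingerprint object instead of using random identifiers.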
So, if I understand correctly, the appropriate fingerprint length is related (among other things) to the size or complexity of the molecule, because the more atoms a molecule has, the more substructures there are. Am I right?
I wouldn't say "substructures" is the right word; "patterns" is closer!
Fingerprints generally don't do substructure analysis, but usually record the connectivity of each atom in turn (the Morgan approach). The algorithm takes one atom and looks at the environment directly around it, i.e. which atoms are connected to it (this is path 1). Each of those atoms then has its own environment assessed in the same way (this is path 2), and this continues up to the path length that is defined. The resulting long, complex pattern is then written down in bits to describe that ONE atom's environment. The algorithm then moves on to the next atom, and so on. So obviously, the more heavy atoms there are, the more patterns it has to write down.
Also note that each bit in the fingerprint represents one of the patterns found. The more compounds there are, and the more complex they are, the more distinct patterns will be found and thus the more bits will be used.
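The atom-by-atom procedure described above can be sketched as a toy in plain Python. This is only an illustration of the idea, not the actual Morgan implementation of any toolkit: it ignores bond orders, charges and duplicate-environment removal, it uses `zlib.crc32` as a stand-in hash, and the names `environment_ids` and `to_bits` are invented here.

```python
import zlib


def environment_ids(atoms, bonds, radius):
    """Collect one identifier per atom environment, growing outwards `radius` times.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Step 0: each atom is described by its element only.
    current = [zlib.crc32(symbol.encode()) for symbol in atoms]
    found = set(current)

    for _ in range(radius):
        updated = []
        for i in range(len(atoms)):
            # Combine the atom's own identifier with those of its neighbours.
            env = (current[i], tuple(sorted(current[j] for j in neighbours[i])))
            updated.append(zlib.crc32(repr(env).encode()))
        current = updated
        found.update(current)
    return found


def to_bits(ids, n_bits):
    """Fold pattern identifiers into bit positions (assumed: simple modulo)."""
    return sorted({i % n_bits for i in ids})


# Heavy atoms only: ethanol is C-C-O, propane is C-C-C.
ethanol = environment_ids(["C", "C", "O"], [(0, 1), (1, 2)], radius=2)
propane = environment_ids(["C", "C", "C"], [(0, 1), (1, 2)], radius=2)

print(len(ethanol), to_bits(ethanol, 1024))
print(len(propane), to_bits(propane, 1024))
```

Propane gives fewer distinct patterns than ethanol because its two end carbons have identical environments, which matches the point that the number of bits used depends on how many different patterns the molecules contain and not only on their size.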