For other half-baked ideas, see my ideas tag.
I’m an obstinate log watcher. Watching web server logfiles in particular gives me a fascinating insight into how the bottom-feeders on the internet work, comment spammers, email harvesters, crap search engines and the like.
As a pretty random example, a single spammer (or more likely “illegal spam botnet”) just tried to fill in the comment form on one particular website I run 26 times in roughly 90 minutes. If you still have any myths about how sophisticated spammers are, read on.
Myth: spammers promote a particular website. Reality: spammers are still able to register huge numbers of random domains, and use very complex multi-step redirection.
Myth: spammers must operate from a limited set of IP addresses. Reality: spammers have access to virtually unlimited numbers of IP addresses.
Myth: each attack comes from a single IP address. Reality: attacks jump from IP addresses separated around the world, and those attacks are coordinated and look just like a single multi-step transaction, complete with correct cookies which must be passed between the hosts using a higher “back end” layer.
Myth: spambots don’t run Javascript, download images or solve captchas. Reality: …
The jury is still out on the last one. Certainly it’s not common, but a significant subset of comment spam does appear to come from real browsers, which run Javascript, download images and solve captchas. However I believe much or all of this must come from real people operating from sweatshops in countries with very low wages. That’s hard to tell just from looking at logfiles.
Each of the 26 completed transactions I saw involved multiple HTTP requests, and every single HTTP request came from a different IP address. But each completed transaction had a consistent cookie. In some cases the IP addresses were separated by half the earth, but HTTP requests followed each other in sub-second, indicating a sophisticated second level operation coordinating it all. Each request contained URLs for 4 websites, generated using random characters, and only some of these sites resolve.
So on to the half-baked idea.
Why don’t we have a proper, distributed reputation system for IP addresses?
A spammer can’t source an HTTP request from just any IP address, so they need to take over some grandma’s Windows PC, or someone’s web server, or persuade people to route some bogus AS. Every time an honest website owner (like me!) sees a bad IP, they register it.
Of course, spammers themselves will try to game the system, but they will do so from their own random IP addresses. We need to make sure that their “votes” count for less, and a reputation system should be able to decide this (eg. bad IP votes for bad IP? those votes count negatively).
If grandma tries to post a good comment, her IP may well cause that comment to be rejected. Good thing! She needs to clean up her (Windows) PC.
And what about ISPs who rotate IP addresses between good and bad customers? Those ISPs need to police their users and make sure they clean up their Windows PCs, or force the users on to better operating systems that don’t allow these exploits.
Note There are people classifying IPs now, eg. project honeypot and stop forum spam, but these guys don’t implement a reputation system and in some cases have nasty licensing terms which make the data that we provide for free into proprietary databases. No thanks.
Half-baked ideas: reputation system for IP addresses
For other half-baked ideas, see my ideas tag.
I’m an obstinate log watcher. Watching web server logfiles in particular gives me a fascinating insight into how the bottom-feeders on the internet work, comment spammers, email harvesters, crap search engines and the like.
As a pretty random example, a single spammer (or more likely “illegal spam botnet”) just tried to fill in the comment form on one particular website I run 26 times in roughly 90 minutes. If you still have any myths about how sophisticated spammers are, read on.
Myth: spammers promote a particular website. Reality: spammers are still able to register huge numbers of random domains, and use very complex multi-step redirection.
Myth: spammers must operate from a limited set of IP addresses. Reality: spammers have access to virtually unlimited numbers of IP addresses.
Myth: each attack comes from a single IP address. Reality: attacks jump from IP addresses separated around the world, and those attacks are coordinated and look just like a single multi-step transaction, complete with correct cookies which must be passed between the hosts using a higher “back end” layer.
Myth: spambots don’t run Javascript, download images or solve captchas. Reality: …
The jury is still out on the last one. Certainly it’s not common, but a significant subset of comment spam does appear to come from real browsers, which run Javascript, download images and solve captchas. However I believe much or all of this must come from real people operating from sweatshops in countries with very low wages. That’s hard to tell just from looking at logfiles.
Each of the 26 completed transactions I saw involved multiple HTTP requests, and every single HTTP request came from a different IP address. But each completed transaction had a consistent cookie. In some cases the IP addresses were separated by half the earth, but HTTP requests followed each other in sub-second, indicating a sophisticated second level operation coordinating it all. Each request contained URLs for 4 websites, generated using random characters, and only some of these sites resolve.
So on to the half-baked idea.
Why don’t we have a proper, distributed reputation system for IP addresses?
A spammer can’t source an HTTP request from just any IP address, so they need to take over some grandma’s Windows PC, or someone’s web server, or persuade people to route some bogus AS. Every time an honest website owner (like me!) sees a bad IP, they register it.
Of course, spammers themselves will try to game the system, but they will do so from their own random IP addresses. We need to make sure that their “votes” count for less, and a reputation system should be able to decide this (eg. bad IP votes for bad IP? those votes count negatively).
If grandma tries to post a good comment, her IP may well cause that comment to be rejected. Good thing! She needs to clean up her (Windows) PC.
And what about ISPs who rotate IP addresses between good and bad customers? Those ISPs need to police their users and make sure they clean up their Windows PCs, or force the users on to better operating systems that don’t allow these exploits.
Note There are people classifying IPs now, eg. project honeypot and stop forum spam, but these guys don’t implement a reputation system and in some cases have nasty licensing terms which make the data that we provide for free into proprietary databases. No thanks.
5 Comments
Filed under Uncategorized
Tagged as comment spam, forum spam, HTTP, ideas, IP address, logfiles, reputation system, spam, website