Half-baked ideas: reputation system for IP addresses

For other half-baked ideas, see my ideas tag.

I’m an obstinate log watcher. Watching web server logfiles in particular gives me a fascinating insight into how the bottom-feeders on the internet work: comment spammers, email harvesters, crap search engines and the like.

As a pretty random example, a single spammer (or more likely an “illegal spam botnet”) just tried, 26 times in roughly 90 minutes, to fill in the comment form on one particular website I run. If you still believe any myths about how sophisticated spammers are, read on.

Myth: spammers promote a particular website. Reality: spammers are still able to register huge numbers of random domains, and use very complex multi-step redirection.

Myth: spammers must operate from a limited set of IP addresses. Reality: spammers have access to virtually unlimited numbers of IP addresses.

Myth: each attack comes from a single IP address. Reality: attacks jump between IP addresses scattered around the world, and those attacks are coordinated and look just like a single multi-step transaction, complete with correct cookies which must be passed between the hosts using a higher “back end” layer.

Myth: spambots don’t run Javascript, download images or solve captchas. Reality: …

The jury is still out on the last one. Certainly it’s not common, but a significant subset of comment spam does appear to come from real browsers, which run Javascript, download images and solve captchas. However I believe much or all of this must come from real people operating from sweatshops in countries with very low wages. That’s hard to tell just from looking at logfiles.

Each of the 26 completed transactions I saw involved multiple HTTP requests, and every single HTTP request came from a different IP address. But each completed transaction had a consistent cookie. In some cases the IP addresses were separated by half the earth, yet the HTTP requests followed each other within a fraction of a second, indicating a sophisticated second-level operation coordinating it all. Each request contained URLs for 4 websites, generated using random characters, and only some of these domains resolve.
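To make the pattern concrete, here is a minimal sketch of how one might spot it in a log: group requests by cookie and flag any cookie that turns up from several different IP addresses. It assumes the log has already been parsed into (timestamp, IP, cookie) tuples; the function name, the threshold and the sample records are all made up for illustration and are not taken from my actual logs.

```python
from collections import defaultdict


def sessions_spanning_many_ips(requests, min_ips=3):
    """Return cookies that were presented from at least `min_ips` distinct IPs.

    `requests` is an iterable of (timestamp, ip, cookie) tuples.
    The tuple layout and the threshold are assumptions for this sketch.
    """
    ips_by_cookie = defaultdict(set)
    for _timestamp, ip, cookie in requests:
        if cookie:  # ignore requests that carried no cookie
            ips_by_cookie[cookie].add(ip)
    return {
        cookie: sorted(ips)
        for cookie, ips in ips_by_cookie.items()
        if len(ips) >= min_ips
    }


# Hypothetical sample data using documentation-only IP ranges.
sample_requests = [
    (0.0, "192.0.2.10", "sess-a"),
    (0.4, "198.51.100.7", "sess-a"),
    (0.9, "203.0.113.55", "sess-a"),
    (5.0, "192.0.2.99", "sess-b"),
    (6.0, "192.0.2.99", "sess-b"),
]

suspicious = sessions_spanning_many_ips(sample_requests)
print(suspicious)
```

A real visitor on a mobile network can legitimately change IP mid-session, so a check like this is a hint rather than proof.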

So on to the half-baked idea.

Why don’t we have a proper, distributed reputation system for IP addresses?

A spammer can’t source an HTTP request from just any IP address, so they need to take over some grandma’s Windows PC, or someone’s web server, or persuade people to route some bogus AS. Every time an honest website owner (like me!) sees a bad IP, they register it.

Of course, spammers themselves will try to game the system, but they will do so from their own random IP addresses. We need to make sure that their “votes” count for less, and a reputation system should be able to decide this (eg. bad IP votes for bad IP? those votes count negatively).
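A very rough sketch of the weighting idea, and nothing like a worked-out design: each report that an IP is bad is weighted by the reporter’s own reputation, so a reporter with a negative reputation pushes the score the other way. The reporter names, the reputation values and the default weight for unknown reporters are all invented for this example.

```python
def badness_scores(reports, reporter_reputation, default_reputation=0.1):
    """Sum reputation-weighted "this IP is bad" votes per reported IP.

    `reports` is an iterable of (reporter, reported_ip) pairs.
    `reporter_reputation` maps a reporter to a weight; negative weights
    mean the reporter is itself considered bad, so its votes count
    negatively. Unknown reporters get a small assumed default weight.
    """
    scores = {}
    for reporter, reported_ip in reports:
        weight = reporter_reputation.get(reporter, default_reputation)
        scores[reported_ip] = scores.get(reported_ip, 0.0) + weight
    return scores


# Hypothetical reporters and reputations.
reputation = {
    "honest-site-1": 1.0,
    "honest-site-2": 0.8,
    "botnet-node": -0.5,
}

reports = [
    ("honest-site-1", "203.0.113.55"),  # two honest sites report the same IP
    ("honest-site-2", "203.0.113.55"),
    ("botnet-node", "198.51.100.7"),    # a bad reporter tries to smear a good IP
]

scores = badness_scores(reports, reputation)
print(scores)
```

How the reporters’ own reputations get computed in the first place is the hard part, and this sketch simply takes them as given.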

If grandma tries to post a good comment, her IP may well cause that comment to be rejected. Good thing! She needs to clean up her (Windows) PC.

And what about ISPs who rotate IP addresses between good and bad customers? Those ISPs need to police their users and make sure they clean up their Windows PCs, or force the users on to better operating systems that don’t allow these exploits.

Note: There are people classifying IPs now, e.g. Project Honey Pot and Stop Forum Spam, but these guys don’t implement a reputation system and in some cases have nasty licensing terms which turn the data that we provide for free into proprietary databases. No thanks.


5 responses to “Half-baked ideas: reputation system for IP addresses”

  1. For the captchas, I have heard that they use porn sites. People solving captchas on porn sites or other sites (illegal downloads, etc.) are in fact solving captchas for the spammers.

    But the sweatshops theory is likely to be true too.

    • rich

      Apply some critical thinking to this: The spammers would need to be also operating porn/downloading sites with approximately the same or greater volume than their spamming operations. They would also need to match up requests to spam with incoming requests for porn. This doesn’t make sense given what I see in the log files. So I would add this to yet another myth. Maybe some researcher said it, and maybe it was picked up on slashdot, but it doesn’t match my observations.

      “But the sweatshops theory is likely to be true too.”

      This is far more likely.

  2. Mads

    That is a good idea.

    With postgrey on my mail server I already kind of create my own reputation system and end up with a whitelist and a blacklist. There is no reason (except privacy …) that I couldn’t share my perceived reputation with others.

    Others should of course consider _my_ reputation. So the reputation of my reported reputations should depend on my reputation. So this boils down to being a kind of social-network problem. Reputations with reputations could perhaps be combined with some kind of Bayesian filtering.

    I wonder how many entries a typical site would have to store in its own white- and black-list?

  3. “every single HTTP request came from a different IP address”.

    Would using Tor give that effect?

  4. rich

    Malcolm: Yes, Tor does jump around like that.

    However, Tor also allows you to get a list of all exit nodes that can reach your IP, and I already had those blocked.
