Half-baked idea: Content-addressable web proxy

For more half-baked ideas, see my ideas tag

There are several situations where you want to fetch some content and don’t particularly care which precise source it comes from:

  1. Downloading packages from Linux distro mirrors.
  2. Downloading git commits.
  3. Grabbing a bittorrent data block.

My proposal (which surely has been done??) is that clients can supply the hash of the file they want when connecting to a proxy; something like:

GET http://example.com/foo HTTP/1.1
Content-Hash: sha256 b32683017c9530[etc]

The proxy is entitled to return any object in its cache that has the desired hash. If it doesn’t have any such object, it fetches it from the URI in the usual way. We’d have to insist that only cryptographically strong hashes are allowed, both to prevent the client getting the wrong data and to stop clients fishing for unauthorized files in the cache.
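
To make the lookup rule concrete, here is a minimal sketch of the proxy side in Python. Everything in it is an assumption for illustration: the `HashCache` class, the injected `fetch` callable that stands in for the real upstream request, and the choice of which algorithms count as strong. It also checks the fetched body against the requested hash before caching it, which the idea above implies but does not spell out.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

# Assumption: which algorithms the proxy treats as "cryptographically strong".
STRONG_ALGORITHMS = {"sha256", "sha512"}


class HashCache:
    """Hypothetical content-addressed cache sitting inside the proxy."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # stand-in for the usual upstream fetch
        self._by_hash: Dict[Tuple[str, str], bytes] = {}

    def get(self, url: str, content_hash: Optional[str] = None) -> bytes:
        # No Content-Hash header: behave like an ordinary proxy.
        if content_hash is None:
            return self._fetch(url)

        algo, _, digest = content_hash.partition(" ")
        algo, digest = algo.lower(), digest.strip().lower()
        if algo not in STRONG_ALGORITHMS:
            raise ValueError("weak or unknown hash algorithm: " + algo)

        # Any cached object with this hash will do, whatever URL it came from.
        key = (algo, digest)
        if key in self._by_hash:
            return self._by_hash[key]

        # Cache miss: fetch from the URI, then verify before caching.
        body = self._fetch(url)
        if hashlib.new(algo, body).hexdigest() != digest:
            raise ValueError("fetched object does not match requested hash")
        self._by_hash[key] = body
        return body
```

The cache is keyed on the hash alone, so the URL only matters on a miss.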

In the distro mirroring case, the metadata would contain the hashes of the packages (which it probably already does). The client would supply these to the proxy. The proxy would be able to satisfy the request no matter which mirror was selected, so you wouldn’t get the situation where the proxy is downloading several copies of the same data from different mirrors.
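
On the client side, this might look something like the sketch below. The function names, the mirror hostnames and the package path are all made up, and the `Content-Hash` header is the one proposed above, not an existing standard. The point is only that the header depends on the hash from the metadata and not on the mirror.

```python
import hashlib
from typing import Dict, Tuple


def hashed_request(mirror: str, path: str, sha256_hex: str) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for one package (hypothetical helper)."""
    url = mirror.rstrip("/") + "/" + path.lstrip("/")
    headers = {"Content-Hash": "sha256 " + sha256_hex.lower()}
    return url, headers


def verify(body: bytes, sha256_hex: str) -> bool:
    """Check the downloaded bytes against the hash from the metadata."""
    return hashlib.sha256(body).hexdigest() == sha256_hex.lower()


# Usage: two different mirrors, one package, identical Content-Hash header.
package = b"pretend this is an RPM"
digest = hashlib.sha256(package).hexdigest()
url_a, headers_a = hashed_request("http://mirror-a.example/", "pkgs/foo.rpm", digest)
url_b, headers_b = hashed_request("http://mirror-b.example", "/pkgs/foo.rpm", digest)
```

The client should still run `verify` on whatever comes back, since it has no reason to trust the proxy any more than it trusts a mirror.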

In the git case, objects are already named by their hashes, so a commit ID is the hash. This would finally let us have an intelligent git mirror, something I’ve been wanting for a while, given that I’m on slow DSL and downloading gnulib multiple times per day is no fun for anyone.
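
For reference, this is roughly how git derives an object name in its classic SHA-1 object format: it hashes a short header (type, payload length, NUL byte) followed by the payload. The helper name `git_object_id` is mine, and repositories using git's newer SHA-256 format would differ. Note that the hash covers the uncompressed object, not the compressed bytes that go over the wire, so a real proxy would need to take that into account.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 object name git assigns to a payload of the given type."""
    # Header is "<type> <length>" followed by a NUL byte, e.g. b"blob 12\0".
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


# Usage: the name of a blob depends only on its contents.
blob_id = git_object_id("blob", b"hello world\n")
```

For a blob, this should agree with what `git hash-object` reports for the same bytes.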

8 responses to “Half-baked idea: Content-addressable web proxy”

  1. H. Peter Anvin

    Sounds like you want to use a URN (as opposed to URL) scheme.

  2. There is the PeerDist protocol (from Microsoft), which deals with hash-based caching and retrieval of data over a variety of transports.

    Chris Hertel (Samba Team, Red Hat Storage) works on a free software implementation called Prequel: http://ubiqx.org/proj/Prequel/ and https://fedorahosted.org/prequel/ for source code, and http://msdn.microsoft.com/en-us/library/dd303704.aspx for the protocol spec.

  3. Tobu

    Looks like content-centric networking, here’s a quick description:
    In the PARC vision of CCN, content is divided into packet-size chunks identified by a unique name with a particular hierarchical structure. The name and content can be cryptographically encoded and signed, providing a range of security levels. Packets in CCN carry names rather than addresses and this has a fundamental impact on the way the network works.

    Security concerns are addressed at the content level, relaxing requirements on hosts and the network. Users no longer need a universally known address, greatly facilitating management of mobility and intermittent connectivity. Content is supplied under receiver control, limiting scope for denial of service attacks and similar abuse. Since chunks are self-certifying, they can be freely replicated, facilitating caching and bringing significant bandwidth economies. (From https://team.inria.fr/rap/research/ccn/)

    http://conferences.sigcomm.org/co-next/2009/papers/Jacobson.pdf and http://www.ccnx.org/
    http://tools.ietf.org/html/rfc6392 and http://irtf.org/icnrg
