the perishable masses

Posted on July 20, 2007

I wonder if there's a lunatic out there who is frantically printing out page after page of the Internet for fear of its impending destruction. This destruction could be for any number of reasons, including politics, nuclear war, or alien invasion. Perhaps he sits alone in his Unabomber-style shack with a couple of dozen inkjets, crawling the web and harvesting sites to preserve our modern culture for posterity.

Doubtless, there are institutions eager to maintain archival copies of websites; a quick Google search yielded the International Internet Preservation Consortium and a paper by Margaret Phillips entitled "The Preservation of Internet Publications," which essentially advocates the work of groups such as the aforementioned.

As an aside, it would make me quite happy (even giddy) if you're reading this a number of years in the future from any location other than the server currently hosting it, which is located somewhere in the southwest U.S. It would make me somewhat happy if this article were still formatted in HTML and the links above were no longer functional because the International Internet Preservation Consortium had been unable to manage its own preservation.

The problem is that none of the resources to which I gave a cursory glance are eager to create physical copies of their massive hoards of information. Phillips' paper seems to make a fleeting mention of the concept (emphasis added), although the author decries it in passing:

We also need preservation techniques including refreshing, migration, emulation and possibly (although this is not a favoured option) hardware and software museums.

Archiving the Internet in physical form would indubitably be a momentous and costly undertaking, but might it be worth it in the long run? Phillips raises several important issues with web archiving that are not necessarily restricted to the physical manifestation of the Internet, not the least of which is the question of who should be in charge of such a task. Independent, non-government-sponsored international committees are probably best suited to the work, but their funding would surely be limited.

In many ways, losing the information from the Internet would not greatly diminish the profundity of mankind. Although exact figures are currently unknown, much of the Internet may fall into several disposable categories: pornographic, advertising-related, redundant, superfluous, or patently offensive. Preserving these materials in a datacenter with nearly limitless amounts of virtual storage space is one thing, but physical copies of such material would be unnecessary. Still, the proposition of an agency, consortium, or commission scouring what could be hundreds of billions or even trillions of pages of material and then selectively printing copies seems unfathomable.

It is not, however, unfathomable to see some crazy little man doing it on his own. Perhaps one day this will be revealed to be the case, and I want to personally meet the guy so I can have a good laugh at him.

Filed under: tech
Comments (4)
  1. I know that this isn't your idea, and I'm sure that you can see the fallacy in all this. But there are a few fatal flaws in the whole plan of saving a replica of the internet.

    Practically, it's like trying to save a person. If you were to think of the internet as a human nervous system, and each entity thereon as a cell, you would have to have cross-sections of the internet for every time the structure of the internet changed: every time someone logs in to the internet, every time a server is rebooted, every time an HTML tag is changed. In some studies of DNA (in worms), the storage of a few seconds of one cell can take up 60 MB or more. Considering the width and breadth of the internet, just an instant of time of the internet would be bordering on a terabyte. The first time Google's crawler was set loose on the internet, it took 9 days to do a preliminary scan. Seeing as how the internet inevitably had changed in that time, you would be left with a 9-day period in which you got a wave of instances of each page that was viewed.

    Another practical problem is that server-side data (and therefore the behavior that will differ from viewer to viewer) is obstructed from crawlers. CSS, JavaScript, and the like would not be capable of being replicated unless given server-side access to the servers. Also, there is an ever-increasing number of websites which store personal, private, and secure data. No crawler would be able to have access to these areas of the website without having copies of all of the databases behind the websites.

    The last and largest practical problem (that isn't particularly obvious): How the hell would you go about trying to display all of this information? Would you just have an internet browser, where you would set the date, enter a web site, and off you go? To be able to do something like this, you would have to have complete copies of the internet for every possible time you could set this browser to.

    In short, saving copies of all the versions of all the web pages on the internet would be like going to every software development company, getting versions of every program they were making, and then trying to figure out a way to use them all. The internet is a living entity, and as such, it will continue to change in unforeseeable ways. I do think that keeping a definitive history of the technologies used in the creation of the internet would be a cool thing. (But those things are already documented in APIs, and ganders of educational books.)

  2. Well, the snapshots of the web would just have to be taken every few years, since that's the only remotely feasible way of doing it. It would be just like having a photo album of a child at age 5, then again at age 10, etc. Just because it's constantly changing doesn't mean you couldn't theoretically save the Internet at a given point. You wouldn't have to have every possible time frame covered; gaps in time would have to be acceptable. But the whole concept of having a building filled with filing cabinets containing innumerable pages of printed web content is what I was after, and one that will surely never be attempted.

    But in my view, none of the server-side stuff would be important anyway, since the content is all that matters. Saving the layout of a page would be nice, but the words or images on the page are the bottom line.

    Ganders of educational books? ;)

  3. I'm sure that the major search engines keep at least an abridged history of pages somewhere. Or I don't see why it wouldn't be feasible for them to do so. I do think that a lot of the content on the web is generated server side (especially with the advent of XHTML, DHTML, JavaScript, etc. It would be hard to even get an idea of the interactions on webpages without a way to save these things also. There are some websites where the content is extremely massive, but you have to be logged in to see most of it (social networks, for one). You're right that it would be like a photo album, I suppose. It's just really hard to capture the true essence of a moment in time; a picture is probably the best we can do for a long while.

    Haha, I meant gaggle! Oh well. Gander sounded pretty ridiculous at the time I was using it, but I couldn't think of the right word.

  4. close your parentheses, asshole.
