The Affair of the Vanishing Content

http://www.archive.org/
"Digitized information, especially on the Internet, has such rapid turnover these days that total loss is the norm. Civilization is developing severe amnesia as a result; indeed it may have become too amnesiac already to notice the problem properly."

(Stewart Brand, President, The Long Now Foundation)

Thousands of articles and essays posted by hundreds of authors were lost forever when themestream.com unexpectedly shut its virtual gates. A sizable portion of the 1960 census, recorded on UNIVAC II-A tapes, is now inaccessible. Web hosts crash daily, erasing valuable content in the process. Access to web sites is often suspended - or blocked altogether - because of a real (or imagined) violation by the webmaster of the host's Terms of Service (TOS). Millions of other web sites - the results of collective, multi-year, transcontinental efforts - contain unique stores of information in the form of databases, articles, discussion threads, and links to other web sites. Consider "Central Europe Review". Its archives comprise more than 2500 articles and essays about every conceivable aspect of Central and Eastern Europe and the Balkans. It is one of countless such collections.

Similar, and much larger, treasures have perished since the dawn of the digital age in the 1920s. Very few early radio and TV programs have survived, for instance. The current "digital dark age" can be compared only to the one which followed the torching of the Library of Alexandria. The more accessible and abundant the information available to us, the more devalued and common it becomes, and the less institutional and cultural memory we seem to possess. In the battle between paper and screen, the former has won formidably. Newspaper archives dating back to the 1700s are now being digitized - testifying to the endurance, resilience, and longevity of paper.

Enter the "Internet Libraries", or Digital Archival Repositories (DAR). These are libraries that provide free access to digital materials replicated across multiple servers ("safety in redundancy"). They contain Web pages, television programming, films, e-books, archives of discussion lists, etc. Such materials can help linguists trace the development of language, journalists conduct research, scholars compare notes, students learn, and teachers teach. The Internet's evolution closely mirrors the social and cultural history of North America at the end of the 20th century. If this record is not preserved, our understanding of who we are and where we are going will be severely hampered. The clues to our future lie ensconced in our past. Preserving it is the only guarantee against repeating the mistakes of our predecessors. Long-gone Web pages cached by the likes of Google and Alexa constitute the first tier of such archival undertakings.

The Stanford Archival Vault (SAV) at Stanford University assigns a numerical handle to every digital "object" (record) in a repository. The handle is the numerical result of a mathematical formula whose input is the bits of information in the original object being deposited. This makes it possible to track and uniquely identify records across multiple repositories. It also guards against tampering, because an altered object no longer matches its handle. SAV also offers application layers. These allow programmers to develop digital archive software and permit users to change the "view" (the interface) of an archive and thus to mine data. Its "reliability layer" verifies the completeness and accuracy of digital repositories.
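
The idea of a handle derived from an object's own bits can be illustrated with a minimal sketch. It is not SAV's actual formula, which is not specified here; SHA-256 is swapped in purely as an assumption, and the names make_handle and Repository are hypothetical. The sketch shows why identical records receive the same handle in any repository, and why an altered record can be detected.

```python
import hashlib


def make_handle(data: bytes) -> int:
    """Derive a numerical handle from the object's bits.

    Assumption: SHA-256 stands in for the archive's real formula.
    """
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


class Repository:
    """Hypothetical in-memory repository keyed by content-derived handles."""

    def __init__(self) -> None:
        self._objects: dict[int, bytes] = {}

    def deposit(self, data: bytes) -> int:
        # The handle depends only on the content, not on where it is stored.
        handle = make_handle(data)
        self._objects[handle] = data
        return handle

    def fetch(self, handle: int) -> bytes:
        return self._objects[handle]

    def verify(self, handle: int) -> bool:
        # Recompute the handle from the stored bits; a mismatch means the
        # stored object is missing or no longer the one that was deposited.
        data = self._objects.get(handle)
        return data is not None and make_handle(data) == handle


if __name__ == "__main__":
    first, second = Repository(), Repository()
    record = b"An essay deposited for safekeeping."
    handle_a = first.deposit(record)
    handle_b = second.deposit(record)
    print(handle_a == handle_b)    # same content, same handle in both places
    print(first.verify(handle_a))  # stored bits still match the handle
```

Because the handle is computed from the content alone, two repositories can compare handles to confirm they hold the same record without exchanging the record itself.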

The Internet Archive, a leading digital depository, in its own words:

"...is working to prevent the Internet