The Metaphors of the Net

1. The Genetic Blueprint

A decade after the invention of the World Wide Web, Tim Berners-Lee is promoting the "Semantic Web". Hitherto, the Internet has been a repository of digital content. It has a rudimentary inventory system and very crude data location services. As a sad result, most of the content is invisible and inaccessible. Moreover, the Internet manipulates strings of symbols, not logical or semantic propositions. In other words, the Net compares values but does not know the meaning of the values it thus manipulates. It is unable to interpret strings, to infer new facts, to deduce, induce, derive, or otherwise comprehend what it is doing. In short, it does not understand language. Run an ambiguous term by any search engine and these shortcomings become painfully evident. This lack of understanding of the semantic foundations of its raw material (data, information) prevents applications and databases from sharing resources and feeding each other. The Internet is discrete, not continuous. It resembles an archipelago, with users hopping from island to island in a frantic search for relevancy.

Even visionaries like Berners-Lee do not contemplate an "intelligent Web". They are simply proposing to let users, content creators, and web developers assign descriptive meta-tags ("name of hotel") to fields, or to strings of symbols ("Hilton"). These meta-tags (arranged in semantic and relational "ontologies" - lists of metatags, their meanings and how they relate to each other) will be read by various applications and allow them to process the associated strings of symbols correctly (place the word "Hilton" in your address book under "hotels"). This will make information retrieval more efficient and reliable and the information retrieved is bound to be more relevant and amenable to higher level processing (statistics, the development of heuristic rules, etc.). The shift is from HTML (whose tags are concerned with visual appearances and content indexing) to languages such as the DARPA Agent Markup Language, OIL (Ontology Inference Layer or Ontology Interchange Language), or even XML (whose tags are concerned with content taxonomy, document structure, and semantics). This would bring the Internet closer to the classic library card catalogue.
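The address-book example above can be made concrete with a minimal sketch. It is an illustration only, not a rendering of DAML, OIL, or XML: the ONTOLOGY table, the tag names, and the file_entry helper are all hypothetical. It shows how the same string of symbols ("Hilton") is handled differently once a descriptive meta-tag is attached to it.

```python
# Hypothetical, hand-made "ontology": each meta-tag maps to the
# address-book category that an application should file it under.
ONTOLOGY = {
    "name of hotel": {"category": "hotels"},
    "name of person": {"category": "people"},
}


def file_entry(address_book, text, tag):
    """File a string under the category that its meta-tag implies."""
    category = ONTOLOGY[tag]["category"]
    address_book.setdefault(category, []).append(text)
    return category


address_book = {}
# The bare string "Hilton" is ambiguous; the meta-tag resolves it.
file_entry(address_book, "Hilton", "name of hotel")
file_entry(address_book, "Hilton", "name of person")
print(address_book)
```

Without the tag, an application can only compare the two strings and find them identical. With it, the two entries end up in different categories.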

Even in its current, pre-semantic, hyperlink-dependent phase, the Internet brings to mind Richard Dawkins' seminal work "The Selfish Gene" (OUP, 1976). This would be doubly true for the Semantic Web.

Dawkins suggested generalizing the principle of natural selection to a law of the survival of the stable. "A stable thing is a collection of atoms which is permanent enough or common enough to deserve a name". He then proceeded to describe the emergence of "Replicators" - molecules which created copies of themselves. The Replicators that survived in the competition for scarce raw materials were characterized by high longevity, fecundity, and copying-fidelity. Replicators (now known as "genes") constructed "survival machines" (organisms) to shield them from the vagaries of an ever-harsher environment.

This is very reminiscent of the Internet. The "stable things" are HTML-coded web pages. They are replicators - they create copies of themselves every time their "web address" (URL) is clicked. The HTML coding of a web page can be thought of as "genetic material". It contains all the information needed to reproduce the page. And, exactly as in nature, the higher the longevity, fecundity (measured in links to the web page from other web sites), and copying-fidelity of the HTML code - the higher its chances of surviving (as a web page).

Replicator molecules (DNA) and replicator HTML have one thing in common - they are both packaged information. In the appropriate context (the right biochemical "soup" in the case of DNA, the right software application in the case of HTML code) - this information generates a "survival machine" (organism, or a web page).

The Semantic Web will only increase the longevity, fecundity, and copying-fidelity of the underlying code (in this case, OIL or XML instead of HTML). By facilitating many more interactions with many other web pages and databases - the underlying "replicator" code will ensure the "survival" of "its" web page (=its survival machine). In this analogy, the web page's "DNA" (its OIL or XML code) contains "single genes" (semantic meta-tags). The whole process of life is the unfolding of a kind of Semantic Web.

In a prophetic paragraph, Dawkins described the Internet:

"The first thing to grasp about a modern replicator is that it is highly gregarious. A survival machine is a vehicle containing not just one gene but many thousands. The manufacture of a body is a cooperative venture of such intricacy that it is almost impossible to disentangle the contribution of one gene from that of another. A given gene will have many different effects on quite different parts of the body. A given part of the body will be influenced by many genes and the effect of any one gene depends on interaction with many others...In terms of the analogy, any given page of the plans makes reference to many different parts of the building; and each page makes sense only in terms of cross-reference to numerous other pages."

What Dawkins neglected in his important work is the concept of the Network. People congregate in cities, mate, and reproduce, thus providing genes with new "survival machines". But Dawkins himself suggested that the new Replicator is the "meme" - an idea, belief, technique, technology, work of art, or bit of information. Memes use human brains as "survival machines" and they hop from brain to brain and across time and space ("communications") in the process of cultural (as distinct from biological) evolution. The Internet is a latter day meme-hopping playground. But, more importantly, it is a Network. Genes move from one container to another through a linear, serial, tedious process which involves prolonged periods of one on one gene shuffling ("sex") and gestation. Memes use networks. Their propagation is, therefore, parallel, fast, and all-pervasive. The Internet is a manifestation of the growing predominance of memes over genes. And the Semantic Web may be to the Internet what Artificial Intelligence is to classic computing. We may be on the threshold of a self-aware Web.

2. The Internet as a Chaotic Library

A. The Problem of Cataloguing

The Internet is an assortment of billions of pages which contain information. Some of them are visible and others are generated from hidden databases by users' requests ("Invisible Internet").

The Internet exhibits no discernible order, classification, or categorization. Amazingly, as opposed to "classical" libraries, no one has yet invented a (sorely needed) Internet cataloguing standard (remember Dewey?). Some sites indeed apply the Dewey Decimal System to their contents (Suite101). Others default to a directory structure (Open Directory, Yahoo!, Look Smart and others).

Had such a standard existed (an agreed upon numerical cataloguing method) - each site could have self-classified. Sites would have an interest in doing so to increase their visibility. This, naturally, would have eliminated the need for today's clunky, incomplete and (highly) inefficient search engines.

Thus, a site whose number starts with 900 will be immediately identified as dealing with history and multiple classification will be encouraged to allow finer cross-sections to emerge. An example of such an emerging technology of "self classification" and "self-publication" (though limited to scholarly resources) is the "Academic Resource Channel" by Scindex.
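A minimal sketch of such self-classification follows. It assumes that a site simply declares one or more Dewey-style numbers; how a site would publish them is not specified here, and the function names and sample codes are illustrative. The table lists the ten Dewey main classes, and the leading digit of a number selects the class, so any number starting with 9 resolves to history.

```python
# The ten Dewey Decimal main classes, keyed by their hundreds value.
DEWEY_MAIN_CLASSES = {
    0: "Computer science, information and general works",
    100: "Philosophy and psychology",
    200: "Religion",
    300: "Social sciences",
    400: "Language",
    500: "Science",
    600: "Technology",
    700: "Arts and recreation",
    800: "Literature",
    900: "History and geography",
}


def main_class(code):
    """Map a Dewey-style number such as '940.53' to its main class."""
    hundreds = int(float(code)) // 100 * 100
    return DEWEY_MAIN_CLASSES[hundreds]


def classify_site(codes):
    """Multiple classification: return every main class a site declares."""
    return sorted({main_class(code) for code in codes})


# Hypothetical site that declares two numbers for itself.
print(classify_site(["940.53", "355"]))
```

A catalogue-like browser could then filter sites by main class first and by the finer digits afterwards, which is the kind of cross-section that multiple classification is meant to allow.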

Moreover, users will not be required to remember reams of numbers. Future browsers will be akin to catalogues, very much like the applications used in modern day libraries. Compare this utopia to the current dystopia. Users struggle with mounds of irrelevant material to finally reach a partial and disappointing destination. At the same time, there likely are web sites which exactly match the poor user's needs. Yet, what currently determines the chances of a happy encounter between user and content - are the whims of the specific search engine used and things like meta-tags, headlines, a fee paid, or the right opening sentences.

B. Screen vs. Page

The computer screen, because of physical limitations (size, the fact that it has to be scrolled) fails to effectively compete with the printed page. The latter is still the most ingenious medium yet invented for the storage and release of textual information. Granted: a computer screen is better at highlighting discrete units of information. So, these differing capacities draw the battle lines: structures (printed pages) versus units (screen), the continuous and easily reversible (print) versus the discrete (screen).

The solution lies in finding an efficient way to translate computer screens to printed matter. It is hard to believe, but no such thing exists. Computer screens are still hostile to off-line printing. In other words: if a user copies information from the Internet to his word processor (or vice versa, for that matter) - he ends up with a fragmented, garbage-filled and non-aesthetic document.

Very few site developers try to do something about it - even fewer succeed.

C. Dynamic vs. Static Interactions

One of the biggest mistakes of content suppliers is that they do not provide a "static-dynamic interaction".

Internet-based content can now easily interact with other media (e.g., CD-ROMs) and with non-PC platforms (PDAs, mobile phones).

Examples abound:

A CD-ROM shopping catalogue interacts with a Web site to allow the user to order a product. The catalogue could also be updated through the site (as is the practice with CD-ROM encyclopedias). The advantages of the CD-ROM are clear: very fast access time (dozens of times faster than the access to a Web site using a dial up connection) and a data storage capacity hundreds of times bigger than the average Web page.

Another example:

A PDA plug-in disposable chip containing hundreds of advertisements or a "yellow pages". The consumer selects the ad or entry that she wants to see and connects to the Internet to view a relevant video. She could then also have an interactive chat (or a conference) with a salesperson, receive information about the company, about the ad, about the advertising agency which created the ad - and so on.

CD-ROM based encyclopedias (such as the Britannica, or the Encarta) already contain hyperlinks which carry the user to sites selected by an Editorial Board.

Note

CD-ROMs are probably a doomed medium. Storage capacity increases exponentially and, within a year, desktops with 80 GB hard disks will be a common sight. Moreover, the much heralded Network Computer - the stripped down version of the personal computer - will put at the disposal of the average user terabytes in storage capacity and the processing power of a supercomputer. What separates computer users from this utopia is the communication bandwidth. With the introduction of radio and satellite broadband services, DSL and ADSL, cable modems coupled with advanced compression standards - video (on demand), audio and data will be available speedily and plentifully.

The CD-ROM, on the other hand, is not mobile. It requires installation and the utilization of sophisticated hardware and software. This is no user friendly push technology. It is nerd-oriented. As a result, CD-ROMs are not an immediate medium. There is a long time lapse between the moment of purchase and the moment the user accesses the data. Compare this to a book or a magazine. Data in these oldest of media is instantly available to the user and they allow for easy and accurate "back" and "forward" functions.

Perhaps the biggest mistake of CD-ROM manufacturers has been their inability to offer an integrated hardware and software package. CD-ROMs are not compact. A Walkman is a compact hardware-cum-software package. It is easily transportable, it is thin, it contains numerous, user-friendly, sophisticated functions, it provides immediate access to data. So does the discman, or the MP3-man, or the new generation of e-books (e.g., E-Ink's). This cannot be said about the CD-ROM. By tying its future to the obsolete concept of stand-alone, expensive, inefficient and technologically unreliable personal computers - CD-ROMs have sentenced themselves to oblivion (with the possible exception of reference material).

D. Online Reference

A visit to the on-line Encyclopaedia Britannica demonstrates some of the tremendous, mind boggling possibilities of online reference - as well as some of the obstacles.

Each entry in this mammoth work of reference is hyperlinked to relevant Web sites. The sites are carefully screened. Links are available to data in various forms, including audio and video. Everything can be copied to the hard disk or to a R/W CD.

This is a new conception of a knowledge centre - not just a heap of material. The content is modular and continuously enriched. It can be linked to a voice Q&A centre. Queries by subscribers can be answered by e-mail or by fax, or posted on the site, and hard copies can be sent by post. This "Trivial Pursuit" or "homework" service could be very popular - there is considerable appetite for "Just in Time Information". The Library of Congress - together with a few other libraries - is in the process of making just such a service available to the public (CDRS - Collaborative Digital Reference Service).

E. Derivative Content

The Internet is an enormous reservoir of archives of freely accessible, or even public domain, information.

With a minimal investment, this information can be gathered into coherent, theme oriented, cheap compilations (on CD-ROMs, print, e-books or other media).

F. E-Publishing

The Internet is by far the world's largest publishing platform. It incorporates FAQs (Q&A's regarding almost every technical matter in the world), e-zines (electronic magazines), the electronic versions of print dailies and periodicals (in conjunction with on-line news and information services), reference material, e-books, monographs, articles, minutes of discussions ("threads"), conference proceedings, and much more besides.

The Internet represents major advantages to publishers. Consider the electronic version of a p-zine.

Publishing an e-zine promotes the sales of the printed edition, it helps sign on subscribers and it leads to the sale of advertising space. The electronic archive function (see next section) saves the need to file back issues, the physical space required to do so and the irritating search for data items.

The future trend is a combined subscription to both the electronic edition (mainly for the archival value and the ability to hyperlink to additional information) and to the print one (easier to browse the current issue). The Economist is already offering free access to its electronic archives as an inducement to its print subscribers.

The electronic daily presents other advantages:

It allows for immediate feedback and for flowing, almost real-time, communication between writers and readers. The electronic version, therefore, acquires a gyroscopic function: a navigation instrument, always indicating deviations from the "right" course. The content can be instantly updated and breaking news incorporated in older content.

Specialty hand held devices already allow for downloading and storage of vast quantities of data (up to 4000 print pages). The user gains access to libraries containing hundreds of texts, adapted to be downloaded, stored and read by the specific device. A convergence of standards is to be expected in this field as well (the final contenders will probably be Adobe's PDF against Microsoft's MS-Reader).

Currently, e-books are treated dichotomously: either as a continuation of print books (p-books) by other means, or as a whole new publishing universe.

Since p-books are a more convenient medium than e-books - they will prevail in any straightforward "medium replacement" or "medium displacement" battle.

In other words, if publishers persist in the simple and straightforward conversion of p-books to e-books - then e-books are doomed. They are simply inferior and cannot offer the comfort, tactile delights, browseability and scannability of p-books.

But e-books - being digital - open up a vista of hitherto neglected possibilities. These will only be enhanced and enriched by the introduction of e-paper and e-ink. Among them:

G. The Archive Function

The Internet is also the world's biggest cemetery: tens of thousands of deadbeat sites, still accessible - the "Ghost Sites" of this electronic frontier.

This, in a way, is collective memory. One of the Internet's main functions will be to preserve and transfer knowledge through time. It is called "memory" in biology - and "archive" in library science. The history of the Internet is being documented by search engines (Google) and specialized services (Alexa) alike.

3. The Internet as a Collective Nervous System

Drawing a comparison from the development of a human infant - the human race has just commenced to develop its neural system.

The Internet fulfils all the functions of the Nervous System in the body and is, both functionally and structurally, pretty similar. It is decentralized and redundant (each part can serve as functional backup in case of malfunction). It hosts information which is accessible through various paths, it contains a memory function, and it is multimodal (multimedia - textual, visual, audio and animation).

I believe that the comparison is not superficial and that studying the functions of the brain (from infancy to adulthood) is likely to shed light on the future of the Net itself. The Net - exactly like the nervous system - provides pathways for the transport of goods and services - but also of memes and information, their processing, modeling, and integration.

A. The Collective Computer

Carrying the metaphor of "a collective brain" further, we would expect the processing of information to take place on the Internet, rather than inside the end-user