Steal this Ebook
The hippie bible "Steal This Book" was an anti-capitalism
manifesto, widely remembered for its creative, aggressive title
and its subversive message.
Paper books like that one were not easy to steal, because of the
guards in the bookstores and the costly effort of retyping,
mastering, printing, binding and distributing. In the digital
era you can get an ebook, crack or OCR it if it is protected, and
copy-paste it.
The main clients for ebook-stealing tools are webmasters, in
particular those with an interest in decent search engine
ranking. To please the search engines and acquire deep
indexing, high relevance and top ranking, a lot of relevant
content must be written (or copied).
When I was into science there was a phrase, "Publish or perish",
that expressed more or less the same. Quantity instead of
quality. I guess most of us would like to have the time to write
short, juicy articles instead of Goog-friendly mumbo-jumbo.
Machine-writing is a relatively new activity. And I am not
talking about simple copy-pasting, but about a more sophisticated
breed of specialized text-generation software. Some simple
programs take a number of words and just mix them, like
Ktumbler. Others generate random English-like phrases, like
the "Web Economy Bullshit Generator", or sensible, well-formed
English phrases by the thousand, like the Phrase Generator in
Synonymizer.
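As an illustration only (I have not seen the code of Ktumbler or
the Phrase Generator, so this is merely the kind of thing such
tools could do), a word mixer and a toy phrase generator fit in a
few lines of Python:

    import random

    def mix_words(text, seed=None):
        # Naive word mixer: shuffle the words of the input text.
        words = text.split()
        rng = random.Random(seed)
        rng.shuffle(words)
        return " ".join(words)

    def random_phrase(seed=None):
        # Tiny template-based phrase generator; the word lists are my own toys.
        rng = random.Random(seed)
        verbs = ["leverage", "monetize", "aggregate", "empower"]
        adjectives = ["scalable", "viral", "turn-key", "frictionless"]
        nouns = ["synergies", "eyeballs", "paradigms", "communities"]
        return " ".join([rng.choice(verbs), rng.choice(adjectives), rng.choice(nouns)])

    print(mix_words("the quick brown fox jumps over the lazy dog"))
    print(random_phrase())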
But phrases are no substitute for longer texts, and many lazy or
greedy webmasters are tempted to copy-paste from the web, which
is easier and faster than thinking. A much-criticized breed of
website-ranking system, cloaking, copies large amounts of
relevant, well-ranked text from the web and shows it to the
search engine spiders. In this way, the site pretends to have more
and better content than any other. And best of all, when
the visitor is a human, it hides the stolen goods and shows some
innocent words. Cloaking refers to that ability, and it is
achieved by IP detection and comparison with known spider data.
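For the curious, the mechanics can be sketched in a few lines.
This is my own simplification, with placeholder IP prefixes and
user-agent names rather than a real, maintained spider list:

    # Simplified cloaking logic: serve one page to known search engine
    # spiders and another to human visitors. The IP prefixes and
    # user-agent strings are illustrative placeholders only.
    KNOWN_SPIDER_IP_PREFIXES = ("66.249.",)        # e.g. some Googlebot ranges
    KNOWN_SPIDER_AGENTS = ("Googlebot", "Slurp")   # Goog and Yah crawlers

    def page_for(visitor_ip, user_agent):
        is_spider = visitor_ip.startswith(KNOWN_SPIDER_IP_PREFIXES) or \
                    any(name in user_agent for name in KNOWN_SPIDER_AGENTS)
        if is_spider:
            return "well_ranked_borrowed_text.html"   # what the spider sees
        return "innocent_words.html"                  # what the human sees

    print(page_for("66.249.66.1", "Googlebot/2.1"))
    print(page_for("203.0.113.7", "Mozilla/5.0"))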
Finally, the synonymizing and text-mixing tools modify the
original texts beyond recognition. Like facial surgery after
committing a crime.
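A naive synonymizer, again only as an illustration and with a toy
dictionary of my own (the commercial tools presumably use far
larger dictionaries and some grammar awareness):

    # Naive dictionary-based synonymizer: replace words found in a
    # small hand-made synonym table.
    SYNONYMS = {
        "horse": "equine",
        "flood": "big rain",
        "light": "illumination",
        "kingdom": "realm",
    }

    def synonymize(text):
        out = []
        for word in text.split():
            key = word.strip(".,!?").lower()
            out.append(SYNONYMS.get(key, word))
        return " ".join(out)

    print(synonymize("My kingdom for a horse"))
    # -> "My realm for a equine"  (note: no grammar fixing in this toy)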
However, I am afraid there is a little copyright issue...
The problem is that nobody knows exactly what the limits of
automated text surgery are.
To synonymize or otherwise disguise a source text is morally
wrong, most of us would agree. But that has limits. If I say "My
Kingdom for an equine", or "After me, the big rain", or "Let
there be illumination", you will understand that synonyms cannot
be forbidden. There are also many situations in which
overprotective legislation blocks creativity and innovation.
I am tempted to emulate Abbie Hoffman and discuss the morality
of copyright and the whole issue of intellectual property.
However, those are deep waters.
My point is that there is no established parameter that defines
plagiarism in texts. What if I substitute "2" for "two" and Goog
for Google, as I do to avoid being noticed by them? And Yah for
Yahoo, not to forget the pioneer? Or if I change just one word
in a phrase? What if I just shuffle the phrases in a text? Goog will
probably still consider that the keyword density is correct for
top ranking, minus a correction to account for the fact that the
start of a file is more important than the end.
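Nobody outside Goog knows the real formula, but the idea of
keyword density with a positional correction can be sketched like
this (the linear weighting is purely my guess, not Goog's formula):

    def weighted_keyword_density(words, keyword):
        # Keyword density with a linear positional weight: occurrences
        # near the start of the file count more than those near the end.
        n = len(words)
        if n == 0:
            return 0.0
        total = hits = 0.0
        for i, word in enumerate(words):
            weight = 1.0 - i / (2.0 * n)   # 1.0 at the start, ~0.5 at the end
            total += weight
            if word.lower() == keyword.lower():
                hits += weight
        return hits / total

    text = "ebook tools for ebook webmasters who copy ebook content".split()
    print(weighted_keyword_density(text, "ebook"))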
Many Plagiarism Detection services can compare any submitted
text with a large library (mostly, the WWW) and decide whether there
is enough similarity with a certain source. They usually do not
disclose their algorithm, but assure us that "it comprises proprietary
technology" and "detects digital signature of the authors". A
leader in this field says "Copyscape looks for pages containing
sizeable chunks of identical text". Nobody knows how that
translates into numbers.
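One plausible, and entirely guessed, way to turn "sizeable chunks
of identical text" into numbers is word shingling: count how many
runs of, say, eight consecutive words two documents share. The
chunk length is an arbitrary choice of mine; Copyscape and its
peers do not publish their real thresholds.

    def shingles(text, n=8):
        # All runs of n consecutive words ("shingles") in a text, lower-cased.
        words = text.lower().split()
        return {" ".join(words[i:i + n])
                for i in range(max(len(words) - n + 1, 0))}

    def shared_chunks(doc_a, doc_b, n=8):
        # Number of identical n-word chunks the two documents have in common.
        return len(shingles(doc_a, n) & shingles(doc_b, n))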
Goog and the other search engines are against "duplicated
content", but they do not define it.
The availability of "text de-authoring tools" makes the
intellectual property issue very blurry for any attorney willing
to evaluate the existence of a crime. And as a collateral
effect, the modified text will not be detected by the
anti-plagiarism tools, which mostly search for exact text
coincidences.
I started to examine some Plagiarism Detection sites, and I
noticed that the better ones charge some kind of fee. Of
course, it is not easy to compare a student term paper with the
whole Library of Congress and the whole Web, plus the old Web
archives in the Wayback Machine. Others, like Ferret, are free and
let you compare your file with another, but you need to provide
BOTH files.
So, without hiding my status as an SEO tool maker, I declare the
need for a public algorithm that will establish whether a text is
the unethical or immoral or illegal derivative of a web source. It
is necessary both for nailing plagiarists and for helping writers
and webmasters define the limits of near-plagiarism and
near-plagiarism devices.
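As a starting point for the discussion, such a public algorithm
could be as simple as a published shingle-similarity score with an
agreed threshold; the numbers below are placeholders of mine, not
a proposal of actual legal limits.

    def shingles(text, n=5):
        # All runs of n consecutive words in a text, lower-cased.
        words = text.lower().split()
        return {" ".join(words[i:i + n])
                for i in range(max(len(words) - n + 1, 0))}

    def similarity(doc_a, doc_b, n=5):
        # Jaccard similarity over word shingles: shared chunks divided by
        # all distinct chunks in either document. Public and reproducible.
        a, b = shingles(doc_a, n), shingles(doc_b, n)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    THRESHOLD = 0.30   # placeholder number, not a proposed legal limit

    def is_probable_derivative(candidate, source):
        return similarity(candidate, source) >= THRESHOLD

The point would not be this particular formula, but that anyone
could run the same test on the same two texts and get the same
verdict.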