As part of my upcoming article on “Google and Personal Privacy”, I needed some facts about just how much information Google has to work with and how much power it has to manipulate and index that information. I quickly discovered that no one outside of Google has a clue. This is more than remarkable considering that Google employs more than 50,000 people. To my knowledge no organization of this size, outside of possibly the Mossad, has managed to maintain such a level of secrecy. I can only assume that employees are forced to watch extreme torture videos on a regular basis, accompanied by admonitions and threats of the kind that usually include family members and beloved pets.
This left me with a dilemma. My wife wisely warned me against letting my thoughts drift toward kidnapping a Google employee and hiring a specialist from the Sinaloa Cartel to extract my needed information, although, to me, it did seem like a good choice. Getting a job with Google was out of the question since they would probably expect me to do work – a talent I have not yet acquired. So I had to count on those few paltry talents that I do possess. Among those talents is an ability to add, subtract, multiply and divide (putting me right up there with the average fourth grader), and a not so shabby grasp of social engineering.
Applying the advanced math outlined above I quickly discovered the following:
1 – There are 860,000,000 discrete Web sites on the World Wide Web
2 – The average Web site contains 5,412 pages
I can’t believe no one had this number readily at hand. Admittedly, the problem seems at first glance insurmountable. Facebook, Tumblr, Twitter and dozens of other sites have more than 100 million pages each, while over 300 million websites have only one or two pages each.
It is a difficult problem to approach unprepared. Fortunately, in my basement I keep a few wizard-level software engineers whom I pay with drugs and the services of Bangkok prostitutes and whom I keep in reserve for just such an emergency. Web crawlers, in spite of Google’s posturing and chest pounding, can be programmed in a few hours by moderately competent programmers, even if they are in a sleep-deprived, drugged-up state. The programming was easy. What was not easy was acquiring the necessary computing power and bandwidth. Fortunately, in my profession, we frequently stumble upon embarrassing and compromising facts while looking for legitimate problems. I made a habit of storing away these juicy tidbits for a rainy day. A few gentle calls solved our resource problem and – voila! A random sampling of 100,000 websites was selected and our number is displayed above.
You may either accept my number or go work it out for yourself.
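For the skeptics who do choose to work it out: the estimate is nothing more exotic than a sample mean over page counts. Here is a minimal sketch, with synthetic page counts standing in for real crawler output – the Pareto parameters below are invented for illustration and will not reproduce the 5,412 figure:

```python
import random
import statistics

# Synthetic stand-in for crawler output: page counts for a random
# sample of websites. A heavy-tailed Pareto draw mimics the real Web,
# where most sites have a page or two and a handful have 100 million.
random.seed(42)
sample = [max(1, int(random.paretovariate(1.16))) for _ in range(100_000)]

mean_pages = statistics.mean(sample)
median_pages = statistics.median(sample)

# With a heavy tail the mean sits far above the median: the giant
# sites dominate the average even though typical sites are tiny.
print(f"mean pages per site:   {mean_pages:,.0f}")
print(f"median pages per site: {median_pages:,.0f}")
```

Multiply the sample mean by the estimated number of websites and you get the number in (3) below.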
3 – Meaning that the World Wide Web contains slightly more than 4.65 trillion pages.
For the math enthusiasts among us I have included in Appendix A the detailed formulas and calculations for how (3) can be inferred from (1) and (2).
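For those who will not flip to Appendix A, the inference from (1) and (2) to (3) is a single multiplication:

```python
sites = 860_000_000      # fact (1): discrete websites on the Web
pages_per_site = 5_412   # fact (2): average pages per website

total_pages = sites * pages_per_site
print(f"{total_pages:,} pages")  # 4,654,320,000,000 – a bit over 4.65 trillion
```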
Now… those people who study the Web will immediately say that 75% of all websites are inactive, and they of course are correct. My answer is: So??? We are talking about what Google is doing, not about the current relevancy, to ourselves, of data which may no longer be evolving. However, I did need to know how this non-evolving data was viewed by Google in order to gauge the extent of Google’s power.
Few people have more inactive websites than I do. I start businesses and end them with alarming frequency, so my inactive list is legion. I chose one of my larger inactive sites and peppered the pages with the word “transquishbulant” (I actually used a different but equally interesting word, but have no intention of allowing the general public into my personal affairs, so please forgive the deception). I sat and waited – pondering sublime subjects that have long puzzled philosophers and mystics and indulging in dark epiphanies for a week or so. Then I Googled my word. Lo and behold – I was taken straight to my site.
The argumentative souls among you might say that my inactive sites are of course watched closely by Google since I appear to be Google’s public enemy Number One. While it is true that I am not allowed on any Google campus (see emails below), it is hardly believable that a 69-year-old man with a bad knee and a mind nearly gone from extreme abuse while young, who is also, by the way, totally preoccupied with avoiding hit squads constantly sent by “Friends of Belize”, could be a credible threat to Google. If you believe none of that, then believe this:
I am certain in my heart that, in spite of possessing a social conscience so twisted that, centuries hence, psychologists and social scientists will still be baffled as to how it could have arisen, the basic character of Google’s management would not allow such an “improper” act as singling out a person for unique processing due solely to dislike of that person. I am convinced of this, in spite of the unintelligible dichotomy created by such a viewpoint.
So I am including inactive pages in my calculations. Do with me what you will.
So much for the easy part, which any ten-year-old with time on his hands could have done.
Now comes the difficult part. Knowing the size of Google’s knowledge domain gives us no clue as to the power Google possesses to access and manipulate this knowledge. “Might as well not have bothered,” quipped my adoring wife, although, grateful that I had not pursued the Cartel approach to getting my information, she said it without the usual cutting edge in her voice. “We’ll see,” I replied with secret glee.
In my infinite arrogance and totally unsupported self-esteem I have always viewed myself as a social engineer sine qua non. While I may fall far short of my own estimation of myself, even my detractors grudgingly rate my capabilities as “barely passable”. That’s good enough for our purposes here. Those who object can eat me.
So social engineering is the next tool I reached for in our Quest to Corner Google.
I meant what I said in the previous section about the character of Google’s management. I sincerely believe they would not stoop to dirty tricks. In order to make this a fair contest I would have to adhere to the same ethic. It would be easy as dirt to use active social engineering methods to get the information. Active methods include influence, deception, pretense, etc. An active strategy might include acquiring the phone number of someone in the Spam Reduction unit within Google, calling them and saying:
“This is Gordon Wong, the new V.P. of Statistics Analysis in Singapore. Where the fuck are my corporate stats that your boss promised me?” (You would have to make sure that the employee’s boss is on a silence retreat in the Himalayas, or is similarly unavailable during this conversation.) “Your boss warned me that you constantly dropped the ball. I’m going to see to it that you are dropping balls at some other company from now on.”
There is more to it of course, but this is the gist of it. However, Google management would frown on such an approach and deem it improper. So I am left with passive techniques. Passive techniques involve digging beneath the external, observable characteristics of an organization or a person and discovering the motives or actions behind them. These techniques are less precise and more time consuming, and they frequently fail outright. It’s just the sort of challenge I was looking for.
If you are applying passive social engineering techniques to a large organization, then the passive scientific approach dictates that we should first attempt to find someone who has spilled the beans (the beans we’re looking for) unthinkingly in a social media forum, a blog, a professional speech or some similar venue. This hasn’t happened with a Google employee. You must believe me. It’s how I knew Google was forcing its employees to watch torture videos. The second rule is the Sophomore rule – the “Wise Fool” rule. The Wise Fool is the one who, without giving out facts, tells the whole story by being unable to keep his mouth shut. So, with the help of Juju beads, a large ingestion of Peyote buttons laced with DMT, and an overweight Hopi Shaman, I was introduced to a German-speaking Dung Beetle (wearing one of my old T-shirts) who led me to Matt Cutts.
Matt Cutts is in charge of Google’s spam unit. When I discovered him I felt that even the unbelievable nausea induced by the peyote buttons was well worth it. He is a passive social engineer’s dream come true. Within a few short hours I discovered the following quote from him:
“Google engineers refresh a large fraction of the web every few days”.
It was an auspicious beginning. While it was vague and could imply any number at all, it was a start. My first issue was – “What the fuck does a Large Fraction mean??”
I first looked up the legal definitions just to get a base point and found nothing. I did discover one court case in Texas that stated clearly, from a legal standpoint, what it was not.
It found that 1/6 was definitively too small to be called a Large Fraction. Not much, but the information didn’t hurt any. What I needed to find out was what Matt Cutts generally meant when he used that term.
Some time and a few Peyote buttons later, I stumbled upon this gem from Mr. Cutts in WebPro News:
“You can separate simple popularity from reputation or authority, but now how do we try to figure out whether you’re a good match for a given query?” Cutts continues. “Well, it turns out you can say, take Page Rank for example – if you wanted to do a topical version of Page Rank, you could look at the links to a page, and you could say, ‘OK, suppose it’s Matt Cutts. How many of my links actually talk about Matt Cutts?’ And if there are a lot of links or a large fraction of the links, then I’m pretty topical. I’m maybe an authority for the phrase Matt Cutts.”
God praise the Wise Fool!!!
Knowing that Mr. Cutts thinks nearly as much of himself as I think of myself, I am certain that he knew EXACTLY what fraction of pages that link to him actually talk about him. It was a brief exercise in triviality to reveal that 98.7% of all pages that link to him talk about him. Knowing that we all think of ourselves as more important than the rest of the universe combined, this gives me Mr. Cutts’s upper limit on the meaning of “A Large Fraction”.
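The “brief exercise in triviality” boils down to counting: of the pages that link to a site, what fraction mention the name in their visible text? A sketch with toy stand-ins for the scraped pages (collecting the real link graph is left to those with my basement resources):

```python
def mention_fraction(linking_pages: list[str], name: str) -> float:
    """Fraction of linking pages whose text mentions `name`."""
    if not linking_pages:
        return 0.0
    hits = sum(1 for text in linking_pages if name.lower() in text.lower())
    return hits / len(linking_pages)

# Toy stand-ins for pages that link to mattcutts.com:
pages = [
    "Matt Cutts explains Google's latest ranking update...",
    "SEO roundup: matt cutts on spam signals and penalties...",
    "A bare link directory with no commentary at all.",
]
print(mention_fraction(pages, "Matt Cutts"))  # 2 of 3 pages mention him
```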
But what about the lower limit? This was much harder and required another visit to the Shaman and involved a remarkable adventure with the Dung Beetle, who oddly enough spoke Cantonese this time. What is even more odd is that so did I. But that is a story for another time.
The Beetle led me to a Twitter post from Mr. Cutts that referenced the following quote:
“Google identifies any page that “doesn’t have a lot of visible content above-the-fold or dedicates a large fraction of the site’s initial screen real estate to ads” as the type of page that will be negatively affected.”
At last!!! A large fraction that can be easily quantified. I won’t go into detail here as to how I arrived at the answer, but any idiot with a few weeks of time to waste can verify the following:
A page’s Google ranking is not significantly impacted until at least 55% of the initial screen real estate is composed of ads.
It seems then, that in the mind of Matt Cutts, the lower limit for “A Large Fraction” appears to be 55%.
Still a wide range, you may say. Indeed, but if we choose 77% as our number, we can only be off by a factor of about 1/4. Not bad for a passive technique involving numbers in the hundreds of billions.
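The arithmetic behind “off by a factor of 1/4” checks out: 77% is (near enough) the midpoint of the 55%–98.7% range, and the worst-case relative error from that midpoint stays under 29%:

```python
lower, upper = 0.55, 0.987  # Mr. Cutts' implied bounds on "A Large Fraction"
estimate = 0.77             # roughly the midpoint: (0.55 + 0.987) / 2 = 0.7685

# Worst-case relative error if the true value sits at either bound.
worst_error = max(upper - estimate, estimate - lower) / estimate
print(f"worst-case relative error: {worst_error:.0%}")
```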
It only remains to resolve the meaning of “Every Few Days”. For this I didn’t even need to dig into the mind of Mr. Cutts. The collective consciousness of the society in which Mr. Cutts lives and was raised (the USA) has time-honored and strictly adhered to partitions for phrases of duration: twice a day, daily, every couple of days, every few days, weekly, every couple of weeks, every few weeks, etc. “Every few days” sits comfortably between two and six days. Let’s choose four for safety.
Taking everything together, we can say that Google’s crawlers are capable of reading and indexing roughly 640 billion Web pages every day, within an accuracy of about 50%. Before you trash me for such wide margins, keep in mind that prior to today no one has been able to say anything more than “I have no fucking clue”.
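Putting the pieces together – the page total from facts (1) and (2), the conservative 55% lower bound for “a large fraction”, and the four-day refresh cycle – the whole back-of-the-envelope estimate fits in a few lines. This is my hedged guess, remember, not Google’s actual throughput:

```python
total_pages = 860_000_000 * 5_412  # ~4.65 trillion pages (facts 1 and 2)
large_fraction = 0.55              # lower bound on "a large fraction"
refresh_days = 4                   # middle of the 2-6 day "every few days" range

pages_per_day = total_pages * large_fraction / refresh_days
print(f"{pages_per_day:,.0f} pages indexed per day")  # roughly 640 billion
```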
Now, people more intrepid than myself might want to take this paltry effort and calculate the probable number of servers that Google owns. No one knows exactly what computers Google uses, and it is a question that many want answered.
However, I will leave these calculations to those mathematicians who fantasize about third-order partial differential equations while having sex with whatever elements of the physical world appeal to their fancies. As for me, I must find more peyote buttons so that I can finish my real article on Google and Personal Privacy, of which the above will be merely a footnote.
Reference: Google Emails to Myself:
——– Original message ——–
From: Cory Altheide
Date: 01/24/2014 1:52 PM (GMT-06:00)
Subject: Re: Availability to speak at Google in January?
Excellent – I’ll perform an email introduction with my colleague.
On Fri, Jan 24, 2014 at 11:23 AM, jdavidmcafee wrote:
I would be happy to speak at Facebook.
As an aside, I fear the reason that I am persona non
grata at Google is because I fucked the wives of
three of Google’s top executives last year (and
possibly one of their daughters – I’m not entirely sure. It’s possible that she or I was hallucinating.)
Women just can’t keep their mouths shut. I swear if
I could control my gag reflex I would turn gay.
——– Original message ——–
From: Cory Altheide
Date:01/24/2014 10:05 AM (GMT-08:00)
Subject: Re: Availability to speak at Google in January?
Hello again John –
I haven’t managed to get any traction in hosting you here, however, a colleague at Facebook is interested in hosting you and has high-level support in their security organization. Would you be amenable to this? If so, I could put you into contact with him, and would love to attend.
Let me know if you’d like to speak with him, and apologies again for the situation here –
On Wed, Dec 18, 2013 at 9:21 PM, Cory Altheide wrote:
I’m apparently unable to host you as a speaker at Google. Sorry to do this to you, but unfortunately it’s out of my hands (I’ve been overruled). I’m attempting to push back but at the moment it does not look good. My sincerest apologies, and I do hope to hear more about your new venture at some point in the future.
On Mon, Dec 9, 2013 at 10:32 AM, jdavidmcafee wrote:
Yes. January, after the 10th I am free