Oct 30, 2012
I presented last week at Xconomy's BIG DATA Forum. It wasn't until after I accepted that I realized I didn't know what the term big data actually means. So I turned to Wikipedia to find out. According to Wikipedia, "The term big data is a buzzword, and is frequently misused to mean any form of large-scale data or information processing." Well, that explains why I was invited to speak. "Large-scale data" is Carbonite in a nutshell. I wouldn't use the term "big data" for Carbonite. We're more like "Extremely Large Data" or "gigundous amounts of data."
Big data is far from a new idea, of course. Regression analysis, one of the first mathematical treatments of data sets, was described by the mathematician Carl Friedrich Gauss in the early 1800s. Back in the early 1990s, I was CEO of a company in Cambridge called Pilot Software and we did what at that time was called Data Mining. Our flagship product was a multidimensional database that was especially good at slicing and dicing very large transactional databases looking for patterns and anomalies. It was used by companies like McDonald's and JCPenney to uncover nuggets of useful management information in the vast quantities of transactional data produced by their cash registers. There are many other buzzwords that have come and gone as well, including business intelligence, data analytics, decision support, data dredging and so on.
So I had to ask myself, what makes big data different from the "data mining" I was doing more than a decade ago? As far as I can tell, it's mostly a matter of quantity, with data now coming from all kinds of devices and places, not just cash registers.
So, just why does big data warrant a new buzzword? I can think of three reasons: 1) it's probably harder to raise venture capital for a data mining company these days, 2) maybe it helps IT professionals justify bigger budgets, and 3) of course, how could you organize a successful conference without a new buzzword? ;)
Personally, I appreciate that Google can use big data to serve up an ad for something I might actually find interesting or that Netflix knows that I might prefer Stephen Hawking to The Texas Chainsaw Massacre. But I'd be cautious about all the hype about big data because it is no substitute for judgment. Judgment is actually the ultimate form of big data. You are taking a lifetime of experiences and applying them in a very sophisticated way to predict what's going to happen in the future. For example, with all the data that's available in the financial markets, with computers making millisecond high-velocity trades, and with MIT-trained physicists and mathematicians laboring away in the bowels of Goldman Sachs, there is still no way to replace a Peter Lynch or a Warren Buffett.
The stuff we do with computers is trivial by comparison. This morning when I was eating breakfast, my cat leapt a good six feet from the sink to the table and landed perfectly right between my newspaper and the cereal bowl. I thought, wow, now that's big data at work. Think of the accumulated experience and judgment that it took to pull that off. Think of the enormous quantity of visual data, the 3D image processing, the billions of neurons, the precise muscle control, the knowledge that if you land on the newspaper it's going to slide, and the knowledge that if you land on my cereal bowl you're going to get whacked. That's big data and it's humbling.
So that brought me back to the question: what was I doing talking about big data? Is Carbonite really big data or just "a lot of data"? Well, according to another definition I found on the web, "Big data involves the extraction of useful insight from volumes of data so large that traditional database tools cannot handle the workload." Well, we would certainly qualify on the "too large for traditional databases" front. In fact, we get over 300 million new files every day. And we add around a petabyte of storage every couple of weeks. I doubt that very many companies in New England have more data than Carbonite.
But as for the "insights," well, the whole idea in backup is you have this sacred trust with your customers that you're not supposed to know anything about them. And you're certainly not supposed to be able to glean any kind of insights from the contents of their computers. To break that sacred trust would probably spell doom in our business. We spend tens of millions a year on advertising to create a trusted brand. What does "trusted brand" mean? It means your data is private. Confidential. Secure. Nobody gets to look at it for any reason. We bend over backwards to encrypt at every step of the process. There are zero unencrypted customer files in our data centers. We hire white-hat hackers to try to compromise our security. We have PhDs on our team whose sole job is to prevent people from extracting insights from our customers' data. So just what do we do with all that data? We store it and give it back to our customers when they want it. That's it.
The one thing we may have in common with big data is that it takes very specialized technology just to deal with the volumes. Commercial databases and file systems aren't designed to work at this scale. I remember that when we were first starting out, we stored backed-up files using Windows' NTFS file system. When we got to about 500 million files, things got trickier. So we called Microsoft and they asked, "How many files do you have?" We told them, "About 500 million." There was a good 10 seconds of silence on the other end of the line. "Uh, NTFS wasn't really designed to handle that many files." Since then we have backed up over 200 billion files, and keeping track of all of them and ensuring that none of them are corrupted takes very specialized technology. Especially when you consider that it has to be done very inexpensively because we offer unlimited backup for $59 per year.
And all this new data comes salted with some interesting new privacy issues. While we most definitely don't want to know anything about the data on our customers' computers, we do know some things that help us provide a better service WITHOUT violating the sacred trust between us and our users. For example, we can suggest tips and user hints based on the type of computer or mobile devices our customers are using Carbonite with.
But, as I think about big data as a whole, one concern is that even data that is innocuous by social media or credit card company standards tilts the playing field away from the individual. For example, if I am shopping for a flight, it's easy for a travel site to figure out how badly I need to go on a certain day and price my ticket accordingly. Think about it. If you're on a plane with 200 other passengers, how many do you think paid the same price that you paid? It's the same with hotel rooms, or any number of other services. The problem is the systems know more about you than you probably want them to know, and that could potentially put you at a disadvantage.
Something else to consider is the potential for prying. In most countries, we assume our governments are benign, that the magazines we read, the books we take from the library, the causes we support, the purchases we make, even the places we visit, will not be used by some government to persecute us in the future. It doesn't even have to be a government – when you apply for a home mortgage for example, and you're denied, do you really always know the whole truth about why? It's all in your big data, but that doesn't mean the data is correct and it doesn't mean that you will ever know what's really there.
For those of you whose businesses are based on big data, I urge you to use caution when dealing with privacy issues. Ralph Waldo Emerson said, "There are many things of which a wise man might wish to be ignorant." I think that may apply to businesses as well.