Does Big Data Really Exist?
I've been scratching my head for some time now on this topic of "Big Data". What is Big Data? Where did it come from? Most importantly, I've been wondering: what are people doing with all this "Big Data" that makes it such a craze in IT today? After much time spent noodling on this subject, I've come to the (current) conclusion that "Big Data" really doesn't exist. It's just data, and all the hype around it is really about "excellence in statistical graphics": complex ideas communicated with clarity, precision and efficiency. In this post, I hope to demonstrate that "Big Data" itself does not exist, and that the real story behind the hype is unbiased visualization, graphics that reveal the information hidden in data.
Most academics and industry analysts define "Big Data" using the 3Vs: Volume, Velocity and Variety. Others describe "Big Data" as the digital traces of human activity. In their 2012 article in the International Journal of Internet Science, Snijders, Matzat and Reips loosely define "Big Data" as data sets so large and complex that they become awkward to work with using standard statistical software.
With these definitions in mind, when does data become "Big"? Consider the volume aspect: what appears "Big" today will surely look minuscule by the standards of the next decade, just as the data sets we worked with 20 years ago would have looked "Big" to anyone a hundred years ago. Volume, I would offer, is relative, and in the age of the Internet, volumes will only keep growing. Today's "Big" data will appear small next to tomorrow's data, and as a civilization we may never uncover the "Biggest Data", because we will always be testing the bounds of infinity.
As digital devices continue to get smaller and the ability to move data from point to point becomes ever faster, we increasingly run up against speed-of-light issues. Applying the velocity aspect of "Big Data", and assuming technology evolves to the point where all data moves at or above 99.7% of the speed of light, does "Big Data" cease to exist once velocity is constant across all forms of transmission? Or does only data that moves at the speed of light qualify for the grand title of "Big"? The velocity of data will keep climbing as researchers continue to search for ways to move information faster still.
Curious minds have always asked more of data. Consider this catchy video about some creative uses of "Big Data". In it, the producers call out law enforcement's use of "Big Data" to predict crime: sophisticated algorithms that forecast where crimes are likely to happen. We've heard countless other stories of creative uses of "Big Data": facial recognition on social media, identification of enemy combatants on the battlefield with high-speed image recognition, and recently, the discovery of the Higgs Boson (the "Biggest Datum", dare I say?). Did the very recent emergence of "Big Data" finally allow someone to arrive at the conclusion that correlating past criminal activity might help predict future crimes? Or did the ability to capture, curate and visualize data allow us to finally answer an age-old question? I'm sure plenty of smart people said to themselves, "if only I had this data, I could answer that question", long before the data got "Big".
"Big Data" is nothing more than the increasingly efficient capture and curation of data, coupled with the means to visualize it in a way that tells a story. With advancements in data storage techniques such as high-speed key-value indexing, inverted indexing and document-based storage, we're able to relate data sets to one another in ways that were previously difficult. With advancements in storage hardware such as faster, higher-capacity solid-state devices, we're able to store more for less and access it faster. Finally, with distributed computing models that extend beyond a single server to a group of servers (affectionately known as Map-Reduce), we're able to decompose extremely large data sets into smaller, manageable chunks for analysis.
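To make that last point a bit more concrete, here is a minimal sketch of the map-reduce pattern in Python. The word-count task, the one-line-per-chunk split and the use of the standard multiprocessing module are my own illustrative choices, not a description of any particular framework.

```python
# A toy illustration of the map-reduce pattern: split a large input into
# chunks, "map" each chunk to partial word counts in parallel, then
# "reduce" the partial results into a single answer.
from collections import Counter
from multiprocessing import Pool


def map_chunk(lines):
    """Map step: count words in one chunk of the data set."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the partial counts from every chunk."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


if __name__ == "__main__":
    # Stand-in for a "Big" data set: a handful of lines of text.
    data = [
        "big data is just data",
        "data wants to tell a story",
        "visualization reveals the story in the data",
    ]

    # Decompose into smaller, manageable chunks (one line per chunk here).
    chunks = [[line] for line in data]

    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)

    print(reduce_counts(partials).most_common(3))
```

The point isn't the word counting itself; it's that each chunk can be processed independently, which is what lets a group of servers tackle a data set no single machine could handle on its own.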
"Big Data" is just data. Whether it is digital or otherwise, the capture and curation of data that may have been only available to a few can now be made available to curious minds the world over. The means of gleaming information and presenting it in an unbiased, empirical manner to enable the viewer to answer questions about relationship, causation and correlation is the real enabler of advancement. As the data sets get bigger, so too will the computing and storage capabilities needed to analyze it, and so too will the creativity around effective means to store and retrieve related information. Visualization techniques will pave the way forward to answering many age old questions and some of these answers will lead to more questions that "data" itself may hold the answers to.
Perhaps this is a great time to trademark the term "Big Answers" and figure out how to market it at a later date. Thanks for reading, and please feel free to pass along constructive comments, thoughts and feedback!