With 2.5 quintillion bytes of data generated worldwide every day, the modern digital landscape is an ever-growing ocean of data, too big for the human brain to comprehend. But because technology is what brought us here in the first place, technology is also producing the tools to make sense of Big Data and learn things we never thought we could. If you're just starting to dig into the world of Big Data, you're in the right place. In this article we'll look at what Big Data is, Big Data examples, Big Data tools and, of course, the 3 Vs of Big Data.
According to the Oxford English Dictionary, Big Data stands for “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”
To put “extremely large” data sets in perspective, let’s think of little data as an amount of data small enough to be understood by the human brain. For example, a market research agency puts out a 5-page document of qualitative findings on perceptions of orange juice. Or, to make it even simpler, “little data” could be your web browser history or your bank account movements for the last month.
Now, understanding the size of Big Data is tricky because it keeps expanding. The amount of data generated in 2010 was 1.2 trillion gigabytes, and in 2020 it will be around 40 trillion gigabytes. Because of this, we can’t specify an exact range of data sizes that would count as Big Data.
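To get a feel for how fast that is, the two figures above can be turned into a growth factor and an average yearly growth rate. The snippet below is only a back-of-the-envelope sketch: the function names are made up for illustration, and it assumes smooth compound growth between 2010 and 2020, which real data volumes need not follow.

```python
# Figures quoted in the article, in trillions of gigabytes.
DATA_2010 = 1.2
DATA_2020 = 40.0
YEARS = 2020 - 2010


def growth_factor(start, end):
    """How many times larger the end value is than the start value."""
    return end / start


def annual_growth_rate(start, end, years):
    """Average yearly growth rate, assuming smooth compound growth."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    factor = growth_factor(DATA_2010, DATA_2020)
    rate = annual_growth_rate(DATA_2010, DATA_2020, YEARS)
    print(f"Growth over the decade: about {factor:.0f}x")
    print(f"Implied average yearly growth: about {rate:.0%}")
```

Under that assumption, the yearly volume multiplies by roughly 33 in ten years, which is why any fixed size threshold for "big" goes out of date quickly.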
The key to the answer is what follows in the definition: “data sets that may be analyzed computationally to reveal patterns, trends, and associations.” In this sense, we understand Big Data as a massive repository of data that can reveal new information about something, information that would be impossible to uncover without the computing power to process it. Big Data has the potential to produce amazing insights, but because of its immensity and complexity, it requires some polishing first.
- 500 million tweets are sent per day.
- Every hour, Amazon makes more than $17 million in sales on average.
- Google handles 3.8 million searches per minute on average across the globe.
- 85 million households watched at least two minutes of Netflix’s original movie “Spenser Confidential.”
In the last example, imagine the data that 85 million households watching a single movie can generate: where they are located, what devices they are using, at what moments they pause the movie, what languages they speak, what subtitles they use. Digging deep into the data and extracting clear and reliable information about audience behavior should help Netflix create more successful content in the future.
The examples above are the most common and massive sources of Big Data that are available today. But what about Big Data for smaller businesses?
If you have machinery on a production line, then using data from its sensors could help you increase efficiency.
If you are a retail company, then you might use data from social media hashtags, web search trends and even the weather forecast to improve your stock and product intelligence.
If your business has a supply chain and delivery routes, then GPS and sensors on your trucks could help tremendously: by analyzing data on time, traffic, routes and so on, you can optimize your processes.
Diagram by Hassanin M. Al-Barhamtoshy
More than just describing its individual aspects, the 3 Vs of Big Data help us define what Big Data is by comparing it to “little data”.
Volume: Volume, as its name suggests, refers to the amount of data that needs to be processed, and it may vary tremendously from one business to another. For example, hospitals around the world generate more than 2,000 exabytes of information annually, 1.9 million vehicles pass a specific intersection in Toronto per day, and 350 million photos are uploaded to Facebook every day.
Velocity: Velocity relates to the speed at which data is acquired and then processed. Some Big Data analysis requires higher speeds than others. For example, Google has to look through billions of websites every time you search, and it returns an ordered list of results in just a matter of seconds. Sensors in cars also have to process data at extremely fast rates to warn you about possible collisions.
Variety: This refers to the various data types: structured, unstructured and semi-structured. Structured data exists in clear, predefined formats, making it easier to analyze and understand. Think about a spreadsheet that holds information such as names, pre-existing conditions, addresses, credit card information, etc.
Unstructured data, on the other hand, does not fit into a specific field or column in a spreadsheet, making it harder to analyze. Think about images, like X-rays, or any other form of content like audio and video files, the text of an email, presentations, PDFs, etc.
Semi-structured data is a mix of the two: data that is fluid but still carries certain classifying characteristics. Think about the metadata of a photo taken with your phone. It probably has a time and date stamp, geolocation and resolution, as well as information about the device it was taken with.
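The difference between the three types is easier to see side by side. The sketch below uses Python's standard `csv` and `json` modules with invented sample values (the names, cities, coordinates and email text are all made up for illustration): a CSV table stands in for structured data, a JSON document for semi-structured photo metadata, and a plain string for unstructured text.

```python
import csv
import io
import json

# Structured: every record has the same predefined columns.
csv_text = "name,city\nAna,Toronto\nLuis,Lima\n"
structured_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys, but fields can be nested
# and may differ from one record to the next.
photo_metadata = json.loads(
    '{"taken_at": "2020-03-06T10:15:00", "device": "phone",'
    ' "gps": {"lat": 43.65, "lon": -79.38}}'
)

# Unstructured: free text with no fields at all. Extracting meaning
# from it takes extra processing, such as text analysis.
email_body = "Hi team, the shipment left this morning and should arrive Friday."

if __name__ == "__main__":
    print(structured_rows[0]["city"])    # look up a value by column name
    print(photo_metadata["gps"]["lat"])  # look up a nested value by key
    print(len(email_body.split()))       # all we can do directly is count words
```

Structured and semi-structured values can be looked up by name, while the email only yields crude measures such as a word count until more analysis is applied.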
The two extra Vs for understanding Big Data
The digital revolution has created enormous amounts of data that continue to grow every day. To decide whether something should be classified as Big Data, experts have now added two extra Vs that make for a more practical and usable understanding of Big Data.
Veracity: This term refers to the accuracy and trustworthiness of the generated data. The quality of any Big Data analysis depends on the quality of the data behind it. It is very common for data sets to contain plenty of errors that would distort any insights drawn from them.
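In practice, veracity usually starts with simple sanity checks before any analysis is run. The following is a minimal sketch, assuming hypothetical patient records with an `age` and a `visit_date` field; the field names, the sample records and the 0 to 120 age range are illustrative assumptions, not rules from any particular system.

```python
from datetime import datetime


def is_valid(record):
    """Return True if the record passes a few basic sanity checks."""
    try:
        age = int(record["age"])
        # strptime raises ValueError for impossible dates.
        datetime.strptime(record["visit_date"], "%Y-%m-%d")
    except (KeyError, ValueError, TypeError):
        return False
    return 0 <= age <= 120


# Hypothetical sample records, three of them deliberately broken.
records = [
    {"age": "34", "visit_date": "2020-02-11"},
    {"age": "-5", "visit_date": "2020-02-12"},  # impossible age
    {"age": "51", "visit_date": "2020-13-40"},  # impossible date
    {"visit_date": "2020-02-14"},               # missing field
]

clean = [r for r in records if is_valid(r)]
rejected = len(records) - len(clean)

if __name__ == "__main__":
    print(f"kept {len(clean)} record(s), rejected {rejected}")
```

Filtering out (or repairing) records like these before analysis keeps obvious errors from leaking into the results.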
Value: Analysing Big Data has to generate value. For example, in the health industry, Big Data analysis has to enable faster disease detection, better treatment and reduced costs.
Over the years, various frameworks have been developed for storing and processing Big Data. Examples include Cassandra, Hadoop and Apache Spark. Hadoop, for example, uses a distributed file system to store Big Data. Distributed means that files are stored across multiple machines. Let’s say you have a huge file: Hadoop would break that file down into smaller chunks and store them on different machines.
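The idea of chunking a file and spreading it over machines can be illustrated in a few lines. This is a toy sketch in plain Python, not Hadoop's API: the function names and the `node-1`/`node-2` machine names are hypothetical, chunks are assigned in simple round-robin order, and a real distributed file system would use far larger chunks and also keep extra copies of each one.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into consecutive chunks of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def assign_round_robin(chunks, machines):
    """Place chunk 0 on the first machine, chunk 1 on the next, and so on."""
    placement = {machine: [] for machine in machines}
    for index, chunk in enumerate(chunks):
        machine = machines[index % len(machines)]
        # Keep the chunk index so the file can be rebuilt in order later.
        placement[machine].append((index, chunk))
    return placement


def reassemble(placement):
    """Collect the chunks from every machine and join them in original order."""
    indexed = [pair for stored in placement.values() for pair in stored]
    indexed.sort(key=lambda pair: pair[0])
    return b"".join(chunk for _, chunk in indexed)


if __name__ == "__main__":
    data = b"abcdefghij"
    placement = assign_round_robin(split_into_chunks(data, 4), ["node-1", "node-2"])
    print(placement)
    print(reassemble(placement) == data)
```

Storing the index next to each chunk is what lets the file be rebuilt correctly no matter which machine answers first.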
You can then use Motivus to start processing that data. The High Performance Computing Network would break task A into smaller tasks B, C and D, send those tasks to different machines to be analyzed in parallel, and then assemble the results in real time. Processing becomes easy, fast and collaborative, since the computing power comes from machines all around the globe.
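The split, process and assemble pattern described above can be sketched on a single computer. The example below does not use the Motivus API: it relies on Python's standard `concurrent.futures` module, with local threads standing in for remote machines, and the task itself (summing squares) and all function names are assumptions chosen only to keep the illustration small.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(numbers, parts):
    """Break the full task into at most `parts` smaller chunks."""
    if not numbers:
        return []
    size = -(-len(numbers) // parts)  # ceiling division
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]


def subtask(chunk):
    """The work done on one chunk: here, a sum of squares."""
    return sum(x * x for x in chunk)


def run_in_parallel(numbers, workers=3):
    """Run the subtasks concurrently, then assemble the partial results."""
    chunks = split_task(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(subtask, chunks))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_in_parallel(list(range(10))))
```

The combined result matches what a single machine would compute on its own; the difference in a real network is that each subtask can run on separate hardware at the same time.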