Big data is all data

What is “big data”? One convenient litmus test: when the volume, velocity or variety of data becomes too great to handle with conventional data processing tools and techniques, you know you’re dealing with big data. Indeed, the technologies most often associated with big data, such as Hadoop and MapReduce, are important because they make processing data characterized by the “three Vs” of volume, velocity and variety more cost-efficient and effective. However, this way of looking at the issue reduces big data to a technical challenge and misses what has become its real significance: finding new ways to use data to create business value. So if an organization singles out certain types of data (such as sensor data captured from its delivery vehicles, or clickstreams from a heavily accessed website) to the exclusion of other data, it is missing the real benefit of big data.

Think of it this way: for every physical device generating machine data, or customer sharing information about themselves on social media, there is a network of data that defines the business context and enriches any analytics that we might perform on the subject. Consider, for example, a delivery vehicle that generates geo-location and temporal data, as well as sensor readings from its mechanical systems such as engine performance, temperature and fuel consumption. A company might use that data to perform analytics to optimize routing, delivery schedules, service agreements, staffing and more. While we can marvel today at the amount of sensor data emanating from a modern delivery vehicle, in fact a lot of useful data about a vehicle (or any other major piece of equipment) is already percolating through many other systems. For example, purchase information about the vehicle, along with technical specifications, might be stored in an ERP system; information about the driver (training curriculum, years of experience, driving record, etc.) could be in an HR system; maintenance records could be in yet another system.

I could go on with examples, but you likely get the picture: most of the things generating new or raw data types are also referenced in many other systems around your enterprise. The same principle applies to clickstreams on websites, machine log files and other sources. Being able to connect to these systems and augment analytics with additional business context can make the analysis considerably more powerful.
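To make the delivery vehicle example concrete, here is a minimal sketch in Python of what augmenting sensor data with business context might look like. Everything in it is an assumption for illustration: the field names, the sample records, and the `enrich` helper are hypothetical, and real ERP, HR and maintenance systems would be reached through their own interfaces rather than in-memory dictionaries.

```python
# Hypothetical sample data; none of these records come from a real system.
sensor_readings = [
    {"vehicle_id": "V1", "driver_id": "D1", "fuel_l_per_100km": 14.2},
    {"vehicle_id": "V2", "driver_id": "D2", "fuel_l_per_100km": 9.8},
]

# Stand-ins for lookups against an ERP system, an HR system and a
# maintenance system, each keyed by the identifier it would plausibly use.
erp_vehicles = {
    "V1": {"model": "Van A", "purchase_year": 2015},
    "V2": {"model": "Van B", "purchase_year": 2021},
}
hr_drivers = {
    "D1": {"years_experience": 2},
    "D2": {"years_experience": 11},
}
maintenance_log = {
    "V1": {"last_service": "2023-01-10"},
    # No maintenance entry for V2: context is not always available.
}


def enrich(reading, erp, hr, maintenance):
    """Return a copy of a sensor reading with whatever context exists."""
    enriched = dict(reading)
    enriched.update(erp.get(reading["vehicle_id"], {}))
    enriched.update(hr.get(reading["driver_id"], {}))
    enriched.update(maintenance.get(reading["vehicle_id"], {}))
    return enriched


enriched_readings = [
    enrich(r, erp_vehicles, hr_drivers, maintenance_log)
    for r in sensor_readings
]

for row in enriched_readings:
    print(row)
```

With the context attached, a fuel consumption figure can be read alongside the vehicle's age, the driver's experience and the service history, instead of as an isolated number.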

And you shouldn’t ignore so-called unstructured content when considering data to include in a big data project. While social media analytics is a popular and widely explored use case in the big data world, there is really a whole universe of human-generated content to be mined. Most organizations manage vast amounts of human-generated content, ranging from mundane things like operating manuals to more interesting things like message archives, wikis, lab reports, customer interaction summaries, comments in survey responses and strategy documents. The list is endless. This data is usually stored in content management systems and other secured repositories that don’t lend themselves to easy access with typical big data analytical tools.

Here’s a simple taxonomy, or checklist, to consider when deciding what types of data to include in your next big data project. Not all types of data are available or relevant for every project, but it may be helpful to walk through these categories:

  • Sensor and machine data, which reflects the physical world or the performance of devices across the “Internet of Things”
  • Business applications and systems of record which contain the transactional records of the organization as well as information about business practices
  • Human-generated language and content of all kinds, whether in formal documents, message logs, reports or internal discussions
  • And finally, data from outside the organization, such as social media content

For some more thoughts on how organizations can expand the sources of data they incorporate into their big data projects, I invite you to listen to my podcast on the topic on the IBM Big Data Hub.