Technology Advisor - Understanding the healthcare data lake

April 17, 2018
By Bipin Thomas

In the health care industry’s transition to consumer-centric and value-based care, we need plenty of data, ranging from outcomes, biometric and socioeconomic data to genomic and familial data.

Assembling this complete ecosystem of data will increase the total amount of health care data exponentially. According to IDC research, the health care industry will generate 44 zettabytes of data by 2020. Even health care organizations that have enterprise data warehouses will be challenged to handle such volumes of data.

A data lake is an open reservoir for the vast amounts of data inherent in health care. Health care organizations do not have enough time and resources to map all of that data up front. A data lake brings value to health care because it stores all the data in a central repository and maps it only as needs arise. Determining how to structure data before it’s brought in, although common in health care, wastes time, money and resources. When data is first stored in the data lake, it’s impossible to know how best to structure it, since not all of its use cases are known yet. Bringing data in first and then adding structure as use cases arise is the right approach in health care because it avoids long-term projects that ultimately fail.
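The idea of storing data first and adding structure only when a use case arises can be illustrated with a minimal Python sketch. The sample records, field names and the `read_claims` helper are hypothetical, invented purely for illustration; a real lake would hold files in a distributed store rather than an in-memory list.

```python
import json

# Raw storage: records are kept exactly as received, with no upfront schema.
# These sample records and field names are made up for illustration.
raw_store = [
    json.dumps({"source": "claims", "patient_id": "p1", "amount": "120.50"}),
    json.dumps({"source": "notes", "patient_id": "p1", "text": "follow-up in 2 weeks"}),
    json.dumps({"source": "claims", "patient_id": "p2", "amount": "75.00", "code": "A10"}),
]


def read_claims(store):
    """Apply a claims structure at read time, only for the use case that needs it."""
    rows = []
    for line in store:
        record = json.loads(line)
        if record.get("source") != "claims":
            continue  # other sources stay in the lake, untouched and unmapped
        rows.append({
            "patient_id": record["patient_id"],
            "amount": float(record["amount"]),  # the type is assigned on read
        })
    return rows


# A use case arises: total claim amounts per patient.
claims = read_claims(raw_store)
total_by_patient = {}
for row in claims:
    pid = row["patient_id"]
    total_by_patient[pid] = total_by_patient.get(pid, 0.0) + row["amount"]
```

Nothing about the physician note had to be decided to answer the claims question, and the raw records remain available for use cases that have not been thought of yet.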

The role of a data lake in the health care industry is essential: it creates broad data access and usability across the enterprise. Data lake benefits include improvements in scale, schema flexibility, processing workloads, data accessibility, handling of data complexity and data usability. A data lake is the preferred choice for large structured and unstructured datasets coming from multiple internal and external sources, such as radiology, physician notes and claims. Because it doesn’t demand definitions for the data it ingests, it removes data silos, and the data can be refined once the questions are known. A data lake also offers great flexibility in the tools and technology used to run queries. These benefits are instrumental to socializing data access and developing a data-driven culture across the health care organization. A data lake is prepared for the future of health care data, with the ability to integrate patient data from implanted monitors and wearable devices.

A data lake can scale to petabytes of structured and unstructured data and can ingest data at a variety of speeds, from batch to real time. Unfortunately, these capabilities have led to a negative side effect. Gartner’s hype cycle for 2017 shows that data lakes have passed the peak of inflated expectations and have started the slide into the trough of disillusionment. Initially, data lakes were predicted to solve all of health care’s outcomes problems, but they have ended up just collecting petabytes of data. Now, data lake users see a lot of detritus that can’t be used to build anything. The data lake has become a data swamp.

Understanding and creating zones within a data lake is the key to draining the swamp. Data lake zones provide structural governance for the assets in the data lake. These zones allow the logical and physical separation of data, which keeps the environment secure, organized and agile. Zones are created physically, through dedicated servers or clusters, or virtually, through the deliberate structuring of directories and access privileges.
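As a rough sketch of the virtual approach, the Python below models zones as top-level directories with a set of roles allowed to read each one. Everything named here is an assumption made for illustration: the `/datalake` root, the placeholder zone names `zone_1` through `zone_4`, the role names and which role may read which zone. In practice, access privileges would be enforced by the storage platform rather than by application code.

```python
from pathlib import PurePosixPath

LAKE_ROOT = PurePosixPath("/datalake")  # hypothetical root path

# Hypothetical zone names, roles and permissions. Organizations label
# their zones differently, so these are placeholders only.
ZONE_READERS = {
    "zone_1": {"data_engineer"},
    "zone_2": {"data_engineer", "data_scientist"},
    "zone_3": {"data_engineer", "data_scientist", "analyst"},
    "zone_4": {"data_scientist"},
}


def zone_path(zone, source, filename):
    """Build the path for a file, keeping each zone under its own directory."""
    if zone not in ZONE_READERS:
        raise ValueError(f"unknown zone: {zone}")
    return LAKE_ROOT / zone / source / filename


def can_read(role, path):
    """Decide access from the zone directory that the path sits in."""
    try:
        zone = PurePosixPath(path).relative_to(LAKE_ROOT).parts[0]
    except (ValueError, IndexError):
        return False  # outside the lake, or the lake root itself
    return role in ZONE_READERS.get(zone, set())


claims_file = zone_path("zone_1", "claims", "2018-04.json")
```

Here `can_read("data_engineer", claims_file)` is true while `can_read("analyst", claims_file)` is false, so the directory a file lands in determines who can see it.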

Health care analytics architectures need a data lake to collect the sheer volume of raw data that comes in from the various transactional source systems used in health care, such as electronic medical records, billing data and costing data. Data then flows into the various zones within the data lake. To effectively allocate resources for building and managing the data lake, it helps to define each zone, understand how the zones relate to one another, know the types of data stored in each zone and identify each zone’s typical user.

Data lakes are typically divided into four zones to make them work. Health care organizations may label these zones differently according to their preference, but their functions are essentially the same. Each zone is defined by the level of trust in the resident data, the data’s structure and future purpose, and the user type.

Ultimately, a data lake helps health care organizations run their operations as a business. Real-time insights and predictive models mean fewer complications, fewer unnecessary emergency room interventions and higher levels of wellness across the population — all at a reduced cost.

About the author: Bipin Thomas is a renowned thought leader on consumer-centric health care transformation. Thomas is a board member of HealthCare Business News magazine. He is a senior executive at Flex, where he is leading business innovation by deploying cross-industry solutions with intelligent products and connecting key industry stakeholders. Thomas is a former senior executive at Accenture and UST Global, where he implemented strategic digital initiatives across the care continuum.