Data lake as the basis for Industry 4.0

For a long time, the data warehouse was the central source for all data analyses. In the course of increasing digitalization and the associated mass of available data, the data lake has now overtaken the classic data warehouse. Numerous use cases in the context of Industry 4.0 are inconceivable without a suitable data platform based on the data lake concept.

The data architecture

The right architecture for analytical (dispositive) data processing seemed clearly defined since the 1990s. An ideally singular (enterprise) data warehouse collects the relevant data from the various operational source systems in a hub-and-spoke approach and harmonizes, integrates, and persists it in a multi-layered data integration and refinement process. From a data perspective, a single point of truth is to arise, from which data extracts, usually in multidimensional format, are then provided in dedicated data marts for different applications. Users access this treasure trove of data via reporting and analysis tools (business intelligence). The primary focus is on the backward-looking analysis of key figures along consolidated evaluation structures.

Characteristics of a data warehouse

An essential characteristic of the data warehouse is that it represents a valid and consolidated truth across all structured data in a company. In addition to this uniform view of company data, the data warehouse provides the data for evaluation in an optimized manner within a strict, previously defined data model. These high standards of correctness and harmonization usually mean that it takes a long time for data from a new source to be integrated into the consolidated view, because considerable design and coordination effort is required in advance.

Fast data preparation with the data lake

With the emergence of new data sources such as social media or IoT, the need to make these available on a data platform grew. Much of this data arrives in semi-structured or unstructured form. With the increasing relevance of these sources, the idea of the data lake was born. The data lake aims to make all source data, internal and external, structured and polystructured, available in unprocessed form as raw data, in order to have it accessible as quickly as possible. Efficient handling of large data volumes, fast processing of data streams, and mastery of complex analyses take priority in the data lake, at the expense of data harmonization and integration.
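
To make the raw-data idea concrete, here is a minimal Python sketch of a "raw zone" landing step: a record is persisted exactly as received, with no harmonization or cleansing. The paths, the partitioning layout, and the function name are illustrative assumptions, not part of any specific product.

```python
import json
import pathlib
from datetime import datetime, timezone

def land_raw_event(payload: dict, source: str, root: str = "/lake/raw") -> pathlib.Path:
    """Persist an event exactly as received, partitioned by source system and day."""
    now = datetime.now(timezone.utc)
    target_dir = pathlib.Path(root) / source / now.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{now.strftime('%Y%m%dT%H%M%S%f')}.json"
    # Deliberately no schema enforcement here: the lake keeps the raw record,
    # and any structure is imposed later, at read time.
    target.write_text(json.dumps(payload))
    return target

# Example: an IoT reading lands unmodified in the raw zone.
land_raw_event({"machine_id": "m-042", "temperature_c": 71.3}, source="iot")
```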

Data warehouse vs. data lake

Compared to the data warehouse, the data lake puts the integration of diverse data sources with the greatest possible agility and flexibility in the foreground, in order to create the data basis for a variety of advanced analyses that are usually not yet defined at the time the data is stored. The data lake is an El Dorado for data scientists who want to carry out exploratory analyses such as cluster and association analyses, simulations, and predictions using complex algorithms. It is therefore also clear that a data lake does not replace a data warehouse but complements it. Both architecture concepts remain relevant and serve different use cases.

| | Data warehouse | Data lake |
| --- | --- | --- |
| Data | No raw data storage; structured data only; schema-on-write: data is transformed into a predefined schema before being loaded into the data warehouse | Raw data storage; flexible with regard to data structure (structured and unstructured); schema-on-read: the schema is recognized automatically during the reading process |
| Processing | The data layer and the processing layer are inextricably linked | Very flexible, because different frameworks are available for different processing tasks |
| Analytics | Descriptive statistics | Advanced analytics |
| Agility | Lower agility; fixed configuration; ad-hoc analyses are not possible | High agility; customizable configuration; ad-hoc analyses are possible |
| Security | Mature | Multiple configurations are necessary due to the multitude of technologies used within a data lake; security policies are more complex |
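
The schema-on-write versus schema-on-read contrast from the table can be illustrated with a short PySpark sketch. All paths, column names, and the example schema are hypothetical; the point is only where the schema is applied: before persisting (warehouse style) or at read time (lake style).

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-write (data warehouse style): the target schema is fixed up
# front and applied when the data is loaded and persisted.
warehouse_schema = StructType([
    StructField("machine_id", StringType(), nullable=False),
    StructField("measured_at", TimestampType(), nullable=False),
    StructField("temperature_c", DoubleType(), nullable=True),
])
curated = spark.read.schema(warehouse_schema).json("/lake/raw/sensor_events/")
curated.write.mode("append").parquet("/warehouse/sensor_facts/")

# Schema-on-read (data lake style): raw files are stored as-is; the
# schema is only inferred when the data is actually read.
raw = spark.read.json("/lake/raw/sensor_events/")
raw.printSchema()  # structure discovered at read time, not at load time
```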

There are two major technical drivers for the use of data lakes in industry: the optimization of production, and the offering of better or new products, in some cases entirely new business models. The basic use case here is the "digital twin", i.e. the digital image of the machines produced in-house, connected to the data lake with near-real-time data. Two major obstacles have to be overcome in practice: the master data required for materials and components is stored in systems of different organizational units that so far have not communicated with one another automatically. In addition, different technical protocols are used at the machine level, so that communication components must first be retrofitted as a prerequisite for data availability.
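
To illustrate the near-real-time connection of a digital twin, the following is a hypothetical Python sketch that merges machine telemetry from a message stream into a twin state, assuming a Kafka topic and the kafka-python client; the topic, broker address, and field names are invented for the example.

```python
import json
from kafka import KafkaConsumer  # kafka-python client, assumed installed

# Hypothetical topic and broker; in practice these would sit behind the
# retrofitted communication components mentioned above.
consumer = KafkaConsumer(
    "machine-telemetry",
    bootstrap_servers="broker:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

twin_state: dict[str, dict] = {}  # latest known signals per machine

for message in consumer:
    event = message.value  # e.g. {"machine_id": "m-042", "signals": {...}}
    # Merge the incoming sensor reading into the twin's current state.
    twin_state.setdefault(event["machine_id"], {}).update(event["signals"])
```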

The technology

The first-generation data lakes were systems based on the Apache Hadoop stack, run in the company's own data center. With these early platforms, the complexity of the technology, consisting of numerous open-source components, combined with the required timeliness, was challenging. Due to the changed market situation of commercial distribution providers and the general strategy of increased cloud use, this is shifting with second-generation data lakes: with native cloud services and/or dedicated managed Hadoop environments, managing the base platform becomes massively simpler. The entry barrier has thus fallen, and today data lakes are viable for almost any company size.

However, the recommendation remains valid: only adopt the technology once a clear use-case evaluation and prioritization have been defined on a roadmap as the cornerstone of the initiative!

Choosing the right technology

The initial selection of components must be carefully considered, and alternatives from the market of commercial, open-source, and cloud-service options must be continuously scouted and evaluated in order to create optimal added value for the company.

When selecting components for one's own company, the functional requirements of industrial use are joined by the protection of trade secrets from (global) competitors and by legal aspects, such as operating the platform with data from countries whose laws geographically restrict data exchange. A special challenge for machine manufacturers is accessing the data of their own machines in the customer context, since machines from different manufacturers are often used in combination and customers, in turn, do not disclose all data in order to protect their own business.

Another area of tension lies between the requirements of productive use cases and the needs of data science users. Here, too, the approach has changed over time: while the initial attempt was to build platforms that could serve all usage profiles, from providing an API for a customer portal with high response-time requirements to complex analytical queries, splitting the workloads across different technical platforms has proven more practicable.

The key conditions in practice

When setting up a data lake initiative, certain key conditions prove in practice to be the basis for successful implementation, similar to those for implementing a central data warehouse: a strong management commitment to building and using a central platform, and the resulting close cooperation between business departments and production IT, possibly also product development, which in many cases has yet to be established, are elementary. In this way, not only are diverse data brought together, but also knowledge about this data, such as the signals of individual sensors and the interpretation of their states as an overall system.

Last but not least, the operation of a data lake must be set up flexibly and holistically: a DevOps team that continuously develops the platform and keeps it running stably has proven to be best practice.

Conclusion

In conclusion: every Industry 4.0 initiative needs a data lake platform. The technological entry barrier has fallen, but well-founded planning of the architecture is still required. The basis should be a roadmap of evaluated and prioritized use cases. To maximize the resulting long-term added value, the necessary organizational prerequisites for the successful use of a data lake platform must be created alongside the technology.

Dr. CARSTEN DITTMAR

Dr. Carsten Dittmar is Partner and Area Director West at Alexander Thamm GmbH, where he also heads the Strategy Practice. For over 20 years he has worked intensively in the fields of business analytics, data science, and artificial intelligence, with a focus on strategic and organizational advice for data-driven initiatives. He is a European TDWI Fellow, the author of various specialist publications, and a speaker at numerous industry events.