Data lake as the basis for Industry 4.0
For a long time, the data warehouse was the central source for all data analyses. With increasing digitalization and the associated mass of available data, the data lake has now overtaken the classic data warehouse. Numerous use cases in the context of Industry 4.0 are inconceivable without a suitable data platform based on the data lake concept.
The data architecture
The right architecture for dispositive data processing was clearly defined from the 1990s onward. An ideally singular (enterprise) data warehouse collects the relevant data from the various operative source systems in a hub-and-spoke approach and harmonizes, integrates and persists it in a multi-layered data integration and refinement process. From a data perspective, a single point of truth emerges, from which data extracts - usually in a multidimensional format - are stored in dedicated data marts for different applications. Users access this treasure trove of data via reporting and analysis tools (business intelligence). The primary focus is on the rather backward-looking analysis of key figures along consolidated evaluation structures.
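The layered refinement described above can be sketched in a few lines. The following is a minimal illustration only: the table names, source systems and the "orders" key figure are hypothetical, chosen to show the staging → core → mart flow, not any specific product.

```python
import sqlite3

# Hypothetical hub & spoke flow: two source systems deliver order data
# in different formats; the warehouse harmonizes them in a core layer
# and derives a consolidated key figure for a data mart.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging_crm (order_id TEXT, amount_eur REAL)")
con.execute("CREATE TABLE staging_erp (ordernr TEXT, amount_cents INTEGER)")
con.executemany("INSERT INTO staging_crm VALUES (?, ?)",
                [("A-1", 100.0), ("A-2", 250.5)])
con.executemany("INSERT INTO staging_erp VALUES (?, ?)",
                [("B-1", 9900)])

# Core layer: one harmonized schema (schema-on-write) for all sources,
# including unit conversion (cents -> euros).
con.execute("""
    CREATE TABLE core_orders AS
    SELECT order_id, amount_eur FROM staging_crm
    UNION ALL
    SELECT ordernr, amount_cents / 100.0 FROM staging_erp
""")

# Data mart: a backward-looking, consolidated evaluation.
total = con.execute(
    "SELECT ROUND(SUM(amount_eur), 2) FROM core_orders").fetchone()[0]
print(total)  # → 449.5
```

The point of the sketch is that harmonization (naming, units, structure) happens *before* the data reaches the evaluation layer - the defining property of the warehouse approach.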
Characteristics of a data warehouse
An essential characteristic of the data warehouse is that it represents a valid and consolidated truth across all structured data in a company. In addition to this uniform view of the company data, the data warehouse provides the data for evaluation in an optimized manner in a strict, previously defined data model. These high standards of correctness and harmonization usually mean that it takes a long time until data from a new source is integrated into the consolidated view, because considerable design and coordination effort is required in advance.
Fast data preparation with the data lake
With the emergence of new data sources such as social media or IoT data, the need to make these available on a data platform increased. Much of this data is available only in semi-structured or unstructured form. With the growing relevance of these sources, the idea of the data lake was born. The data lake aims to make all source data - internal and external, structured and polystructured - available in unprocessed form as raw data, so that they are usable as quickly as possible. The efficient handling of large data volumes, fast processing of data streams and the mastery of complex analyses are in the foreground of the data lake, at the expense of harmonization and integration of the data.
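The "raw data first" idea can be illustrated with a minimal landing-zone sketch. All names here (the `sensor_gateway` source, the `ingest_date=` partitioning convention) are hypothetical assumptions; the point is only that events are stored unchanged, with no schema enforced at write time.

```python
import datetime
import json
import pathlib
import tempfile

# Hypothetical landing zone: IoT events are persisted unchanged as raw
# JSON, partitioned by source and ingestion date. Any harmonization is
# deferred to read time (schema-on-read).
lake_root = pathlib.Path(tempfile.mkdtemp())

def land_raw(event: dict, source: str) -> pathlib.Path:
    day = datetime.date.today().isoformat()
    target = lake_root / source / f"ingest_date={day}"
    target.mkdir(parents=True, exist_ok=True)
    path = target / f"{event['event_id']}.json"
    path.write_text(json.dumps(event))  # stored as-is, no schema check
    return path

p = land_raw({"event_id": "42", "temp_c": 21.3, "extra": {"rpm": 900}},
             "sensor_gateway")
```

Because nothing is rejected or reshaped at ingestion, a new source can be connected in hours rather than after a lengthy modeling effort - exactly the trade-off against the warehouse described above.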
Data warehouse vs. data lake
Compared to the data warehouse, the data lake places the integration of diverse data sources with the greatest possible agility and flexibility in the foreground, in order to create the database for a variety of advanced data analyses that are usually not yet defined at the time the data is stored. The data lake is the El Dorado for data scientists who want to carry out exploratory analyses such as cluster and association analyses, simulations and predictions using complex algorithms. It is therefore also clear that a data lake does not replace a data warehouse, but complements it. Both architecture concepts are relevant and serve different use cases.
| | Data warehouse | Data lake |
| --- | --- | --- |
| Data | No raw data storage<br>Structured data<br>Schema-on-write: data is transformed into a predefined schema before being loaded into the data warehouse | Raw data storage<br>Flexible with regard to data structure (structured and unstructured)<br>Schema-on-read: the schema is derived during the reading process |
| Processing | The data layer and the processing layer are inextricably linked | Very flexible, because different processing frameworks are available for different tasks |
| Analytics | Descriptive statistics | Advanced analytics |
| Agility | Lower agility<br>Fixed configuration<br>Ad-hoc analyses are not possible | High agility<br>Customizable configuration<br>Ad-hoc analyses possible |
| Security | Mature | Multiple configurations are necessary due to the multitude of technologies used within a data lake<br>Security policies are more complex |
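The schema-on-write vs. schema-on-read row of the table can be made concrete with a short sketch. The schema, field names and sample event below are hypothetical, chosen only to contrast the two loading philosophies.

```python
import json

# Schema-on-write (warehouse style): a fixed schema is enforced at load
# time; a record that does not match it is rejected before storage.
SCHEMA = {"machine_id": str, "temp_c": float}

def load_into_warehouse(record: dict) -> dict:
    if set(record) != set(SCHEMA) or not all(
            isinstance(record[k], t) for k, t in SCHEMA.items()):
        raise ValueError("record does not match warehouse schema")
    return record

shaped = load_into_warehouse({"machine_id": "M7", "temp_c": 21.3})

# Schema-on-read (lake style): the raw string is stored untouched; its
# structure is only discovered when the data is read and interpreted.
raw_lake = ['{"machine_id": "M7", "temp_c": 21.3, "vibration": [0.1, 0.2]}']

def read_from_lake(raw: str) -> dict:
    return json.loads(raw)  # structure emerges at read time

event = read_from_lake(raw_lake[0])
print(sorted(event))  # → ['machine_id', 'temp_c', 'vibration']
```

The lake happily carries the extra `vibration` field that the warehouse schema would reject - which is precisely why exploratory analyses favor the lake, while consolidated reporting favors the warehouse.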
There are two major drivers for the use of data lakes in industry: the optimization of production and the offering of better or new products, in some cases even entirely new business models. The basic use case here is the "digital twin", i.e. the digital image of the machines produced in-house, connected to the data lake with near-real-time data. Two major obstacles have to be overcome in practice: the master data required for materials and components are stored in systems of different organizational units that do not yet communicate with one another automatically. In addition, different protocols are used at the technical level, so that communication components must first be retrofitted as a prerequisite for data availability.
The first-generation data lakes were systems based on the Apache Hadoop stack in the company's own data center. With these early platforms, the complexity of the technology, consisting of numerous open-source components, combined with the required timeliness, was challenging. Due to the changed market situation of commercial distribution providers and the general strategy of increased cloud use, this is shifting with second-generation data lakes: with native cloud services or dedicated managed Hadoop environments, the complexity of managing the base platform is massively reduced. The entry barrier has thus fallen, and today data lakes are viable for almost any company size.
However, the recommendation remains valid to deploy the technology only once a clear use case evaluation and prioritization on a roadmap has been defined as the cornerstone of the application.
Choosing the right technology
The decision on which components to use initially must be carefully considered, and alternatives on the market - commercial, open-source and cloud-service options - must be continuously scouted and evaluated in order to create optimal added value for the company.
When selecting components for one's own company, in addition to the functional requirements of industrial use, the protection of trade secrets from (global) competitors and legal aspects - such as operating the platform with data from countries whose laws restrict data exchange geographically - are in the foreground. A special challenge for machine manufacturers is accessing the data of their own machines in the customer context, since machines from different manufacturers are often used in combination and customers, in turn, do not disclose all data in order to protect their own company.
Another area of tension is the requirements of productive use cases versus the needs of data science users. Here, too, the approach has changed over time: initially, the attempt was made to build platforms that could serve all usage profiles - from providing an API for a customer portal with high response-time requirements to complex analytical queries. Splitting these workloads across separate technical platforms has since proven more practicable.
The key conditions in practice
When setting up a data lake initiative, key conditions for successful implementation are found in practice that resemble those of implementing a central data warehouse: a strong management decision to set up and use a central platform, and the resulting close cooperation - in many cases not yet established - between business departments and production IT, possibly also product development, are elementary. In this way, not only diverse data are brought together, but also knowledge about this data, such as the signals of individual sensors and the interpretation of their states as a system.
Last but not least, the operation of a data lake must be set up flexibly and holistically: a DevOps team that continuously develops the platform and keeps it running stably has proven to be best practice.
In conclusion, every Industry 4.0 initiative needs a data lake platform. The technological entry barrier has fallen, but well-founded planning of the architecture is still required. The basis should be a roadmap of prioritized use cases. To maximize the long-term added value, it is also important to create the necessary organizational prerequisites for the successful use of a data lake platform in addition to the technology.
Dr. CARSTEN DITTMAR