What is a Data Warehouse?
Data Warehouse is a collection of information and data derived from operational systems and external data sources.
A data warehouse is designed with the purpose of inducing business decisions by allowing data consolidation, analysis, and reporting at different aggregate levels.
Data is populated into the DW by extraction, transformation, and loading.
Data Warehousing incorporates data stores and conceptual, logical and physical models to support business goals and end-user information needs.
Creating a DW requires mapping data between sources and targets, then capturing the details of the transformation in a metadata repository.
The data warehouse provides a single, comprehensive source of current and historical information.
Characteristics of Data Warehouse
The characteristics of Data warehouse are
1. Subject-oriented
2. Integrated
3. Time-variant
4. Non-volatile collection of data
Subject-Oriented:
A data warehouse is subject-oriented as it offers information regarding a subject instead of companies ongoing operations. These subjects can be sales, marketing, distributions, etc.
A data warehouse never focuses on ongoing operations. Instead, it puts emphasis on modelling and analysis of data for decision making. It also provides a simple and concise view around the specific subject by excluding data which is not helpful to support the decision process.
Integrated:
In Data Warehouse, integration means the establishment of a common unit of measure for all similar data from the dissimilar database.
The data also needs to be stored in the Data warehouse in a common and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources like a mainframe, relational databases, flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure, etc. have to be ensured.
Consider the following example:
In the above example, there are three different applications labelled A, B and C. Information stored in these applications are Name of the user, Gender and Income. However, each application’s data is stored in a different way.
In Application A gender field store logical values like M or F, the name of the user is mentioned as user
In Application B gender field is a numerical value
In Application C application, income field is mentioned as salary
However, after the transformation and cleaning process, all this data is stored in a common format in the Data Warehouse.
Time-Variant:
The data collected in a data warehouse is recognized with a particular period and offers information from the historical point of view.
It contains an element of time, explicitly or implicitly.
One such place where Data warehouse data display time variance is in the structure of the record key.
Every primary key contained with the DW should have either implicitly or explicitly an element of time. Like the day, week month, etc.
Another aspect of time variance is that once data is inserted in the warehouse, it can’t be updated or changed.
Non-volatile:
Data warehouse is also non-volatile. It means the previous data is not erased when new data is entered into it. Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what and when it happened. It does not require transaction process, recovery and concurrency control mechanisms.
Activities like delete, update, and insert which are performed in an operational application environment are omitted in Data warehouse environment.
Only two types of data operations performed in the Data Warehousing are
1.Data loading
2.Data access
Data Warehouse Architecture
Operational systems: Capture the transactions of the business. We have little to no control over the content and format of the data in these operational legacy systems.
The main priorities of the source systems are processing performance and availability. Queries against source systems are narrow, one-record-at-a-time queries that are part of the normal transaction flow and severely restricted in their demands on the operational system.
Staging Area: The Data Warehouse Staging Area is a temporary location where data from source systems is copied.
A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.
Due to varying business and data processing cycles, hardware and network resource limitations and geographical factors, it is not feasible to extract all the data from all Operational databases at exactly the same time.
For example, it might be reasonable to extract sales data on a daily basis, however, daily extracts might not be suitable for financial data that requires a month-end reconciliation process.
Warehouse: The purpose of the warehouse is to store all the data in one place. The amount of data in the warehouse is massive. Data is stored at a very granular level of detail.
For example, every “sale” that has ever occurred in the organization is recorded and related to dimensions of interest.
This allows data to be sliced, summed and grouped in unimaginable ways.
Data Marts: Users don’t access the Data Warehouse directly.
This is done through various front-end tools that read data from subject-specific Data Marts. For example “Tableau”.
Importance of Data Warehousing
Ensures data consistency
Data warehouses store data from various sources, and that data is in multiple formats. Hence, they are programmed to apply ETL methods to ensure that the data is overall consistent.
Consistency is what makes data warehousing a perfect tool for corporate decision-makers to analyze and share data insights with their colleagues around the globe. Standardizing and formatting the data also reduces the risk of errors while data analysis; thereby providing overall better accuracy.
Facilitate better decisions
“First comes data, then theories.” A data warehouse allows organizations to store and retrieve data with ease thereby ensuring better theories and strategies around that data.
Data warehousing is also a lot faster regarding accessing different data sets and makes it easier to derive actionable insights.
Improve their bottom line
A data warehouse helps in the improvement of overall operations of any organization by allowing the stakeholders to dive into their historical data. This, eventually, enables business leaders to quickly track their organization’s past activities and evaluate successful (or unsuccessful) strategies. This allows executives to see where they can adjust their approach to decrease costs, maximize efficiency and increase sales to improve their bottom line.
References:
https://docs.oracle.com
https://www.1keydata.com/datawarehousing/concepts.html