A data warehouse is usually built by integrating data from multiple heterogeneous sources in order to support structured and ad-hoc queries, analytical reporting, and decision making. In this post, we will discuss the data warehouse concepts that you need to know.
What is a data warehouse?
A data warehouse is a type of data system that contains historical and cumulative data from one or more sources. It simplifies the process of reporting and analysis in an organization. A data warehouse also serves as a single version of truth for a company when it comes to decision making and forecasting.
What are the characteristics of a data warehouse?
A data warehouse is subject-oriented, integrated, time-variant, and non-volatile.
A data warehouse is subject-oriented because it provides information about a particular subject rather than about a company’s ongoing operations. Subjects can include sales, marketing, distribution, and so on.
A data warehouse never focuses on ongoing operations. Instead, it puts the emphasis on modeling and analysis of data for decision making. It also provides a simple and concise view of a specific subject by excluding data that does not help support the decision process.
In a data warehouse, integration means establishing a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner.
A data warehouse is developed by integrating data from varied sources such as mainframes, relational databases, flat files, etc. This integration is what makes effective analysis of the data possible, so naming conventions, formats, encoding structures, and attribute measures must be kept consistent.
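As a rough illustration of what integration means in practice, the sketch below maps rows from two source systems onto one naming convention, one gender encoding, and one unit of measure. The source systems, field names, and code mappings are all assumptions made up for this example.

```python
# Hypothetical source records: each system uses its own field names,
# gender codes, and weight units.
crm_rows = [{"cust_name": "Ana", "sex": "F", "weight_lb": 150.0}]
erp_rows = [{"customer": "Ben", "gender": "male", "weight_kg": 80.0}]

LB_TO_KG = 0.45359237  # definition of the international pound

# Assumed mapping of every source encoding to one warehouse encoding.
GENDER_CODES = {"F": "F", "female": "F", "M": "M", "male": "M"}


def from_crm(row):
    """Map a CRM row to the warehouse's common format."""
    return {
        "customer_name": row["cust_name"],
        "gender": GENDER_CODES[row["sex"]],
        "weight_kg": round(row["weight_lb"] * LB_TO_KG, 2),
    }


def from_erp(row):
    """Map an ERP row to the warehouse's common format."""
    return {
        "customer_name": row["customer"],
        "gender": GENDER_CODES[row["gender"]],
        "weight_kg": round(row["weight_kg"], 2),
    }


integrated = [from_crm(r) for r in crm_rows] + [from_erp(r) for r in erp_rows]
```

After this step every row has the same keys and units, so downstream analysis does not need to know which system a row came from.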
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is identified with a specific period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly.
One place where data warehouse data displays time variance is in the structure of the record key. Every primary key in the data warehouse should contain, either implicitly or explicitly, an element of time, such as the day, week, or month.
Another aspect of time variance is that once data is inserted into the warehouse, it cannot be updated or changed.
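One way to picture a record key that carries an element of time is the minimal sketch below. The `SalesKey` class, its fields, and the figures are hypothetical; a real warehouse would express the same idea as a composite key in its table design.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SalesKey:
    """Hypothetical warehouse key: the business key plus a time element."""
    product_id: int
    snapshot_date: date  # explicit element of time


# Each snapshot is stored under a new key; earlier rows are kept as they are.
sales = {}
sales[SalesKey(42, date(2024, 1, 31))] = 120  # units sold in January
sales[SalesKey(42, date(2024, 2, 29))] = 135  # units sold in February

# The history of product 42, ordered by time.
history = sorted(
    (key.snapshot_date, units)
    for key, units in sales.items()
    if key.product_id == 42
)
```

Because the date is part of the key, the February figure does not overwrite the January one, which is what makes historical comparison possible.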
A data warehouse is also non-volatile, which means that previous data isn’t erased when new data is entered.
Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms.
Activities such as delete, update, and insert, which are performed in an operational application environment, are omitted in a data warehouse environment.
Only two types of data operations are performed in a data warehouse environment. They are:
- Data loading
- Data access
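The two operations above can be sketched as a toy table that exposes nothing but a load method and a query method. The class and its method names are invented for this illustration and do not correspond to any particular product.

```python
class ReadOnlyWarehouseTable:
    """Toy table that supports only data loading and data access."""

    def __init__(self):
        self._rows = []

    def load(self, batch):
        """Data loading: append a batch; existing rows are left untouched."""
        self._rows.extend(dict(row) for row in batch)

    def query(self, predicate=lambda row: True):
        """Data access: return copies of the rows that match the predicate."""
        return [dict(row) for row in self._rows if predicate(row)]


table = ReadOnlyWarehouseTable()
table.load([{"region": "EU", "year": 2023, "revenue": 10}])
table.load([{"region": "EU", "year": 2024, "revenue": 12}])

eu_2023 = table.query(lambda r: r["year"] == 2023)
```

Returning copies from `query` is a deliberate choice in this sketch: callers can modify what they receive without changing the stored history.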
What are the types of Data Warehouse Architecture?
There are three types of data warehouse architecture: single-tier, two-tier, and three-tier.
The objective of a single-tier architecture is to minimize the amount of data stored by removing data redundancy. This type of architecture is rarely used.
A two-tier architecture physically separates the available sources from the data warehouse. This architecture isn’t expandable and doesn’t support a large number of end users. It also has connectivity problems due to network limitations.
The three-tier architecture is the most widely used.
It consists of the top, middle, and bottom tiers.
Bottom tier: The database of the data warehouse serves as the bottom tier. It is usually a relational database system. Back-end tools clean, transform, and load data into this layer.
Middle tier: The middle tier in a data warehouse is an OLAP server, implemented using either the ROLAP or the MOLAP model. This application tier presents an abstracted view of the database to the user. It also acts as a mediator between the end user and the database.
Top tier: The top tier is a front-end client layer. It consists of the tools and APIs that you use to connect to the data warehouse and get data out of it. These can be query tools, reporting tools, managed query tools, analysis tools, and data mining tools.
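To make the division of labour between the three tiers more concrete, here is a minimal sketch that uses Python’s built-in `sqlite3` module as a stand-in for the bottom-tier database. The `sales` table, the `total_by` function, and the `report` function are assumptions for this example; a real middle tier would be a dedicated OLAP server rather than a single function.

```python
import sqlite3

# Bottom tier: a relational database holding the loaded data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EU", 2023, 100.0), ("EU", 2024, 150.0), ("US", 2024, 200.0)],
)


# Middle tier: presents an abstracted, aggregated view of the database,
# loosely in the ROLAP style (the aggregation is pushed down to SQL).
def total_by(dimension):
    if dimension not in {"region", "year"}:  # only known column names
        raise ValueError(f"unknown dimension: {dimension}")
    cursor = db.execute(
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"GROUP BY {dimension} ORDER BY {dimension}"
    )
    return dict(cursor.fetchall())


# Top tier: a front-end tool that only talks to the middle tier.
def report(dimension):
    return [f"{key}: {value:.2f}" for key, value in total_by(dimension).items()]


lines = report("region")
```

The front end never issues SQL itself; it asks the middle tier for totals by a dimension, which mirrors the mediator role described above.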
What are the components of a data warehouse?
The data warehouse is based on an RDBMS server, a central information repository that is surrounded by some key components that make the whole environment functional, manageable, and accessible.
A data warehouse mainly consists of five components:
Data Warehouse Database
This database is implemented on RDBMS technology and is the foundation of a data warehouse. However, this kind of implementation is constrained by the fact that a traditional RDBMS is optimized for transactional database processing and not for data warehousing. For instance, ad-hoc queries, multi-table joins, and aggregates are resource-intensive and slow down performance.
For this reason, the following alternative approaches to the database are used:
- In a data warehouse, relational databases are deployed in parallel to allow for scalability. Parallel relational databases also allow shared-memory or shared-nothing models on various multiprocessor configurations or massively parallel processors.
- New index structures are used to bypass relational table scans and improve speed.
- Multidimensional databases (MDDBs) are used to overcome limitations imposed by the relational data model. Example: Essbase from Oracle.
Sourcing, Acquisition, Clean-up and Transformation Tools (ETL)
The data sourcing, transformation, and migration tools perform all the conversions, summarizations, and other changes needed to transform data into a unified format in the data warehouse. They are also called Extract, Transform and Load (ETL) tools.
Their functionality includes:
- Anonymizing data as per regulatory stipulations.
- Preventing unwanted data in operational databases from being loaded into the data warehouse.
- Searching for and replacing common names and definitions for data arriving from different sources.
- Calculating summaries and derived data.
- Populating missing data with defaults.
- De-duplicating repeated data arriving from multiple data sources.
These Extract, Transform, and Load tools may generate cron jobs, background jobs, COBOL programs, shell scripts, etc. that regularly update the data in the data warehouse. These tools also help to maintain the metadata.
ETL tools have to deal with the challenges of database and data heterogeneity.
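Several of the ETL functions listed above can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the row layout, the `is_test` flag used to mark unwanted rows, and the use of a truncated SHA-256 digest as a stand-in for anonymization (strictly speaking this is pseudonymization, and real regulatory requirements may demand more).

```python
import hashlib

# Hypothetical rows extracted from operational sources.
extracted = [
    {"email": "ana@example.com", "country": "DE", "amount": 10.0, "is_test": False},
    {"email": "ana@example.com", "country": "DE", "amount": 10.0, "is_test": False},
    {"email": "ben@example.com", "country": None, "amount": 5.0, "is_test": False},
    {"email": "qa@example.com", "country": "US", "amount": 1.0, "is_test": True},
]


def anonymize(value):
    """Replace an identifier with a truncated SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def transform(rows, default_country="UNKNOWN"):
    seen = set()
    cleaned = []
    for row in rows:
        if row["is_test"]:  # keep unwanted operational data out
            continue
        record = (
            anonymize(row["email"]),
            row["country"] or default_country,  # default for missing data
            row["amount"],
        )
        if record in seen:  # de-duplicate rows that are exact repeats
            continue
        seen.add(record)
        cleaned.append(
            {"customer": record[0], "country": record[1], "amount": record[2]}
        )
    return cleaned


def summarize(rows):
    """Derived data: total amount per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


loaded = transform(extracted)
totals = summarize(loaded)
```

Treating exact repeats as duplicates is a simplification; a real pipeline would usually de-duplicate on a business key, since two genuine transactions can share the same values.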
What does metadata mean in a data warehouse?
Whenever we add the prefix ‘meta’ to anything, we tend to assume it is an esoteric, avant-garde concept. However, it is quite simple. Metadata is data about data that defines the data warehouse. It is used for building, maintaining, and managing the data warehouse.
In the data warehouse architecture, metadata plays an important role because it specifies the source, usage, values, and features of data warehouse data. It also defines how data can be changed and processed. It is closely connected to the data warehouse.
Metadata is an essential ingredient in the transformation of data into knowledge.
Metadata helps to answer the following questions:
- What are the keys, attributes, and tables contained within the data warehouse?
- Where did the data come from?
- How many times does the data get reloaded?
- What transformations were applied during cleansing?
Metadata can be classified into the following categories:
Technical metadata: This type of metadata holds information about the warehouse. It is mainly used by data warehouse administrators and designers.
Business metadata: This type of metadata contains details that give end users an easy way to understand the information stored in the data warehouse.
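A small sketch can show how metadata answers questions such as where a column came from and which transformations were applied. The `ColumnMetadata` record, its fields, and the catalog entries below are hypothetical and are not taken from any specific metadata tool.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    """Metadata for one warehouse column (hypothetical fields)."""
    table: str
    column: str
    source: str  # technical metadata: where the data came from
    transformations: list = field(default_factory=list)
    description: str = ""  # business metadata aimed at end users


catalog = [
    ColumnMetadata(
        table="sales",
        column="amount_eur",
        source="erp.orders.total",
        transformations=["convert currency to EUR", "round to 2 decimals"],
        description="Order value in euros",
    ),
    ColumnMetadata(table="sales", column="region", source="crm.accounts.region"),
]


def lineage(table, column):
    """Answer: where did this column come from, and what was done to it?"""
    for entry in catalog:
        if (entry.table, entry.column) == (table, column):
            return entry.source, entry.transformations
    raise KeyError(f"{table}.{column} is not in the catalog")


source, steps = lineage("sales", "amount_eur")
```

The same record holds both kinds of metadata here: `source` and `transformations` serve administrators and designers, while `description` is the part an end user would read.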
What are query tools in a data warehouse?
One of the primary objectives of data warehousing is to provide information that businesses can use to make strategic decisions. Query tools allow users to interact with the data warehouse system.
These tools fall into four different categories:
1. Query and reporting tools:
Query and reporting tools can be further divided into reporting tools and managed query tools.
Reporting tools: Reporting tools can be further divided into production reporting tools and desktop report writers.
Report writers: These reporting tools are designed for end users to carry out their own analysis.
Production reporting: These tools allow organizations to generate regular operational reports. They also support high-volume batch jobs such as printing and calculating. Some popular reporting tools are Brio, Business Objects, Oracle, PowerSoft, and SAS Institute.
Managed query tools: These access tools help end users avoid the difficulties of SQL and database structure by inserting a meta-layer between the users and the database.
2. Application development tools:
Sometimes built-in graphical and analytical tools do not satisfy the analytical needs of an organization. In such cases, custom reports are developed using Application development tools.
3. Data mining tools:
Data mining is the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data. Data mining tools are used to automate this process.
4. OLAP tools:
These tools are based on the concepts of a multidimensional database. They allow users to analyse the data using elaborate and complex multidimensional views.
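The idea of a multidimensional view can be sketched with two classic OLAP operations, roll-up and slice, written in plain Python. The fact rows, dimension names, and helper functions are assumptions for this example; real OLAP tools run these operations against a multidimensional or relational engine.

```python
from collections import defaultdict

# Hypothetical fact rows with three dimensions and one measure.
facts = [
    {"region": "EU", "product": "A", "year": 2023, "units": 10},
    {"region": "EU", "product": "B", "year": 2023, "units": 4},
    {"region": "EU", "product": "A", "year": 2024, "units": 7},
    {"region": "US", "product": "A", "year": 2024, "units": 9},
]


def roll_up(rows, dimensions, measure="units"):
    """Aggregate the measure over the chosen dimensions."""
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(cube)


def slice_by(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dimension] == value]


by_region_year = roll_up(facts, ["region", "year"])
by_region = roll_up(facts, ["region"])
eu_by_product = roll_up(slice_by(facts, "region", "EU"), ["product"])
```

Each result is keyed by a tuple of dimension values, so the same facts can be viewed by region and year, by region alone, or by product within a single region.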