The Burden of Managing a Spaghetti Architecture
Author : Wriddhi Majumder
Analytics is Important and so are the Storages
Today, every organization are doing something or other in the line of analytics, AI, ML, etc. to gain strategic advantages from the insights. Data being the fuel for all these, it is important for these organizations to build data storages meant for analytical use cases. There are multiple options that the organizations typically opt for such as Data Warehouse, Data Marts, etc. to build the OLAP storages for analytical purposes. Having multiple independent Data Marts is a common practice due to the following reasons:
- Data Mart is focused on a particular function or department. Hence, when a particular department gets interested in implementing a dashboard, having a data mart is a convenient option
- The design process of the Data Mart is easy and simple. Building independent data marts are less expensive than architected data warehousing systems. This is one of the major reasons why building independent data marts is a common practice.
- The implementation timeline of a data Mart is much shorter than that of an EDW. An Independent data mart can be constructed fairly quickly and do not require a company to really understand their data beyond that of individual departments as a data warehouse requires.
Unfortunately, it is this lack of thorough analysis and long-term planning that limits the independent data marts from being an effective business intelligence system. With time, the number of data marts increases and the complexity starts increasing exponentially. With multiple data sources, and multiple departments using the dashboards and analytics, the overall architecture becomes complicated and inefficient. Here lies the importance and benefits of having an Enterprise Data Warehouse, which provides an organization-wide standardized data platform meant for analytics. This blog will explain some of the challenges associated with independent data marts and the migration steps towards an EDW.
What are the Challenges with Independent Data Marts?
Implementation of a data mart is an easy way to kick start an analytics project for any organization. Especially, during the early stages, where the number of data marts are limited; it seems an easy way to design and develop the data marts. However, Data Mart is often considered as a short-cut approach, which increases complexity once the organization becomes matured in terms of data usage. Following are the Issues that are generated due to increase of Data Marts:
Replication of Data
The amount of redundant data begins to grow uncontrollably across the enterprise as the number of independent data marts grow. The major reason behind the redundancy because each of the independent data marts requires its own, typically duplicated copy of the detailed corporate data. Often a huge percentage of this detailed data remains unutilized in the data marts, which typically provide summarized views. The unwanted and redundant data piles up to GBs and TBs which increases the complexity and cost unnecessarily. This has an impact on the IT Budgets, process efficiency as well as data management.
An enterprise data warehouse provides the architecture to centralize the data integration and cleansing activities common to all of the data marts of a company. Without the data warehouse all of these data integration and cleansing processes need to be duplicated for all of the independent data marts. This greatly increases the maintenance overhead of data marts which increases complexities, probabilities of errors, lack of standardization. Also, this requires for more effort and hence more support staff to maintain the data systems. Such complex and duplicated data system typically results into lots of reworks and reanalysis.
The volume of data is continuously growing. Hence the scalability of the OLAP system is of extreme importance. The modern data warehouses are easily scalable in terms of both storage and compute. On the other hand, a traditional data mart often faces issues in scaling the compute and storage; which impacts the performance and SLAs directly.
There is a limitation in terms of resourcing factor of data marts as Independent data marts directly read operational system files and/or tables. For example, if a company has eight independent data marts each of which takes customer information from the CRM database. Therefore, there would be eight separate extracts being pulled off of the same customer tables of the CRM in the operational system of record. Most operational systems have limited batch windows and cannot support this number extracts. With a data warehouse only one extract is required in the operational system of record.
Each independent data mart is built by autonomous teams, typically working for separate departments. As a result, these data marts are not integrated and none of them contain an enterprise view of the corporation. Therefore, there are many cases where data marts can provide different answers as they are not representing a ‘single source of truth’.
You are going to end up with a Spaghetti Architecture
An example of Independent Data Mart Architecture
As discussed, the data architecture including independent data marts become increasingly complex and redundancy creeps in. The following image represents an example of ‘Spaghetti’ architecture created due to independent data marts
Ideal Data Warehouse Architecture
The ideal way is to opt for an Enterprise Data Warehouse which will act as a central repository. Proper planning and architecture need to be done for building an EDW. An indicative architecture has been shown below:
Hence it is essential that the independent data mart structure needs to be migrated to a centralized data warehouse structure.
The big Question: How to migrate?
There are two general approaches for migration; “Big Bang” and “Iterative”. The pros and cons of both the methods are as below:
Big Bang Approach
All of the independent data marts will be reengineered simultaneously into a structured data warehousing architecture. There are a couple of advantages to this approach. First, it can provide the fastest path for migration. Often companies will need to change their data warehousing architecture as quickly as possible (regardless of risk) because
- There is a need to implement additional data warehousing projects that are required to meet government regulations
- There are data warehousing projects that promise to lend a high return on investment (ROI) and/or
- There are currently funds available for the integration effort that might not be available at a later date.
Second, this approach allows for immediate economies of scale rather than slowly attaining them in Iterative method.
The disadvantages to this approach are that it is labor intensive and requires tremendous coordination. In addition, the “Big Bang” approach is the more complexed of the two to implement and thus provides the highest exposure. This approach is best suited when the independent data mart problem is relatively small and not highly complexed. However, when the problem is large the complexity of the migration grows at a tremendous rate.
This approach looks to reengineer the independent data marts (one or two data marts at a time) in manageable phases. The advantages to this approach are several. First, it allows a government agency or company to manage and reduce the risk involved in a migration effort. This occurs because the migration can be accomplished in a phased manner thereby increasing the probability of the project’s success. Second, as each project phase is executed lessons are learned and leveraged for subsequent phases. This is very valuable as typically once the first phase is completed the follow up phases run much more smoothly. The major disadvantage to this approach is that it takes longer to fully complete the migration. This approach is best used when the independent data mart problem is large and too complexed to tackle in a “Big Bang” manner.
So… Which approach to be followed?
Before deciding which approach should you follow, it is essential to have a quick look at the pros and cons of both the approaches
|Big Bang Approach||Provides the potentially fastest path for migration
Allows immediate economies of scales
|Labour IntensiveProper coordination is required
Complex parallel testing
Risk is high
|Iterative Approach||The risk is pretty low.
The organizations can leverage the lessons learned
Does provide the eventual economies of scale
|Potential migration time is elongated
Multiple development efforts need to be managed and coordinated
The Stages of Migration
The typical stages and activities associated with a data mart migration are as below:
Initial Planning and Project Champion
Initial planning is extremely important for a migration effort to become successful. Proper budgeting is one of the most important tasks at the onset of the project. Also, the migration project should be driven by a Project Champion (or a central group). This is critical as typically each of the independent data marts have been constructed by autonomous teams in different corporate departments. Therefore, having a project champion that has cross-departmental authority is critical for dealing with the organizational challenges, which are common in these migration efforts.
Implementation of Metadata Repository
During the initial planning phases, it is important to plan on implementing a meta data repository that can support future data warehousing development efforts and that will provide a semantic layer between the business users and the data warehousing system. The data mart migration provides an outstanding opportunity to implement the meta data repository. Before the data mart migration begins it is best to standardize the data naming nomenclature for the data warehousing system. By implementing standard data naming nomenclature, it will aid in the system’s maintenance and provide cleaner and more understandable meta data.
As Is Study
A great deal of research needs to be conducted on the independent data marts before a migration is possible. The most important research activity is to understand the business needs that each independent data mart is meeting. Typically, multiple independent data marts will exist to meet the same or similar business needs. These situations are common and do suggest a path for migration. The results of this research will illustrate the independent data marts that will be the most difficult to migrate.
Tasks to be performed for As-Is Study of Independent Data Mart
- Available Meta Data (both Technical and Business)
- To identify the ROI of the Data Mart
- Amount of data (raw and curated) in each data mart
- Data Refresh/ Update criteria
- Archival Policy
- Identifying the key users of the data mart
- Understanding the business requirement from the data mart
- Identification of Transformation Rules
- Identification of the Error checking and data cleansing mechanism
- How is the ETL Process constructed
Software/ Hardware Classification
During independent data mart migration, it is an excellent time to standardize on hardware, and software for the data warehousing project. For each differing software or hardware platform a company needs to have trained personnel to support it. Therefore, by limiting the redundant software/hardware the corporation reduces the support strain on their IT staff. In addition, standardizing allows for software and hardware purchasing economies of scale can be achieved. Some of the things that need to be finalized are as below:
- Hardware (Unix, Mainframe, Cloud)
- ETL Tools (Informatica, Talend, etc.)
- Environment (Cloud, On-Prem)
- Data Quality Tool
- Data base
Create the existing Independent Data Mart Chart
First, diagram out the current data mart architecture. This is critical for identifying which legacy systems are feeding which independent data marts and for showing the problems with this architecture (or lack thereof).
Identify Redundant Data
Often independent data marts will be sourced from the same legacy systems. By targeting independent data marts with the same source data often multiple independent data marts can be removed with minimal extra effort. Identifying redundant data often suggests a migration path. For example, the following architecture shows that both Finance and Marketing Data Marts are created from two source systems. Hence, in an iterative approach, the organization can target these two data marts first for migration.
The Identification Process
Most important Dataset: It is important to target those independent data marts whose data will most likely be used in future data warehousing efforts. By targeting these data marts first, it will ease the task of keeping all new data warehousing development activity in the new architected environment.
Well Documented/ Managed Data Marts: The next step is to identify those data marts whose transformation rules are known and documented (although even the best documented transformation rules will have gaps).
Organization-wide priority: It will be critical to obtain support from the current independent data mart IT teams and business users. Identify those departments most likely to work cooperatively with the centralized data warehousing team.
Phase-wise Migration Process
Once the previous steps are finalized, the data marts can be migrated in the finalized data warehouse architecture in a phased manner. The design document should be prepared before the actual migration, which will have the blueprint of the data warehouse for this phase. Once, the migration is done, necessary testing has to be done to ensure that the result is as per the requirement. Once, the migrated model is good to go, the independent data marts should be discontinued. Once, the new architecture gets stable, the next phase can be considered where one or more data marts can be replaced and migrated to the existing data warehouse.
A Strategic Decision
Migration of data marts to a centralized data warehouse is a strategic decision which needs time, effort, planning, and money. Organizations should be careful to not take any short-cut approach in this regard; or else they will end up replicating the same problem that they are trying to address. Thorough As-Is study, To-Be Study, Gap Analysis, Roadmap preparation, Data Warehouse Architecture Design, etc. should be carried out for such Migration Project.