Data governance is a term used on both a macro and a micro level. The former is a political concept and forms part of international relations and Internet governance; the latter is a data management concept and forms part of corporate data governance.
Data governance is a collection of processes, roles, policies, standards, and metrics that ensure the effective and efficient use of information in enabling an organization to achieve its goals. It establishes the processes and responsibilities that ensure the quality and security of the data used across a business or organization. Data governance defines who can take what action, upon what data, in what situations, using what methods.
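The "who, what action, what data, what situation, what method" definition can be pictured as a rule lookup. The following is a minimal Python sketch under stated assumptions, not a reference implementation: the `PolicyRule` type, the `is_allowed` helper, and every role, dataset, and method name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """One governance rule (hypothetical structure, for illustration)."""
    role: str        # who
    action: str      # can take what action
    dataset: str     # upon what data
    situation: str   # in what situation
    method: str      # using what method


# Hypothetical rules; a real organization would define its own.
RULES = {
    PolicyRule("analyst", "read", "customer_orders", "reporting", "sql_view"),
    PolicyRule("steward", "update", "customer_orders", "data_correction", "admin_console"),
}


def is_allowed(role, action, dataset, situation, method):
    """Return True only if an explicit rule permits the request (default deny)."""
    return PolicyRule(role, action, dataset, situation, method) in RULES


print(is_allowed("analyst", "read", "customer_orders", "reporting", "sql_view"))    # True
print(is_allowed("analyst", "update", "customer_orders", "reporting", "sql_view"))  # False
```

The sketch denies anything not explicitly listed. That is one possible design choice; the text above does not prescribe it.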
A well-crafted data governance strategy is fundamental for any organization that works with big data, and will explain how your business benefits from consistent, common processes and responsibilities. Business drivers highlight what data needs to be carefully controlled in your data governance strategy and the benefits expected from this effort. This strategy will be the basis of your data governance framework.
Macro level
On the macro level, data governance refers to the governing of cross-border data flows by countries, and hence is more precisely called international data governance. This new field consists of “norms, principles and rules governing various types of data.”
Micro level
Here the focus is on an individual company. At this level, data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and that data controls are implemented that support business objectives. The key focus areas of data governance include availability, usability, consistency, data integrity, and data security. It also includes establishing processes to ensure effective data management throughout the enterprise, such as accountability for the adverse effects of poor data quality and ensuring that the data an enterprise holds can be used by the entire organization.
A data steward is a role that ensures that data governance processes are followed and that guidelines are enforced, and that recommends improvements to data governance processes.
Data governance encompasses the people, processes, and information technology required to create a consistent and proper handling of an organization’s data across the business enterprise. It provides all data management practices with the necessary foundation, strategy, and structure needed to ensure that data is managed as an asset and transformed into meaningful information. Goals may be defined at all levels of the enterprise, and doing so may aid in acceptance of processes by those who will use them. Some goals include:
- Increasing consistency and confidence in decision making
- Decreasing the risk of regulatory fines
- Improving data security, including defining and verifying the requirements for data distribution policies
- Maximizing the income generation potential of data
- Designating accountability for information quality
- Enabling better planning by supervisory staff
- Minimizing or eliminating re-work
- Optimizing staff effectiveness
- Establishing process performance baselines to enable improvement efforts
- Acknowledging and holding all gains
Benefits of Data Governance
An effective data governance strategy provides many benefits to an organization, including:
A common understanding of data: Data governance provides a consistent view of, and common terminology for, data, while individual business units retain appropriate flexibility.
Improved quality of data: Data governance creates a plan that ensures data accuracy, completeness, and consistency.
Data map: Data governance provides an advanced ability to understand the location of all data related to key entities, which is necessary for data integration. Like a GPS that represents a physical landscape and helps people find their way in unfamiliar territory, data governance makes data assets usable and easier to connect with business outcomes.
A 360-degree view of each customer and other business entities: Data governance establishes a framework so an organization can agree on “a single version of the truth” for critical business entities and create an appropriate level of consistency across entities and business activities.
Consistent compliance: Data governance provides a platform for meeting the demands of government regulations, such as the EU General Data Protection Regulation (GDPR) and the US Health Insurance Portability and Accountability Act (HIPAA), and industry requirements such as the Payment Card Industry Data Security Standard (PCI DSS).
Improved data management: Data governance brings the human dimension into a highly automated, data-driven world. It establishes codes of conduct and best practices in data management, making certain that concerns and needs beyond traditional data and technology areas, such as legal, security, and compliance, are addressed consistently.
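The "improved quality of data" benefit names accuracy, completeness, and consistency. As a rough illustration, the Python sketch below measures completeness, flags duplicate keys, and checks one consistency rule. The records, field names, and the upper-case country-code rule are assumptions made for the example. Accuracy is left out because it would require a trusted reference source to compare against.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com", "country": "DE"},
    {"id": 2, "email": None, "country": "de"},
    {"id": 3, "email": "li@example.com", "country": "FR"},
    {"id": 3, "email": "li@example.com", "country": "FR"},
]


def completeness(rows, field):
    """Share of rows in which the field is present and not None."""
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)


def duplicate_ids(rows, key="id"):
    """Return key values that occur more than once, sorted."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def inconsistent_values(rows, field, is_valid):
    """Return non-missing values of the field that fail the agreed format check."""
    return [
        row[field]
        for row in rows
        if row.get(field) is not None and not is_valid(row[field])
    ]


print(completeness(records, "email"))                         # 0.75
print(duplicate_ids(records))                                 # [3]
print(inconsistent_values(records, "country", str.isupper))   # ['de']
```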
Data policies and procedures
A data governance policy is a documented set of guidelines for ensuring that an organization’s data and information assets are managed consistently and used properly. Such guidelines typically include individual policies for data quality, access, security, privacy and usage, as well as roles and responsibilities for implementing those policies and monitoring compliance with them.
A data governance policy should articulate the principles, practices and standards that organizational leaders have determined necessary to ensure the organization has high-quality data and that its data assets are protected. The policy-forming group, called a data governance committee or data governance council, is primarily made up of business executives and other data owners.
The policy document this group creates clearly defines the data governance structure for the executive team, managers and line workers to follow in their daily operations.
A data governance policy formally outlines how data processing and management should be carried out to ensure organizational data is accurate, accessible, consistent and protected. The policy also establishes who is responsible for information under various circumstances and specifies what procedures should be used to manage it. In addition, it can incorporate risk management and data ethics principles to reduce potential business problems from the use of data.
Data governance is not a new concept by any stretch of the imagination, but it has come into sharp focus as the world’s data footprint continues to grow exponentially. Today, organizations not only must adhere to strict data policies and regulations (e.g. the Sarbanes-Oxley Act, the Basel Accords, HIPAA, GDPR, and rules set by government agencies), but they’re also looking to build a data governance strategy to better manage and properly safeguard their data as a valuable organizational asset.
Forces:
Maintenance
Nobody likes to talk about maintenance because, much like data governance, it’s not new. However, it still has its place, marking the difference between being organized and disorganized, between saving time and resources and wasting them and losing opportunities. All existing data in an organization’s infrastructure must be maintained; the more you have, the more effort it takes to maintain it.
Risk Beyond Regulation
In addition to policy risk and regulations that require companies to safeguard certain data in a specific way, organizations now face the risk that their most valuable possession, their data, isn’t being properly handled. Access rights may be too lenient, there might be no data lineage, or they simply don’t know what exists in their infrastructure. With customer data being the most valuable asset for successful targeted marketing campaigns, it’s clear these three types of risks can have real repercussions.
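The three risks named above (overly lenient access, missing lineage, and unknown data) can be checked against a data catalog. The Python sketch below is one hypothetical way to do that. `CatalogEntry`, `MAX_READERS`, the `audit` function, and all dataset names are assumptions made for illustration and do not come from any particular catalog tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """A documented dataset (hypothetical structure)."""
    name: str
    readers: List[str] = field(default_factory=list)  # roles with read access
    upstream: Optional[List[str]] = None  # source datasets; None = lineage not documented


MAX_READERS = 3  # assumed threshold; a real policy would set its own


def audit(catalog, discovered_tables):
    """Return (dataset, finding) pairs for the three risks."""
    findings = []
    for entry in catalog:
        if len(entry.readers) > MAX_READERS:
            findings.append((entry.name, "access too broad"))
        if entry.upstream is None:
            findings.append((entry.name, "no lineage recorded"))
    # Tables found in the infrastructure but absent from the catalog.
    known = {entry.name for entry in catalog}
    for table in sorted(set(discovered_tables) - known):
        findings.append((table, "not in catalog"))
    return findings


catalog = [
    CatalogEntry("customers", ["marketing", "sales", "support", "finance"], ["crm_export"]),
    CatalogEntry("orders", ["finance"], None),
]
discovered = ["customers", "orders", "tmp_backup_2019"]

for dataset, finding in audit(catalog, discovered):
    print(f"{dataset}: {finding}")
```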
Some incremental steps to get your process on the right track:
- Determine the senior leaders you trust with creating/updating your policy. Generally, this will include at least a senior leader from IT, business, and management, sharing knowledge from different areas of expertise.
- The data governance team should assess all areas of operational risk with respect to the data and come up with a plan for using your existing data.
- Determine the plan and implementation strategy with the operational risks clearly communicated and addressed. If you’re implementing a new policy, I highly recommend determining how you can automate the entire process. This should also include a plan for maintaining all systems and their data.
- Implement the changes and put your governance strategy into practice.
- Re-assess and change course, if needed.
Life cycle of data
The lifecycle of data starts with a researcher or a team creating a concept for a study, and the data for that study is then collected once a study concept is established. After data is obtained, it is prepared for distribution to be archived and used by other researchers at a future stage. When data enters the distribution point of the life cycle, it is contained in a location where other researchers can then discover it.
The five stages are as follows:
- Obtaining the Data: This stage involves using technical skills, such as querying a MySQL database, to retrieve and generate the data. The data can also come in simpler file formats such as Microsoft Excel. Languages such as Python and R can import datasets directly into a data science program.
- Scrubbing the Data: This stage involves cleaning raw data to retain only the relevant part of the processed data. The noise is also scrubbed off, and the data is refined, converted, and consolidated.
- Exploring the Data: This stage consists of examining the generated data. The data and its properties are inspected since different data types demand specific treatments. Descriptive statistics are then computed to extract the features and test the significant variables.
- Modelling the Data: The dataset is refined further, and only the essential components are kept. Only relevant values are kept and tested to predict accurate results.
- Interpreting the Data: At this stage, the final product is interpreted for the client or business to determine whether it meets the requirement or answers a business question. The results of the final stage are visualized, and the insights are shared with everyone.
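The five stages above can be sketched end to end with the Python standard library. This is only an illustration under stated assumptions: the `RAW` dataset, its column names, and the choice of a least-squares line as the model are invented for the example, and the text does not prescribe any particular modelling technique.

```python
import csv
import io
import statistics

# Hypothetical raw data with one missing and one malformed value.
RAW = """ad_spend,sales
10,25
20,45
,30
30,65
abc,70
40,85
"""


def obtain(text):
    """Stage 1: load the raw rows (here from CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def scrub(rows):
    """Stage 2: keep only rows whose values parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["ad_spend"]), float(row["sales"])))
        except (TypeError, ValueError):
            continue  # drop noise: missing or non-numeric values
    return clean


def explore(pairs):
    """Stage 3: compute simple descriptive statistics."""
    xs, ys = zip(*pairs)
    return {"n": len(pairs), "mean_x": statistics.mean(xs), "mean_y": statistics.mean(ys)}


def fit_line(pairs):
    """Stage 4: fit y = slope * x + intercept by ordinary least squares."""
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def interpret(slope, intercept):
    """Stage 5: phrase the result for a business audience."""
    return (f"Each extra unit of ad spend is associated with about "
            f"{slope:.1f} more units of sales (baseline {intercept:.1f}).")


pairs = scrub(obtain(RAW))
print(explore(pairs))
slope, intercept = fit_line(pairs)
print(interpret(slope, intercept))
```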
Requirements
Data Generation and Understanding: The available data that can be used and the data that needs to be generated are analyzed and discussed. This is one of the fundamental data science life cycle steps, as it deals with understanding the data requirement and gathering the data.
Data Preparation: This part of the process deals with preparing the raw data by cutting out the noise and irrelevant information. This is a time-consuming process because it involves cleaning and fine-tuning the datasets so that only relevant data that won’t corrupt the model remains.
Modeling of Project: The project is modeled, and different variations are tried out before deciding upon the final one with statistical and analytical means.
Evaluation of Model: This stage deals with finding out whether the model is good enough before deployment. The model is checked to see whether it can tackle a business problem or serve the business requirement.
Deployment of Model and Communication: The model is deployed and monitored. Basic information about the model’s optimization and maintenance is communicated.
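The "Evaluation of Model" step asks whether the model is good enough before deployment. One possible reading, sketched below in Python, is a simple gate: the model must beat a naive baseline and stay within an error budget. The mean-absolute-error metric, the mean-only baseline, the `max_error` threshold, and the sample numbers are all assumptions for illustration. The baseline is computed from the same held-out values, which is a simplification.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def good_enough(actual, model_preds, max_error):
    """Hypothetical deployment gate.

    The model passes only if it beats a baseline that always predicts
    the mean of the actual values, and its error is within max_error.
    """
    baseline = [sum(actual) / len(actual)] * len(actual)
    model_mae = mean_absolute_error(actual, model_preds)
    baseline_mae = mean_absolute_error(actual, baseline)
    return model_mae < baseline_mae and model_mae <= max_error


# Hypothetical held-out values and model predictions.
actual = [100, 120, 90, 110]
preds = [98, 125, 92, 108]

print(mean_absolute_error(actual, preds))         # 2.75
print(good_enough(actual, preds, max_error=5.0))  # True
print(good_enough(actual, preds, max_error=2.0))  # False
```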