Driving Data Quality with Data Contracts by Andrew Jones
Author: Andrew Jones
Language: eng
Format: epub
Publisher: Packt
Published: 2023-11-15T00:00:00+00:00
Using a schema registry as the source of truth
The schemas we've implemented can be used by both data generators and data consumers in several different applications.
Both Apache Avro and Protocol Buffers schemas can be used to generate source code. Because both are binary formats, the data generators must use this generated code to write data that conforms to the schema and is serialized correctly. The data consumers also need the generated code to deserialize the binary representation into something their own code can understand.
While JSON Schema events are serialized in the widely used and text-based JSON format, the schemas can be loaded by libraries to help write the data in the correct format and to run the validation checks.
These schemas can also be used in Continuous Integration (CI) checks, giving both the generators and consumers confidence that their code is using the data models correctly as they develop their services.
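As a rough illustration of loading a schema and running validation checks, the sketch below checks records against a small JSON Schema document. It is a hand-rolled stand-in that understands only the `required` and `type` keywords; a real project would more likely use a full JSON Schema validator library. The `CUSTOMER_SCHEMA` document, its field names, and the `validate` helper are hypothetical and do not come from the book.

```python
import json

# Maps a subset of JSON Schema type names to Python types (illustrative only).
_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

# Hypothetical Customer schema, loaded from JSON text as it might be from a file.
CUSTOMER_SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["id", "email"],
  "properties": {
    "id": {"type": "string"},
    "email": {"type": "string"},
    "age": {"type": "integer"}
  }
}
""")


def validate(record: dict, schema: dict) -> list:
    """Return a list of error messages; an empty list means the record passed."""
    errors = []
    # Check that every required field is present.
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    # Check the declared type of each field that is present.
    for name, rules in schema.get("properties", {}).items():
        if name not in record or "type" not in rules:
            continue
        value = record[name]
        type_name = rules["type"]
        ok = isinstance(value, _TYPES[type_name])
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        if isinstance(value, bool) and type_name != "boolean":
            ok = False
        if not ok:
            errors.append(f"field {name}: expected {type_name}")
    return errors


good = validate({"id": "c-1", "email": "a@example.com", "age": 30}, CUSTOMER_SCHEMA)
bad = validate({"id": 7}, CUSTOMER_SCHEMA)
```

A CI job could run a check like this against sample records produced by a service and fail the build when the returned list is not empty.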
Furthermore, as open formats, these schemas can often be ingested into other applications or used to define resources such as a table in a data warehouse. We discuss these use cases in more detail later in this chapter, in the Defining governance and controls section.
When using the schemas across many different applications, we need to ensure they are kept in sync. For example, when one application refers to version 1 of our Customer schema, it must get exactly the same schema as any other application that refers to version 1.
We achieve this by creating a central service to store these schemas. This makes the schemas accessible to any application that needs them and acts as our source of truth for those schemas. We call this the schema registry.
Depending on our requirements, the schema registry can be as simple as a Git repository or a shared folder on a distributed filesystem such as Amazon S3. It can also be a service that presents a rich API for saving and retrieving schemas and for performing compatibility checks. Whatever we choose to use as our registry, it should be capable of the following:
Publishing a new schema
Publishing an updated version of an existing schema
Retrieving a schema with a particular version, including those superseded by a newer version
Retrieving the latest version of a schema
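The four capabilities above can be sketched as a small in-memory class. This is only a toy under stated assumptions: the class name `InMemorySchemaRegistry`, its method names, and the example Customer definitions are hypothetical, schemas are stored as opaque strings, and versions are numbered from 1 in publication order. A real registry would persist the schemas and might also run compatibility checks.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class SchemaNotFound(KeyError):
    """Raised when a schema name or version is not in the registry."""


@dataclass
class InMemorySchemaRegistry:
    # Maps schema name -> list of definitions; list index 0 holds version 1.
    _schemas: dict[str, list[str]] = field(default_factory=dict)

    def publish(self, name: str, definition: str) -> int:
        """Publish a new schema, or a new version of an existing one.

        Returns the version number assigned to the definition.
        """
        versions = self._schemas.setdefault(name, [])
        versions.append(definition)
        return len(versions)

    def get(self, name: str, version: int) -> str:
        """Retrieve a specific version, including superseded ones."""
        versions = self._schemas.get(name, [])
        if not 1 <= version <= len(versions):
            raise SchemaNotFound(f"{name} v{version}")
        return versions[version - 1]

    def latest(self, name: str) -> tuple[int, str]:
        """Retrieve the latest version number and its definition."""
        versions = self._schemas.get(name)
        if not versions:
            raise SchemaNotFound(name)
        return len(versions), versions[-1]


registry = InMemorySchemaRegistry()
v1 = registry.publish("Customer", '{"fields": ["id"]}')
v2 = registry.publish("Customer", '{"fields": ["id", "email"]}')
old_definition = registry.get("Customer", 1)       # superseded version is still available
latest_version, latest_definition = registry.latest("Customer")
```

Keeping every published version retrievable is what lets two applications that both refer to version 1 of the Customer schema receive the same definition, even after version 2 has been published.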