Driving Data Quality with Data Contracts by Andrew Jones
Author: Andrew Jones
Language: eng
Format: epub
Publisher: Packt
Published: 2023-11-15T00:00:00+00:00
Using a schema registry as the source of truth
The schemas we've implemented can be used by both data generators and data consumers in several different applications.
Both Apache Avro and Protocol Buffers schemas can be used to generate source code. Because both are binary formats, the data generators must use this generated code to write data that conforms to the schema and is serialized correctly. The data consumers also need to use the generated code to deserialize the binary representation into something their code can understand.
While JSON Schema events are serialized in the widely used and text-based JSON format, the schemas can be loaded by libraries to help write the data in the correct format and to run the validation checks.
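As a rough illustration of loading a schema to validate and write an event, here is a minimal sketch. It is a simplified stand-in for a real JSON Schema library: it only understands the `type`, `required`, and `properties` keywords, and the `CUSTOMER_SCHEMA` fields and the `validate` and `write_event` helpers are hypothetical names invented for this example.

```python
import json

# Hypothetical Customer schema, using a small subset of JSON Schema keywords.
CUSTOMER_SCHEMA = {
    "type": "object",
    "required": ["id", "email"],
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
    },
}

# Only the JSON types used by this sketch are mapped to Python types.
_JSON_TYPES = {"string": str, "integer": int}


def validate(event, schema):
    """Return a list of error messages; an empty list means the event is valid."""
    if not isinstance(event, dict):
        return ["event must be a JSON object"]
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in event:
            continue
        expected = _JSON_TYPES[rules["type"]]
        value = event[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"field {name} must be of type {rules['type']}")
    return errors


def write_event(event, schema):
    """Serialize the event to JSON text, refusing events that fail validation."""
    errors = validate(event, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(event, sort_keys=True)


good = write_event({"id": 1, "email": "a@example.com"}, CUSTOMER_SCHEMA)
bad_errors = validate({"id": "1"}, CUSTOMER_SCHEMA)
```

In practice, a full JSON Schema validation library would replace the hand-written `validate` function; the point here is only that the same schema document drives both the check and the write.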
These schemas can also be used in Continuous Integration (CI) checks, giving both the generators and consumers confidence that their code is using the data models correctly as they develop their services.
Furthermore, as open formats, these schemas can often be ingested into other applications or used to define resources such as a table in a data warehouse. We discuss these use cases in more detail later in this chapter, in the Defining governance and controls section.
When the schemas are used across many different applications, we need to ensure they are kept in sync. When one application refers to version 1 of our Customer schema, every other application that refers to version 1 must get exactly the same schema.
We achieve this by creating a central service to store these schemas. This makes the schemas accessible to any application that needs them and acts as our source of truth for those schemas. We call this the schema registry.
Depending on our requirements, the schema registry can be as simple as a Git repository or a shared folder on a distributed filesystem such as Amazon S3, or as rich as a service that presents an API for saving and retrieving schemas and performing compatibility checks. Whatever we choose to use as our registry, it should be capable of the following:
Publishing a new schema
Publishing an updated version of an existing schema
Retrieving a schema with a particular version, including those superseded by a newer version
Retrieving the latest version of a schema
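The four capabilities above can be sketched as a small in-memory class. This is an illustrative assumption rather than the design of any particular registry product: the `SchemaRegistry` class, its method names, and the sequential integer version numbers are all hypothetical, and a real registry would persist schemas in Git, object storage, or a dedicated service.

```python
import copy


class SchemaRegistry:
    """Hypothetical in-memory registry; versions are numbered from 1 per schema name."""

    def __init__(self):
        # Maps a schema name to the ordered list of its published versions.
        self._schemas = {}

    def publish(self, name, schema):
        """Publish a new schema, or a new version of an existing one.

        Returns the version number assigned to this schema.
        """
        versions = self._schemas.setdefault(name, [])
        # Store a copy so later changes by the caller cannot alter a published version.
        versions.append(copy.deepcopy(schema))
        return len(versions)

    def get(self, name, version):
        """Retrieve a specific version, including superseded ones."""
        versions = self._schemas.get(name, [])
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return copy.deepcopy(versions[version - 1])

    def latest(self, name):
        """Retrieve the latest version as a (version number, schema) pair."""
        versions = self._schemas.get(name)
        if not versions:
            raise KeyError(f"unknown schema: {name}")
        return len(versions), copy.deepcopy(versions[-1])


registry = SchemaRegistry()
v1 = registry.publish("Customer", {"required": ["id"]})
v2 = registry.publish("Customer", {"required": ["id", "email"]})
```

Because every application asks the registry for `("Customer", 1)` rather than keeping its own copy, they all receive the same schema, and superseded versions stay retrievable after a newer one is published.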