
Driving Data Quality with Data Contracts by Andrew Jones

Author: Andrew Jones, Date: November 20, 2023

Author: Andrew Jones
Language: eng
Format: epub
Publisher: Packt
Published: 2023-11-15

Using a schema registry as the source of truth

The schemas we've implemented can be used by both data generators and data consumers in several different applications.

Both Apache Avro and Protocol Buffers schemas can be used to generate source code. Because both are binary formats, the data generators must use this generated code to write data that conforms to the schema and is serialized correctly. The data consumers also need the generated code to deserialize the binary representation into something their code can understand.

Events described with JSON Schema are serialized in the widely used, text-based JSON format, so no generated code is needed to read them. Even so, libraries can load the schemas to help write the data in the correct format and to run the validation checks.
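As a rough illustration of loading a schema and checking an event before writing it, here is a minimal sketch that uses only the Python standard library. It is not a JSON Schema implementation: it handles only the `required` and `type` keywords, and the `CUSTOMER_SCHEMA` document and the `validate` and `write_event` helpers are hypothetical names invented for this example. In practice, a dedicated validation library would do this work.

```python
import json
from typing import List

# Hypothetical Customer schema, written with JSON Schema keywords for illustration.
CUSTOMER_SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["id", "email"],
  "properties": {
    "id": {"type": "integer"},
    "email": {"type": "string"}
  }
}
""")

# Maps the JSON Schema type names used above to Python types.
# Assumption: only these two types appear in the schema being checked.
_TYPES = {"string": str, "integer": int}


def validate(event: dict, schema: dict) -> List[str]:
    """Return a list of problems; an empty list means the event passed."""
    errors = []
    for name in schema.get("required", []):
        if name not in event:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name in event:
            expected = _TYPES[rules["type"]]
            value = event[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"field {name} should be of type {rules['type']}")
    return errors


def write_event(event: dict, schema: dict) -> str:
    """Serialize the event to JSON text only if it conforms to the schema."""
    errors = validate(event, schema)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(event)
```

A data generator would call `write_event` just before publishing, and a consumer could call `validate` on each parsed event as it arrives.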

These schemas can also be used in Continuous Integration (CI) checks, giving both the generators and consumers confidence that their code is using the data models correctly as they develop their services.

Furthermore, as open formats, these schemas can often be ingested into other applications or used to define resources such as a table in a data warehouse. We discuss these use cases in more detail later in this chapter, in the Defining governance and controls section.

When using the schemas across many different applications, we need to ensure they are kept in sync. So, when one application refers to version 1 of our Customer schema, it must see exactly the same schema as any other application that refers to that version.
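One possible way to detect that two applications hold different copies of the same schema version is to compare fingerprints of the schema documents. The sketch below is an assumption for illustration, not a technique taken from the text: the `fingerprint` and `in_sync` helpers are hypothetical, and the schema is assumed to be available as a parsed JSON document.

```python
import hashlib
import json


def fingerprint(schema: dict) -> str:
    """Hash a schema so that two copies can be compared cheaply."""
    # Sorting keys and fixing separators makes the hash independent of
    # key order and whitespace in the original schema file.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def in_sync(local_copy: dict, reference_copy: dict) -> bool:
    """Return True when both copies describe exactly the same schema."""
    return fingerprint(local_copy) == fingerprint(reference_copy)
```

A check like this only reports drift after it has happened. It does not remove the need for a single place where the schemas are stored.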

We achieve this by creating a central service to store these schemas. This makes the schemas accessible to any application that needs them and acts as our source of truth for those schemas. We call this the schema registry.

Depending on our requirements, the schema registry can be as simple as a Git repository or a shared folder on a distributed filesystem such as Amazon S3, or it can be a service that presents a rich API for saving and retrieving schemas and performing compatibility checks. Whatever we choose to use as our registry, it should be capable of the following:

Publishing a new schema

Publishing an updated version of an existing schema

Retrieving a schema with a particular version, including those superseded by a newer version

Retrieving the latest version of a schema
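The four capabilities listed above can be sketched as a small in-memory class. This is a toy under stated assumptions, not a real registry service: the `InMemorySchemaRegistry` class, its method names, and the example Customer schemas are all hypothetical, versions are assumed to be simple integers starting at 1, and nothing is persisted.

```python
import copy
from typing import Dict, List


class InMemorySchemaRegistry:
    """Toy registry: every published version is kept and never overwritten."""

    def __init__(self) -> None:
        # Maps a schema name to its versions; index 0 holds version 1.
        self._schemas: Dict[str, List[dict]] = {}

    def publish(self, name: str, schema: dict) -> int:
        """Publish a new schema, or a new version of an existing one.

        Returns the version number assigned to the published schema.
        """
        versions = self._schemas.setdefault(name, [])
        # Store a copy so later changes by the caller cannot alter the registry.
        versions.append(copy.deepcopy(schema))
        return len(versions)

    def get(self, name: str, version: int) -> dict:
        """Retrieve a specific version, including superseded ones."""
        versions = self._schemas.get(name, [])
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return copy.deepcopy(versions[version - 1])

    def latest(self, name: str) -> dict:
        """Retrieve the most recently published version."""
        versions = self._schemas.get(name)
        if not versions:
            raise KeyError(f"unknown schema: {name}")
        return copy.deepcopy(versions[-1])


# Usage: publish two versions of a hypothetical Customer schema.
registry = InMemorySchemaRegistry()
v1 = registry.publish("Customer", {"fields": ["id", "email"]})
v2 = registry.publish("Customer", {"fields": ["id", "email", "country"]})
```

Because old versions are never overwritten, an application pinned to version 1 keeps resolving the same schema after version 2 is published.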
