Tutorial ======== In this tutorial, we will go over the creation of Data Vault structures, and their automatic loading through the framework. Some pre-requisites are required to get started: - General knowledge about the `Data Vault 2.0 `_ modeling technique. - :doc:`A working installation of diepvries `. - An access to a `Snowflake instance `_. You should also create a database and schemas to host the various structures and data: .. code-block:: sql CREATE DATABASE diepvries_tutorial; USE DATABASE diepvries_tutorial; CREATE SCHEMA dv; CREATE SCHEMA dv_extract; CREATE SCHEMA dv_staging; Three schemas are used in this example: - ``dv``: Where the Data Vault model is stored. - ``dv_extract``: Where the raw, extracted data lies. The tables it contains are used as input to the framework. - ``dv_staging``: Staging area for data, used internally by the framework. Data model ---------- Let's assume a simple data model consisting of customers and orders. We need Data Vault entities to represent them, let's dive into their creation and some naming conventions. Customers +++++++++ We start with customers: .. literalinclude:: snippets/h_customer.sql :language: sql We created two tables: - ``dv.h_customer``: A hub, whose table name is prefixed with ``h_``. It contains multiple columns: - ``h_customer_hashkey``: The hashkey for this entity. It is the primary key for this table. - ``r_timestamp``: The creation timestamp for a record. - ``r_source``: The data source for a record. - ``customer_id``: The business key for a record. - ``dv.hs_customer``: A hub satellite, whose table name is prefixed with ``hs_``. It contains multiple columns: - ``h_customer_hashkey``: The foreign key pointing to the hub. - ``s_hashdiff``: The hashdiff for a record. - ``r_timestamp``: The start timestamp for a record, it represents the beginning of that record's validity. - ``r_timestamp_end``: The end timestamp for a record, it represents the end of that record's validity. - ``r_source``: The data source for a record. - ``firstname`` and ``lastname``: Some actual values, and not metadata. They are the first and last names of a customer. The primary key for the hub satellite is ``(h_customer_hashkey, r_timestamp)``: The combination of a hub record and a given timestamp. At this point, you might have noticed the strong use of :doc:`naming conventions `, with prefixes and suffixes. Orders ++++++ Following those principles, let's create a hub and a hub satellite to represent orders: .. literalinclude:: snippets/h_order.sql :language: sql This is of course a simplified model: we assume a business selling only one product to customers; who can buy a different quantity of that product. Link ++++ The next step is to link customers and orders together. Let's create a link: .. literalinclude:: snippets/l_order_customer.sql :language: sql Of course, if that link had fields associated with it, we would create a link satellite. But for the purpose of this tutorial we will skip this part. Extraction table ---------------- The framework assumes you have raw data somewhere in a table, extracted from an operational system. For the purpose of this tutorial, let's generate some data. We'll pretend it comes from the operational system handling customers and orders. .. literalinclude:: snippets/extract.sql :language: sql Notice the last item: ``customer_id`` and ``lastname`` are ``NULL``. diepvries graciously handles ``NULL`` values, as we'll see further in this tutorial. Loading data ------------ Now comes the interesting part. We have prepared hubs and hub satellites for customers and orders, and have raw data waiting to be ingested in the Data Vault! Everything is set up for the framework to do its magic. Let's move from SQL to Python, and have a look at this script: .. literalinclude:: snippets/dv_load.py :language: python This is the source code to ingest into the Data Vault the data coming from our (fake) operational system. It concerns both entities, customers, and orders, hence will load data into both hubs and both satellites. A few things are going on in this script: - We prepare a :class:`~diepvries.deserializers.snowflake_deserializer.DatabaseConfiguration` to hold the Snowflake parameters. - We instantiate a :class:`~diepvries.deserializers.snowflake_deserializer.SnowflakeDeserializer`. This object will deserialize the given tables into Python objects, through Snowflake introspection. We gave as a parameter the list of target tables we're interested in, the 2 hubs and the 2 hub satellites. - We instantiate a :class:`~diepvries.data_vault_load.DataVaultLoad`. This is the main entry point to the framework. Let's look at its parameters: - ``extract_schema``: The schema where the raw data lives, ``dv_extract`` in our case. - ``extract_table``: The table that actually contains the raw data, the one we created earlier: ``order_customer``. - ``staging_schema``: The stating aread, used internally by the framework. We created ``dv_staging`` for that purpose. - ``staging_table``: The table used for storing staging data. It needs to be unique for each Data Vault script, in our case we chose the name ``order_customer``. - ``extract_start_timestamp``: The extract timestamp, i.e. now, the time when the data is loaded into the Data Vault. It needs to be timezone-aware. - ``target_tables``: The list of target tables. We computed it earlier with the Snowflake deserializer. - ``source``: The source of the data. After filling in the correct credentials for your Snowflake account, you can run this script. Be sure to execute it in the virtual environment where diepvries is installed, so it can be imported. Et voilĂ ! The Python script will print out all the SQL statements needed to load ``h_customer``, ``hs_customer``, ``h_order`` and ``hs_order`` with the data from ``dv_extract.order_customer``. After executing those, you should inspect the content of the tables: .. code-block:: sql SELECT * FROM dv.h_customer; SELECT * FROM dv.hs_customer; SELECT * FROM dv.h_order; SELECT * FROM dv.hs_order; Notice how ``h_customer`` has a row with the value ``dv_unknown`` for the column ``customer_id``. This is the ghost record, automatically created by diepvries due to a row in the input data having ``NULL`` as ``customer_id``. This happens for business IDs; other NULLable columns will retain ``NULL`` values. Most of the content of the Python script is unrelated to the data being loaded and the targeted Data Vault structures. This is the basic recipe you can use for loading data from multiple sources to multiple targets; with a few adaptations it can easily be configured to load all your data. How you organize and orchestrate multiple sources is up to you; diepvries provides the building blocks, allowing more flexibility into what you build with it. Next steps ---------- Hungry for more? Head over the :doc:`advanced topics ` or dive into the :doc:`API documentation `!