OEP-38: Data Modeling Best Practices

OEP

OEP-0038

Title

Data Modeling Best Practices

Last Modified

2019-09-19

Author

Brian Beggs <bbeggs@edx.org>

Arbiter

Albemarle

Status

Accepted

Type

Best Practices

Created

2019-05-01

Review Period

2019-08-01 - 2019/10/01

Resolution

‘Original pull request’

Abstract

In order to obtain the highest possible value from data collected in the Open edX ecosystem, this document attempts to provide guiding thoughts and principles on data modeling.

What this OEP explicitly does not cover

  • Data warehousing

  • Data pipelines

  • Event data

Definitions

BI - Business intelligence. Technologies, applications and practices for the collection, integration, analysis, and presentation of business information. The purpose of Business Intelligence is to support better business decision making.

Composite Key - In the context of relational databases, a composite key is a combination of two or more columns in a table that can be used to uniquely identify each row in the table. Uniqueness is only guaranteed when the columns are combined; when taken individually, the columns do not guarantee uniqueness.

CUD - Create, Update, Delete. These are some of the actions that may happen to data in a database system.

CRUD - Create, Read, Update, Delete. Same as CUD but with the Read operation added.

Data Best Practices - These practices are designed to help teams create rich and efficient data models within the Open edX ecosystem. They are not standards but guidelines to help teams think about how to store data.

Data Dimension - A Data Dimension is a set of data attributes pertaining to something of interest to a business. Dimensions are things like “customers”, “products”, “stores” and “time”. For users of Data Warehouses, data dimensions are entry points to numeric facts (e.g. sale, profit, revenue) that a business wishes to monitor.

Data Modeling - A process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems in organizations.

Data Modeling Standards - The most basic and standard design principles for data modeling. These standards must be adhered to when creating new models or updating existing models in the Open edX ecosystem.

IDA - Independently Deployed Application. Similar to a microservice or stand alone application.

Open edX Ecosystem

Used in this document, the phrase “Open edX Ecosystem” includes the services that comprise and support the running of an Open edX installation (e.g.,. edx.org). This includes:

  • The LMS and Studio

  • IDAs such as Notes, Ecommerce, and Forums

  • Backend systems such as data analytics and operational logging

  • Third party services used to extend functionality of applications, such as Optimizely, Google Analytics, and Sailthru

Specification

Guiding Principles

This section is intended to give the reader a framework to think about data modeling at edX.

Data is an asset

edX uses data to decide which people should receive marketing emails, who passes or fails a course, or how much to pay our partners. Our partners use our data to target users and to refine and improve their courses. The data we collect today is being used to advance academic research about online learning and pedagogy. Decision making at edX should be data-driven and based on this collected data.

This data is one of our most valuable assets and it should be a first-order concern. Save everything (disk is cheap). The more data we are able to collect about users, behavior, and system state now, the more opportunities we’ll have to improve our decision-making in the future.

Use industry best practices

Data stored in the OpenEdX ecosystem should adhere to industry best practices. For example, since an industry best practice uses numerical identifiers to identify rows of data, the Open edX ecosystem should also use numerical identifiers. Adhering to industry best practices and the practices outlined in this document ensures our data is approachable for experienced engineers and new hires alike and allows us to leverage 3rd party tools more easily to assist in the analysis phase.

Don’t make me think

Steve Krug, the author of “Don’t Make Me Think”, says: “Your objective should always be to eliminate instructions entirely by making everything self-explanatory, or as close to it as possible.” This principle should also be applied to data modeling. Data models should have descriptive names clearly identifying the data that the model holds. The field names should make sense to most people familiar with the domain, without needing much clarifying documentation.

Consider the analytics user

For every hour of engineer-time spent creating a data model, many more are spent using the model and analyzing its data. Taking the time to consider how the data will be used, and thoughtfully design a data representation, can potentially save many hours during the analysis phase.

This is especially true since changing data models, once they are in use, can be time-consuming. A data model change in a core application can affect many other systems. Small changes to a data model could possibly cause hours or days of work for different teams throughout the organization. An example of the types of work that may occur while changing a model:

  • Data engineering and Analytics to update their workflows

  • BI teams to update reports that utilize this data,

  • Devops to deploy migrations to large tables, with the risk of

    potential downtime.

Due to the potentially high cost of changing a model, it pays to get it as correct as possible the first time.

Think about the person who will analyze this data later and the person who manages the system day to day. (That person could possibly be future you!) Try to do as much careful design up front to make your quality of life better later.

History is best (but events are a close second)

Historical data accuracy is best when history about a change is captured at the database level. Changes in data relating to financial systems (e.g. enrollments, payments, course price changes) should be stored historically where the change is made, preferably in the same system as the system of record. This allows us to reconstruct the data at different points with much greater reliability.

If, for some reason, it is not possible to create history at the time of entry (for example, if we expect the table to become too large or the writes to be too frequent), it is recommended that an event be issued in its place.

Data Modeling Standards

The standards below are designed to ensure edX can gain the highest value and insights from the data. The application of these standards is the most basic level of support to which all applications in the Open edX ecosystem should adhere. When creating new applications or models please ensure the models being created conform to the following.

Use a numeric primary key

  • Do not use composite based primary keys. Use a primary key column.

Have updated and created timestamps

  • The preferred method for doing this in Open edX Django applications is to inherit the TimeStampedModel class.

  • Time should be stored in UTC time by setting USE_TZ=True in your python config.

  • If for some reason you can not inherit from TimeStampedModel use the following naming conventions:

    • Created date should be named: “created”

    • Updated date should be named: “modified”

Foreign keys

  • Data should be joined using primary keys wherever possible

  • Foreign keys should use a naming convention of <object_name>_id where object name is the name of the table of the foreign key relationship.

  • Do not join on things such as username, email address, or other dimensions of data that may change over time

  • Do not join on PII

  • Joining between IDAs should be done by using a universally unique identifier (UUID)

  • In Django use Attributes for fields with relations to identify and link models with relationships.

History for models involved with enrollments, courses and course metadata, or data involving financial payments and transactions

  • Where Django simple history is not an option, the following data should be captured:

    • Fields that were changed

    • Date & time of the change

    • The foreign key of the user who initiated the change

Use the correct data type for a column

  • Don’t use a IntegerField when a BooleanField would do.

  • Use BigIntegerField for your foreign keys

  • Don’t store an Integer field as CharField.

  • Store UUID’s as UUIDField with a max length that matches the max length of the UUID.

Each column in a table should only store a single fact or dimension

  • If a column could be a mix of integer data and character data it is best to store these items as 2 different fields in the database

Models should be Annotated for PII

Have sane default values

  • A model should have default values that make sense for the application

  • CharField should be defined with the null=True option.

  • This is to improve post processing. Defaulting CharField to null enables us to better derive the intent of the user. If the field is null no intent was made to enter that field. If the field is blank a string was entered and was modified later by the user.

  • For example if you are adding a boolean to flag that a learner has not yet activated their account, the default value should be set to False, not None.

The database layer should preserve uniqueness

  • If a model needs to preserve uniqueness between many fields use unique_together.

Best Practices

These practices are designed to help teams create rich and efficient data models within the OpenedX ecosystem. They are not standards but guidelines to help teams think about how they are storing data.

Columns should store only exactly what the name describes

  • The name of a column in a table should accurately describe the data in that table.

  • If a column is named course_id it should only store the course_id. Not the course_key, not a numeric value, not a timestamp. Just the course_id. Conversely if a column is named course_run_key it should store the course run key, not the course_id.

Deleting data

  • It is better to have a column to mark the record as inactive than to remove the data from the system using the SQL delete keyword. These models should use Django’s SoftDeletableModel.

Don’t trap the data

  • Each piece of information should have its own column. Avoid storing

    data in blob fields or as JSON in the database.

  • Another example is a concatenated string with a separator. It is best to treat these data items in 2 distinct fields.

  • Don’t store encoded (pickle, json, other) objects in the database! If

    you need to run the python environment to decode the data, analysts who use SQL will have a difficult time querying and decoding this data.

Store everything

  • Storage is cheap!

  • If you are unsure whether you should store something in the database or add history the answer is almost always yes. Store the data and add history. It can be removed later if it is found unnecessary.

  • Still not sure? The default answer is yes.

Use methods on models to access data

  • CRUD operations should access models via methods on models (where they exist), instead of querying managers directly.

  • For example, prefer creating something like

    CourseEnrollment.is_enrolled(…) rather than having views check CourseEnrollment.objects.filter(…).exists().

  • This allows us to more easily make internal representation changes

    like switching to using a “deleted” flag instead of deleting the row.

  • This also reduces the likelihood that people will query models in a

    non-performant way (e.g. sorting by an unindexed field).

Enforce logical constraints at the database layer

  • Don’t allow impossible states to be represented in the database.

  • If your code expects a 1:1 relationship, use Django’s Unique instead of trying to enforce the constraint in

    Python.

  • Python will not save you from race conditions. Database constraints will.

  • For example, an enrollment should have a unique constraint on

    (course_id, user_id), since a given user should only have one enrollment per course. In this case you should use Django’s unique_together.

Keep indexes to a minimum

  • Create indexes only on the fields necessary to make queries performant

  • Keep in mind that indexes cost space and have their own set of performance concerns.

  • Over-indexing data could actually make the database less performant (slower writes/updates)

Developer Responsibility

It is the responsibility of the developer to adhere to all of the standards in the Data Modeling Standards section of this document.

Code Reviewer Responsibility

The code reviewer is responsible for ensuring the standards set forth in the Data Modeling Standards section of this document are met.

Responsibility for Third-party Service Integrations

Adhere to the same standards.

Backward Compatibility

Data models that are not within the standards of this document do not need to be updated to adhere to OEP-38 standards.