Data Modeling Best Practices
Brian Beggs <email@example.com>
2019-08-01 - 2019/10/01
‘Original pull request’
In order to obtain the highest possible value from data collected in the Open edX ecosystem, this document attempts to provide guiding thoughts and principles on data modeling.
BI - Business intelligence. Technologies, applications and practices for the collection, integration, analysis, and presentation of business information. The purpose of Business Intelligence is to support better business decision making.
Composite Key - In the context of relational databases, a composite key is a combination of two or more columns in a table that can be used to uniquely identify each row in the table. Uniqueness is only guaranteed when the columns are combined; when taken individually, the columns do not guarantee uniqueness.
CUD - Create, Update, Delete. These are some of the actions that may happen to data in a database system.
CRUD - Create, Read, Update, Delete. Same as CUD but with the Read operation added.
Data Best Practices - These practices are designed to help teams create rich and efficient data models within the Open edX ecosystem. They are not standards but guidelines to help teams think about how to store data.
Data Dimension - A Data Dimension is a set of data attributes pertaining to something of interest to a business. Dimensions are things like “customers”, “products”, “stores” and “time”. For users of Data Warehouses, data dimensions are entry points to numeric facts (e.g. sale, profit, revenue) that a business wishes to monitor.
Data Modeling - A process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems in organizations.
Data Modeling Standards - The most basic and standard design principles for data modeling. These standards must be adhered to when creating new models or updating existing models in the Open edX ecosystem.
IDA - Independently Deployed Application. Similar to a microservice or stand alone application.
Used in this document, the phrase “Open edX Ecosystem” includes the services that comprise and support the running of an Open edX installation (e.g.,. edx.org). This includes:
The LMS and Studio
IDAs such as Notes, Ecommerce, and Forums
Backend systems such as data analytics and operational logging
Third party services used to extend functionality of applications, such as Optimizely, Google Analytics, and Sailthru
This section is intended to give the reader a framework to think about data modeling at edX.
edX uses data to decide which people should receive marketing emails, who passes or fails a course, or how much to pay our partners. Our partners use our data to target users and to refine and improve their courses. The data we collect today is being used to advance academic research about online learning and pedagogy. Decision making at edX should be data-driven and based on this collected data.
This data is one of our most valuable assets and it should be a first-order concern. Save everything (disk is cheap). The more data we are able to collect about users, behavior, and system state now, the more opportunities we’ll have to improve our decision-making in the future.
Data stored in the OpenEdX ecosystem should adhere to industry best practices. For example, since an industry best practice uses numerical identifiers to identify rows of data, the Open edX ecosystem should also use numerical identifiers. Adhering to industry best practices and the practices outlined in this document ensures our data is approachable for experienced engineers and new hires alike and allows us to leverage 3rd party tools more easily to assist in the analysis phase.
Steve Krug, the author of “Don’t Make Me Think”, says: “Your objective should always be to eliminate instructions entirely by making everything self-explanatory, or as close to it as possible.” This principle should also be applied to data modeling. Data models should have descriptive names clearly identifying the data that the model holds. The field names should make sense to most people familiar with the domain, without needing much clarifying documentation.
For every hour of engineer-time spent creating a data model, many more are spent using the model and analyzing its data. Taking the time to consider how the data will be used, and thoughtfully design a data representation, can potentially save many hours during the analysis phase.
This is especially true since changing data models, once they are in use, can be time-consuming. A data model change in a core application can affect many other systems. Small changes to a data model could possibly cause hours or days of work for different teams throughout the organization. An example of the types of work that may occur while changing a model:
Data engineering and Analytics to update their workflows
BI teams to update reports that utilize this data,
- Devops to deploy migrations to large tables, with the risk of
Due to the potentially high cost of changing a model, it pays to get it as correct as possible the first time.
Think about the person who will analyze this data later and the person who manages the system day to day. (That person could possibly be future you!) Try to do as much careful design up front to make your quality of life better later.
Historical data accuracy is best when history about a change is captured at the database level. Changes in data relating to financial systems (e.g. enrollments, payments, course price changes) should be stored historically where the change is made, preferably in the same system as the system of record. This allows us to reconstruct the data at different points with much greater reliability.
If, for some reason, it is not possible to create history at the time of entry (for example, if we expect the table to become too large or the writes to be too frequent), it is recommended that an event be issued in its place.
The standards below are designed to ensure edX can gain the highest value and insights from the data. The application of these standards is the most basic level of support to which all applications in the Open edX ecosystem should adhere. When creating new applications or models please ensure the models being created conform to the following.
It is recommended to use BigAutoField.
Do not use composite based primary keys. Use a primary key column.
The preferred method for doing this in Open edX Django applications is to inherit the TimeStampedModel class.
Time should be stored in UTC time by setting USE_TZ=True in your python config.
If for some reason you can not inherit from TimeStampedModel use the following naming conventions:
Created date should be named: “created”
Updated date should be named: “modified”
Data should be joined using primary keys wherever possible
Foreign keys should use a naming convention of
<object_name>_idwhere object name is the name of the table of the foreign key relationship.
Do not join on things such as username, email address, or other dimensions of data that may change over time
Do not join on PII
Joining between IDAs should be done by using a universally unique identifier (UUID)
In Django use Attributes for fields with relations to identify and link models with relationships.
History for models involved with enrollments, courses and course metadata, or data involving financial payments and transactions#
History using django-simple-history.
Remember to backfill history for existing models.
Where Django simple history is not an option, the following data should be captured:
Fields that were changed
Date & time of the change
The foreign key of the user who initiated the change
Don’t use a
BigIntegerFieldfor your foreign keys
Don’t store an Integer field as CharField.
Store UUID’s as UUIDField with a max length that matches the max length of the UUID.
If a column could be a mix of integer data and character data it is best to store these items as 2 different fields in the database
A model should have default values that make sense for the application
This is to improve post processing. Defaulting CharField to null enables us to better derive the intent of the user. If the field is null no intent was made to enter that field. If the field is blank a string was entered and was modified later by the user.
For example if you are adding a boolean to flag that a learner has not yet activated their account, the default value should be set to False, not None.
If a model needs to preserve uniqueness between many fields use unique_together.
These practices are designed to help teams create rich and efficient data models within the OpenedX ecosystem. They are not standards but guidelines to help teams think about how they are storing data.
The name of a column in a table should accurately describe the data in that table.
If a column is named course_id it should only store the course_id. Not the course_key, not a numeric value, not a timestamp. Just the course_id. Conversely if a column is named course_run_key it should store the course run key, not the course_id.
It is better to have a column to mark the record as inactive than to remove the data from the system using the SQL delete keyword. These models should use Django’s SoftDeletableModel.
Please note that GDPR may require that data be deleted. If a field is determined to contain PII and falls under the realm of GDPR, that data should be deleted from the system, or obfuscated from the system. For more information about GDPR and how to delete user data from edx please refer to this documentation.
- Each piece of information should have its own column. Avoid storing
data in blob fields or as JSON in the database.
Another example is a concatenated string with a separator. It is best to treat these data items in 2 distinct fields.
- Don’t store encoded (pickle, json, other) objects in the database! If
you need to run the python environment to decode the data, analysts who use SQL will have a difficult time querying and decoding this data.
Storage is cheap!
If you are unsure whether you should store something in the database or add history the answer is almost always yes. Store the data and add history. It can be removed later if it is found unnecessary.
Still not sure? The default answer is yes.
CRUD operations should access models via methods on models (where they exist), instead of querying managers directly.
- For example, prefer creating something like
CourseEnrollment.is_enrolled(…) rather than having views check CourseEnrollment.objects.filter(…).exists().
- This allows us to more easily make internal representation changes
like switching to using a “deleted” flag instead of deleting the row.
- This also reduces the likelihood that people will query models in a
non-performant way (e.g. sorting by an unindexed field).
Don’t allow impossible states to be represented in the database.
- If your code expects a 1:1 relationship, use Django’s Unique instead of trying to enforce the constraint in
Python will not save you from race conditions. Database constraints will.
- For example, an enrollment should have a unique constraint on
(course_id, user_id), since a given user should only have one enrollment per course. In this case you should use Django’s unique_together.
Create indexes only on the fields necessary to make queries performant
Keep in mind that indexes cost space and have their own set of performance concerns.
Over-indexing data could actually make the database less performant (slower writes/updates)
It is the responsibility of the developer to adhere to all of the standards in the Data Modeling Standards section of this document.
The code reviewer is responsible for ensuring the standards set forth in the Data Modeling Standards section of this document are met.
Adhere to the same standards.
Data models that are not within the standards of this document do not need to be updated to adhere to OEP-38 standards.