For: software architects, DBAs and backend developers
Databases get a lot of attention, and rightly so. Code and database live in a mutual relationship: the code grows more and more dependent on the database, and the database grows with the demand for new functionality in the code.
In order to be as resilient as possible to future changes, in RDBMS-land one should strive to get their data into the highest normal form that is practical. If you've already decided that you need an RDBMS for the job, take this advice: ignore the usual wisdom of 'never design for the future'. The future of your application will look very grim if you don't carefully consider your data schema (I'm assuming considerable longevity and expansion of functionality of the application here).
So.. What’s this blog post about?
There is tight coupling between a database structure and an application. There's no way around it, and that's okay. What I think is not okay is defining the rules about the relations more than once: that is prone to discrepancies in logic between the database and your codebase. There can be countless triggers and foreign key constraints that block, cascade or set to null, and you won't know about them unless you perform your action and analyze the result, or you mimic the database logic in your code (which will break over time due to those discrepancies).
Making the issue visible
Let’s first fire up a database
╭─tim@The-Incredible-Machine ~
╰─➤ sudo apt-get install mysql-server-5.7
And populate it with some data. I've found this example database on GitHub. If you also use it to learn from or with, please give the repo a star so people know they aren't putting stuff out there for nothing.
╭─tim@The-Incredible-Machine ~/Git
╰─➤ git clone git@github.com:datacharmer/test_db.git
Cloning into 'test_db'...
remote: Counting objects: 94, done.
remote: Total 94 (delta 0), reused 0 (delta 0), pack-reused 94
Receiving objects: 100% (94/94), 68.80 MiB | 1.71 MiB/s, done.
Resolving deltas: 100% (50/50), done.
Checking connectivity... done.
╭─tim@The-Incredible-Machine ~/Git/test_db ‹master›
╰─➤ mysql -u root -p < employees.sql
Enter password:
INFO
CREATING DATABASE STRUCTURE
INFO
storage engine: InnoDB
INFO
LOADING departments
INFO
LOADING employees
INFO
LOADING dept_emp
INFO
LOADING dept_manager
INFO
LOADING titles
INFO
LOADING salaries
data_load_time_diff
00:01:02
In order to see what we've got, I reverse-engineered the database diagram from the database itself. This sounds harder than it actually is. Open MySQL Workbench, make sure you've established a connection with your running MySQL service, go to the "Database" menu and use the "Reverse Engineer" feature. Workbench will create a diagram for you, a so-called EER (Enhanced Entity-Relationship) diagram.

EER Diagram of example database
We can see that all relations cascade when a delete occurs and restrict when an update occurs. So when an employee gets deleted, he or she will be removed from the department, will no longer be a manager, and all history of salaries and titles will be removed.
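To make those rules concrete, here is a minimal sketch of the same cascade/restrict behavior. It uses Python's built-in sqlite3 instead of MySQL, and a stripped-down two-table schema that only mimics the employees/salaries relation, so treat the table definitions as assumptions, not the real ones:

```python
import sqlite3

# Assumed mini-schema mimicking employees/salaries (SQLite, not MySQL,
# purely to illustrate ON DELETE CASCADE / ON UPDATE RESTRICT).
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
con.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY)")
con.execute("""
    CREATE TABLE salaries (
        emp_no INTEGER NOT NULL,
        salary INTEGER NOT NULL,
        FOREIGN KEY (emp_no) REFERENCES employees (emp_no)
            ON DELETE CASCADE ON UPDATE RESTRICT
    )""")
con.execute("INSERT INTO employees VALUES (110022)")
con.execute("INSERT INTO salaries VALUES (110022, 71166)")

# Deleting the employee silently cascades into salaries:
con.execute("DELETE FROM employees WHERE emp_no = 110022")
print(con.execute("SELECT COUNT(*) FROM salaries").fetchone()[0])  # 0

# Updating a referenced key, however, is blocked by the RESTRICT rule:
con.execute("INSERT INTO employees VALUES (1)")
con.execute("INSERT INTO salaries VALUES (1, 60000)")
try:
    con.execute("UPDATE employees SET emp_no = 2 WHERE emp_no = 1")
except sqlite3.IntegrityError as exc:
    print("update blocked:", exc)
```

Note that the cascade happens without any feedback at all: the only way to find out what was removed is to go looking afterwards.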
So now let’s look up one single department manager
mysql> select * from dept_manager limit 1;
+--------+---------+------------+------------+
| emp_no | dept_no | from_date  | to_date    |
+--------+---------+------------+------------+
| 110022 | d001    | 1985-01-01 | 1991-10-01 |
+--------+---------+------------+------------+
1 row in set (0,00 sec)
And ask the database to explain its plan for deleting the employee that we know is a manager:
mysql> explain delete from employees where emp_no = 110022;
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
| id | select_type | table     | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra       |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
|  1 | DELETE      | employees | NULL       | range | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | Using where |
+----+-------------+-----------+------------+-------+---------------+---------+---------+-------+------+----------+-------------+
1 row in set (0,00 sec)
So what am I missing?
I am missing an EXPLAIN-like function that doesn't give me meta-info about the query optimizer, but actual info on what would happen, in terms of relations, if I were to remove an entity from my database with a specific query.
I would expect that it would return data like:
- which table
- how many records will be affected
- what would be the operation (restrict, cascade, null)
And just like an EXPLAIN can be extended, running this in an extended mode could also yield a group-concatenated list of primary keys, separated by commas and with enough CSV niceties to not break the string (string delimiters, escape characters, whatever you feel is needed). Imagine the fanciness you could unleash by actually informing your user which record is the culprit.
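Lacking such a feature, you can approximate it in application code by walking the foreign-key metadata yourself. Here is a rough proof of concept in Python on SQLite (in MySQL, information_schema would play the metadata role); `delete_impact` and the mini-schema are made-up names, and it only handles single-column, one-level-deep relations:

```python
import sqlite3

def delete_impact(con, table, pk_value):
    """Rough sketch of the wished-for 'explain delete': for one row about
    to be deleted, report each child table, how many records would be
    affected, and the delete rule (CASCADE, RESTRICT, SET NULL, ...).
    Single-column FKs, one level deep only."""
    report = []
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for child in tables:
        # foreign_key_list rows: (id, seq, parent_table, from_col, to_col,
        #                         on_update, on_delete, match)
        for fk in con.execute(f"PRAGMA foreign_key_list({child})"):
            if fk[2] == table:
                count = con.execute(
                    f"SELECT COUNT(*) FROM {child} WHERE {fk[3]} = ?",
                    (pk_value,)).fetchone()[0]
                report.append((child, count, fk[6]))
    return report

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (emp_no INTEGER PRIMARY KEY);
    CREATE TABLE salaries (
        emp_no INTEGER REFERENCES employees (emp_no) ON DELETE CASCADE,
        salary INTEGER);
    CREATE TABLE dept_manager (
        emp_no INTEGER REFERENCES employees (emp_no) ON DELETE CASCADE,
        dept_no TEXT);
    INSERT INTO employees VALUES (110022);
    INSERT INTO salaries VALUES (110022, 71166), (110022, 74333);
    INSERT INTO dept_manager VALUES (110022, 'd001');
""")
print(delete_impact(con, "employees", 110022))
# e.g. [('salaries', 2, 'CASCADE'), ('dept_manager', 1, 'CASCADE')]
```

This is exactly the kind of bookkeeping I'd rather have the database do for me, since it already owns all of this metadata.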
Name me some examples where I need this!
Let's imagine we've created a CRM-like system. For now, say we have:
- organizations
- addresses
- invoices
- contacts
- notes
Some questions that might arise are:
- Can I delete the organization, or is there still an unprocessed invoice attached (a blocking constraint)? I'd rather hide the delete button and explain why the action cannot be done, than let the user attempt it only to conclude there is a problem and roll back.
- When I delete an organization, what will go with it?
- Will processed invoices also be removed? (Hope not! Better to set to null in this case; the invoice should, besides the org. id, also store a copy of all relevant organization data.)
- Will unprocessed invoices be removed? (Maybe best to block in this situation, since there is still money potential, or reason to assume the user doesn't want this.)
- Will notes be removed? (You could do this, but only if you are very verbose about it.)
- Will contacts be removed? (Often you don't, but since these kinds of relations are often many-to-many, the coupling records themselves should be removed.)
- What do I actually know about this relation? What exactly is connected? It would be super nice to dynamically create graphs of how data is connected and how relevant that data actually is.
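To sketch the first question from that list: the blocking check can be done up front, so the UI can hide the delete button entirely instead of rolling back later. Everything here is hypothetical (the table names, the `processed` flag, and SQLite standing in for MySQL):

```python
import sqlite3

# Hypothetical CRM fragment: organizations and their invoices, where an
# unprocessed invoice is the blocking relation from the list above.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE organizations (org_id INTEGER PRIMARY KEY)")
con.execute("""
    CREATE TABLE invoices (
        invoice_id INTEGER PRIMARY KEY,
        org_id INTEGER REFERENCES organizations (org_id) ON DELETE RESTRICT,
        processed INTEGER NOT NULL DEFAULT 0
    )""")
con.execute("INSERT INTO organizations VALUES (1)")
con.execute("INSERT INTO invoices VALUES (100, 1, 0)")

def can_delete_org(con, org_id):
    # Pre-flight check: any unprocessed invoice blocks the delete,
    # so don't even offer the button.
    blocking = con.execute(
        "SELECT COUNT(*) FROM invoices WHERE org_id = ? AND processed = 0",
        (org_id,)).fetchone()[0]
    return blocking == 0

print(can_delete_org(con, 1))  # False: one unprocessed invoice blocks it
```

Note the duplication this forces on you: the RESTRICT rule already lives in the schema, yet the check has to be restated by hand in the application, which is precisely the problem described above.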
Why solve it in the database?
I have worked on two applications that each contained in excess of 150 tables, all interconnected, and I can really say that investing significant effort in thinking through your database schema isn't a luxury; it's a must.
A database contains atomic values. It lays the foundation that your code works on. That inherently means that whatever we can do at that level to protect it, we should.
A couple of advantages of solving as much as possible in the database:
- If multiple tools or applications connect to the database, they will all have to follow the same conventions
- whatever’s already in the database, doesn’t have to be transferred towards the database, and thus saves bandwidth.
- Foreign key constraints use indexed columns by nature. Your queries will run faster, since the columns on which your tables relate to each other are already indexed.
- You don’t have to rely on super-transparent naming (you should do that anyhow by the way), but everyone that looks to your database will understand how tables relate to each other.
What’s next?
With regard to database-related topics, there are a couple that I'd like to cover in the future:
- Caches
- microcaches, denormalizing data
- cache-renewal strategies
- Versioning of datastructures
- Query optimization
- do’s and don’ts on writing of queries
- methods to optimize
Do you think that I’m off, missing some recent progression in this area or just want to chat? Drop me a line and let me know what you think!