PlanetScale & Vitess: Referential Integrity With Legacy Sharded Databases

18 April, 2022

7 minutes
1569 words

I love serverless technology. I play around and make lots of different serverless applications to experiment around with other cool technology. Within the huge cluster of technologies I use/experiment with, PlanetScale was the database that I primarily used for my personal side-projects, as there wasn't any other "good" option which the Prisma ORM supported.

PlanetScale is a MySQL serverless platform which simply sells Vitess, a database clustering system for the horizontal scaling of MySQL. They didn't write their own database - possibly contributed to it, but they didn't write it. From the Vitess documentation:

Vitess was created in 2010 to solve the MySQL scalability challenges that the team at YouTube faced.

In this article, we'll be moving towards understanding the structure of these non-ACID legacy sharded databases, why they're unable to support something as crucial as referential integrity, and why we should avoid using them in our applications. This article is more about the technology of Vitess, although I've included PlanetScale in the title because, as I mentioned above, it's just selling Vitess (with some tooling) as a service and they've gained traction in the following months as being a "reliable" serverless database.

#

Background

My initial question was why it says that it's impossible to scale a PlanetScale database with referential integrity as in their documentation it states that:

The way FOREIGN KEY constraints are implemented in MySQL (or, rather, in the InnoDB storage engine) interferes with Online DDL operations. Learn more in this Vitess blog post.

Limited to single MySQL server scope, FOREIGN KEY constraints are impossible to maintain once your data grows and is split over multiple database servers. This typically happens when you introduce functional partitioning/sharding and/or horizontal sharding.

This led me to think: do FOREIGN KEY constraints affect scalability in general? and if so, how?

I do think it's important to realise that SQL table joins are pretty costly, but to my knowledge it wasn't affected much by referential integrity? Now, if we're doing something like data analysis, obviously we don't have a need for referential integrity as we'd just want to dump our data into a single table, but PlanetScale and Vitess boast about being used by big web applications such as YouTube.

This led me to be confused as to why they'd drop the FOREIGN KEY constraint as databases such as CockroachDB and Spanner still maintain referential integrity along with being scalable.

#

What is referential integrity, and why is it important?

Let's start with the basics, in case you're new. I'm guessing most people reading this post have a fair idea of what they're talking about, but I'll explain as a formality. In simple words, a FOREIGN KEY constraint is a database key which we can use to create relations between two different tables by referencing a column, or a set of columns. Referential integrity simply refers to the state of the database in which all values of all keys are valid.

#

Why is it important?

Now that we have a bit of an idea on what they are, let's skip to the second part: why are they important?

Referential integrity is important as it keeps you from introducing new errors into your database. It is a feature often provided by relational databases preventing users or applications from entering inconsistent data into the database. This leads to improved data quality, faster development, much fewer bugs, and consistency across your application.

#

Why doesn't Vitess have it?

So, to understand why Vitess is unable to support referential integrity, we have to take a dive into the architecture of the database. Vitess is a sharded non-ACID SQL database, not a true distributed ACID SQL database.

Now, you must be wondering what those terms are. Let me break them down for you: ACID is an acronym of Atomicity, Consistency, Isolation and Durability.

Here, atomicity refers to an action either completing or failing entirely - no partial completion of a transaction. Consistency refers to the transaction leaving the database in a valid state. Isolation simply means that two transactions are executed without any interference with each other, and durability means the changes of the transaction are saved.

A shard is a horizontal partition of data in a database, and each shard is held on a separate database server instance to spread the load. So when we refer to a database which is sharded, we're talking about something like this. Now as I said earlier, Vitess is a sharded non-ACID SQL database, which basically means that it DOES NOT guarantee ACID properties of transactions.

#

Why drop it?

Well, the problem starts when you have a MySQL database with a well-defined schema, and your service becomes popular with the problem of too many reads hitting the database. What most people do here is they start caching frequently executed queries, but the reads are no longer ACIDic.

Along with too many reads, having an excess amount of writes to your database is a serious problem which many might face. Let's say we're ready to set fire to our pockets - we can vertically scale, adding more RAM, a 16-core processor and loads of really fast solid-state drives.

We of course still have the problem of SQL table joins increasing in complexity, so you start denormalising to avoid joins between tables.

I gave a talk at the Prisma Meetup a while ago, where I explained the fundamentals of designing a relational database. A topic I covered here was denormalisation, if you're interested, be sure to check this out.

But denormalisation is basically the process where you add redundant data to tables in your database, which improves performance on the cost of disk space as you're no longer using CPU power for joins. While denormalisation improves the speed of reads, it's important to realise that it does make writes slower.

Nevertheless, despite all of this, our database is still slow, so we move database computations onto the client, for instance generating a UUID or assigning a date.

Even after all of this, queries will still be slow - so we keep the result of the most queried data ready in a process known as database materialisation. Now reads might be faster, but writes are getting slower by the day. The only logical situation now is to drop secondary indexes.

So at this point, our database has

  • No ACID properties because of caching
  • No normalised schema
  • No triggers
  • No database computations
  • No secondary indexes

This paved the way for Vitess and NoSQL databases, as companies were having issues with scaling their database. The way that it was designed, they weren't able to maintain data consistency, an ACID property, when transactions spanned several different shards. Referential integrity is all about consistency when data spans across multiple shards, therefore it makes sense that they're unable to support it well.

We can go deep into the structure of NoSQL databases with no FOREIGN KEY constraint and issues that we'll face adopting that model, but that's the topic for another post.

It's not just Vitess, it's been a standard practice for sharded databases to avoid referential integrity as there is simply no other choice. In terms of the ACID model, their documentation states that they guarantee atomicity but not isolation, and even go as far as to say:

Guaranteeing ACID Isolation is very contentious and has high costs. Providing it by default would have made Vitess impractical for the most common use cases.

Let's briefly talk about what ACID Isolation is. There are four levels to it (according to the SQL-92 standards), including serialisability, read committed, read uncommitted, and repeatable reads. With that being said, there are more levels of isolation, such as Snapshot isolation which isn't a SQL standard although used by several databases such as Firebase or MongoDB. If you're further interested in this, I recommend reading this post. To keep it brief, I'm not going to go over what ever level of isolation does/means, but if you'd like to read more about that do check out this page from the MySQL Documentation.

ACID isolation refers to the database transactions being ACIDic, which is important as they guarantee that operations behave the way developers expect them to. I'm unsure on what they mean when they say "Guaranteeing ACID Isolation is very contentious and has high costs", but if they mean that guaranteeing ACID Isolation has high costs for any product, they're wrong. Several distributed ACID-compliant databases have the highest level of isolation (serialisable transactions) whilst still being performant with fast read/write speeds. In the context of Vitess though, they're not wrong as across multiple shards transactions cannot meet any level of isolation.

#

Conclusion

With all of this, you must be wondering: why would anybody want to use PlanetScale or Vitess? Well, I wonder the same. With many companies and websites, the reason was that they picked Vitess back when there weren't any better options. If you go to the beginning of the article, notice how it was created back in 2010. Now that we can enjoy an ACID-compliant scalable database with referential integrity, it would be in our best interest to shift towards these new databases, and I've started doing so already! Technology changes rapidly, and keeping your database up to speed is a crucial component of any application.

Harsh’s Newsletter

Get notified in your inbox whenever I publish a new video/talk, or write a new blog post. I won’t spam, and that’s a promise!

;