Beware of incremental database Ids
Back in school when we were learning database fundamentals, we learned about
keywords to build a database table. Known as data definition language. One thing
that stood out to me was the IDENTITY
column. It allows columns to auto
generate an integer each time a new record is inserted. For Microsoft's
Transact-SQL it looks like this:
IDENTITY(0,1)
This takes two arguments. The first is the seed, meaning the starting value, and
second is the increment amount. So the first inserted row will have Id
of 0
and adds 1 for each subsequent insert.
I think this is really convenient. There is no need to specify the Id
value
when inserting new row items. At the time I couldn't see any cases where this
would be bad, but now I have learned of some costs to this convenience. There
are some real world situations where using auto-generated incremental primary
keys are not such a good idea.
Danger #1
It becomes a challenge to insert new data when IDENTITY
columns need to be
managed between environments or databases.
Here's the problem. Each developer has a DEV environment with the same Id
values from the PROD branch. When two developers insert different new data, the
Id
values that are generated will be same. We don't know which feature will be
finished first so it's a race to production.
Let's see a concrete example. Given a table of VehicleType
.
# dbo.VehicleType
| Id | Description |
| -- | ----------- |
| 0 | car |
| 1 | boat |
In the DEV environment, both developers have this table. When one developer adds
a new vehicle: plane
, the new Id
is 2. When another developer adds the
vehicle motorcycle
the new Id
generated is also 2. The Id
values are the
same, but with different data. They can't both be 2 in production so the data
that goes to PROD first will claim the Id
. This makes development less
flexible because we don't know which developer will finish their feature and go
to PROD first. Things happen, and some features need to be delayed and their
releases postponed.
What makes this even worse is when the VehicleType.Id
is in hard-coded into
the code base. You probably know what I mean. There's always that Enum, Types,
or Constants or similarly suffixed file somewhere to be found in an enterprise
application(there's got to be a better way, maybe a future post on that). If you
lose the race to PROD, you have to change your code and test again.
// VehicleTypes.cs
public enum VehicleTypes
{
Car,
Boat,
Plane // or Motorcyle
}
This requires coordination, increases the risk of mistakes, and will cost the project more money when making changes. You now have to involve DBA and other developers to make sure you aren't stepping on each others' toes. Hopefully you aren't in separate teams, and hopefully those separate teams aren't in separate companies developing or maintaining that app at the same time.
Danger #2
It is possible to enumerate and scrape APIs, because IDENTITY
are incremental
and predictable.
https://api.cool-machines.com/vehicles/1
Do you see the number 1
at the end? Anyone can try a different number and see
if they can get another result. When exposing data in an web API, you probably
don't want to expose the IDENTITY
in the URL.
This is similar to what happened to Gravatar's recent breach. Being able to enumerate your data, even if it's public can be considered a breach.
Solution
Use UUID
by default. UUID
or GUID
are unique string of alpha-numeric
characters that are randomly generated. This protects from both dangers because
you can't guess the next number and you can't accidentally generate a the same
one. This will allow developers to insert Plane
and Motorcycle
in any order
without worrying about clashing in the database.
In the DB switch the column type to uniqueidentifier
or varchar(36)
. In the
C# code, we can't just use GUID
as enum
so we have to switch the enum
into
a class
like this:
// VehicleTypes.cs
public static class VehicleTypes
{
public const Guid Car = new Guid("...");
public const Guid Boat = new Guid("...");
// add in any order, git can merge this easy
public const Guid Plane = new Guid("...");
public const Guid Motorcycle = new Guid("...");
}
This also fixes the enumeration and scraping issue. You could even expose the UUID in the URL if you want. People could try to plop UUIDs at the end of the URL but the collision probability is so low that it's not worth the time.
API Urls will now look like this:
https://api.cool-machines.com/vehicles/bdc8971c-7d8d-476f-9a4b-4c45d1fb7ea2
Conclusion
I think it's safe to avoid using IDENTITY
columns and use UUID
by default. I
have yet to come across a use-case where the benefits of IDENTITY
are better
than UUID
but will keep my mind open. The one downside of using UUID
is
typing them out when debugging. Please let me know if you have a different
opinion because I could be completely wrong here.
Thanks for reading!
Disclaimer
In my examples above, the dangers exist because of a multiple factors. It is not
that IDENTITY
should not be used, but in order to accomodate the certain
development lifecycles, I have witnessed some real world inconveniences. As for
enumeration vulnerabilities, it is not exclusive to IDENTITY
columns. With the
proper security, it's fine to expose sequential numeric Id
values. Designing
Web API endpoints requires careful consideration on who will be consuming it.