Microsoft updates its planet-scale Cosmos DB database service

Cosmos DB is undoubtedly one of the most interesting products in Microsoft's Azure portfolio.
It's a fully managed, globally distributed multi-model database that offers throughput guarantees, a number of different consistency models
and high read and write availability guarantees.
Now that's a mouthful, but basically, it means that developers can build a truly global product, write database updates to Cosmos DB and rest
assured that every other user across the world will see those updates within 20 milliseconds or so.
And to write their applications, they can pretend that Cosmos DB is a SQL- or MongoDB-compatible database, for example. Cosmos DB officially
launched in May 2017, though in many ways it's an evolution of Microsoft's existing DocumentDB product, which was far less flexible.
These days, a lot of Microsoft's own products run on Cosmos DB, including the Azure Portal itself, as well as Skype, Office 365 and Xbox. Today,
Microsoft is extending Cosmos DB by bringing its multi-master replication feature into general availability and adding support for
the Cassandra API, which gives developers yet another option for bringing existing products to Cosmos DB, in this case those written for
Cassandra. Microsoft now also promises 99.999 percent read and write availability.
Previously, its read availability promise was 99.99 percent.
And while that may not seem like a big difference, it does show that after more than a year of operating Cosmos DB with customers, Microsoft
now feels more confident that it's a highly stable system.
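
To put the jump from 99.99 to 99.999 percent in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes availability is measured over a full 365-day year, which is an illustrative simplification and not necessarily how Microsoft's SLA defines its measurement window; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for sla in (99.99, 99.999):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% availability allows about {minutes:.2f} minutes of downtime per year")
```

Under that assumption, the extra nine shrinks the downtime budget by a factor of ten, from roughly 52.6 minutes a year to roughly 5.3.
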
Microsoft is also updating its write latency SLA and now promises less than 10 milliseconds at the 99th percentile. "If you
have write-heavy workloads, spanning multiple geos, and you need this near real-time ingest of your data, this becomes extremely attractive
for IoT, web, mobile gaming scenarios," Microsoft Cosmos DB architect and product manager Rimma Nehme told me.
She also stressed that she believes Microsoft's SLA definitions are far more stringent than those of its competitors. The highlight of the
update, though, is multi-master replication.
"We believe that we're really the first operational database out there in the marketplace that runs on such a scale and will enable globally
scalable multi-master available to the customers," Nehme said.
"The underlying protocols were designed to be multi-master from the very beginning." Why is this such a big deal? With this, developers can
designate every region they run Cosmos DB in as a master in its own right, making for a far more scalable system in terms of being able to
write updates to the database.
There's no need to first write to a single master node, which may be far away, and then have that node push the update to every other region.
Instead, applications can write to the nearest region, and Cosmos DB handles everything from there.
If there are conflicts, the user can decide how those should be resolved based on their own needs. Nehme noted that all of this still plays
well with Cosmos DB's existing set of consistency models.
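
To make the idea of user-defined conflict resolution a little more concrete, here is a minimal, self-contained Python sketch of one common strategy, last-writer-wins, with a hook for swapping in a custom resolver. This is not the Cosmos DB API: the `Write` type, the `merge` function, the timestamps and the region names are all hypothetical and only illustrate how two regions that accept writes independently can still converge on the same value.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: int  # hypothetical logical timestamp attached by the writing region
    region: str     # hypothetical region name, used only to break ties


Resolver = Callable[[Write, Write], Write]


def last_writer_wins(a: Write, b: Write) -> Write:
    """Keep the write with the higher timestamp; break ties on the region name
    so that every region deterministically picks the same winner."""
    return max(a, b, key=lambda w: (w.timestamp, w.region))


def merge(replica: Dict[str, Write], incoming: Write,
          resolve: Resolver = last_writer_wins) -> None:
    """Apply an incoming write to a regional replica, resolving any conflict."""
    current = replica.get(incoming.key)
    replica[incoming.key] = incoming if current is None else resolve(current, incoming)


if __name__ == "__main__":
    west: Dict[str, Write] = {}
    europe: Dict[str, Write] = {}

    # Two users update the same record in their nearest region at about the same time.
    w1 = Write("cart:42", "2 items", timestamp=100, region="west-us")
    w2 = Write("cart:42", "3 items", timestamp=101, region="north-europe")
    merge(west, w1)
    merge(europe, w2)

    # The regions later exchange their writes, in opposite orders.
    merge(west, w2)
    merge(europe, w1)

    print(west == europe, west["cart:42"].value)
```

Because the resolver is deterministic and does not depend on arrival order, both replicas end up holding the write with the higher timestamp. An application with different needs could pass its own `resolve` function instead, for example one that merges the two values rather than discarding one.
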
If you don't spend your days thinking about database consistency models, then this may sound arcane, but there's a whole area of computer
science that focuses on little else but how to best handle a scenario where two users virtually simultaneously try to change the same cell
in a distributed database. Unlike most other databases, Cosmos DB allows for a variety of consistency models, ranging from strong to eventual,
with three intermediary models.
And it actually turns out that most Cosmos DB users opt for one of those intermediary models. Interestingly, when I talked to Leslie Lamport,
the Turing Award winner who developed some of the fundamental concepts behind these consistency models (and the popular LaTeX document
preparation system), he wasn't all that sure that the developers are making the right choice.
"I don't know whether they really understand the consequences or whether their customers are going to be in for some surprises," he told me.
"If they're smart, they are getting just the amount of consistency that they need.
If they're not smart, it means they're trying to gain some efficiency and their users might not be happy about that." He noted that when you
give up strong consistency, it's often hard to understand what exactly is happening. But strong consistency comes with its own drawbacks, too,
chiefly higher latency.
"For strong consistency there are a certain number of roundtrip message delays that you can't avoid," Lamport noted.
The Cosmos DB team isn't just building on some of the fundamental work Lamport did
around databases, but it's also making extensive use of TLA+, the formal specification language Lamport developed in the late '90s.
Microsoft, Amazon and others are now training their engineers to use TLA+ to describe their algorithms mathematically before
they implement them in whatever language they prefer. "Because [Cosmos DB is] a massively complicated system, there is no way to ensure the
correctness of it because we are humans, and trying to hold all of these failure conditions and the complexity in any one person's, one
engineer's, head is impossible," Microsoft Technical Fellow Dharma Shukla noted.
"TLA+ is huge in terms of getting the design done correctly, specified and validated using the TLA+ tools even before a single line of code
is written.
You cover all of those hundreds of thousands of edge cases that can potentially lead to data loss or availability loss, or race conditions
that you had never thought about, but that two or three years ago after you have deployed the code can lead to some data corruption for
customers.
That would be disastrous." "Programming languages have a very precise goal, which is to be able to write code.
And the thing that I've been saying over and over again is that programming is more than just coding," Lamport added.
"It's not just coding, that's the easy part of programming.
The hard part of programming is getting the algorithms right." Lamport also noted that he deliberately chose to make TLA+ look like
mathematics, not like another programming language.
"It really forces people to think above the code level," Lamport noted and added that engineers often tell him that it changes the way they
think. As for those companies that don't use TLA+ or a similar methodology, Lamport says he's worried.
"I'm really comforted that [Microsoft] is using TLA+ because I don't see how anyone could do it without using that kind of mathematical
thinking, and I worry about what the other systems that we wind up using built by other organizations, I worry about how reliable they
are."