UCONN Stamford Google Cloud Development Platform: Chapter 5: Document Storage

Chapter 5: Document Storage

What’s document storage?
What’s Cloud Datastore?
Interacting with Cloud Datastore
Deciding whether Cloud Datastore is a good fit
Key distinctions between hosted and managed services

Document storage is a form of nonrelational storage that is conceptually different from relational databases.

Relational databases think in terms of tables containing rows, keeping all of your data in a rectangular grid.

Document databases think in terms of collections and documents.

Documents are arbitrary sets of key-value pairs.

The only thing they must have in common is the document type.

In a document database, you might have an Employees collection, which might contain two documents:

{"id": 1, "name": "James Bond"}

{"id": 2, "name": "Ian Fleming", "favoriteColor": "blue"}

Traditional table of similar data

Table 5.1. Grid of employee records

ID	Name	Favorite color
1	"James Bond"	Null
2	"Ian Fleming"	"blue"

Table 5.2. Jagged collection of employees

Key	Data
1	{id: 1, name: "James Bond"}
2	{id: 2, name: "Ian Fleming", favoriteColor: "blue"}

SELECT * FROM Employees WHERE favoriteColor != "blue"

In some document storage systems, the answer to this query is an empty set. The reason is that a missing property isn’t the same thing as a property with a null value, so the only documents considered are those that explicitly have a key called favoriteColor.
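The difference between a missing property and a null value can be sketched in a few lines of Python. This is only a toy model over plain dictionaries, not any real document database API; the two function names are made up for illustration.

```python
# The two "jagged" employee documents from Table 5.2, as plain dicts.
employees = [
    {"id": 1, "name": "James Bond"},
    {"id": 2, "name": "Ian Fleming", "favoriteColor": "blue"},
]


def not_equal_strict(docs, prop, value):
    """Only consider documents that explicitly have the property."""
    return [d for d in docs if prop in d and d[prop] != value]


def not_equal_missing_as_none(docs, prop, value):
    """Treat a missing property as None and compare it like any other value."""
    return [d for d in docs if d.get(prop) != value]


# Document-store behavior described above: nobody matches.
strict = not_equal_strict(employees, "favoriteColor", "blue")

# If "missing" were treated as an ordinary null value, James Bond would match.
relaxed = not_equal_missing_as_none(employees, "favoriteColor", "blue")
```

Under the strict rule James Bond is skipped because his document has no favoriteColor key at all, and Ian Fleming is skipped because his value is "blue".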

Document storage systems were designed with a focus on large-scale storage.

To make sure that all queries were consistently fast, the designers had to trade away advanced features like joining related data.

These systems are good at things like lookups by a single key and simple scans through the data, but they are nowhere near as full-featured as a traditional SQL database.

5.1. What’s Cloud Datastore?

Cloud Datastore is a highly scalable NoSQL database for your applications.

Cloud Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your application’s load.

Cloud Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.

Firestore in Datastore mode (Datastore) is a NoSQL document database built for automatic scaling, high performance, and ease of application development. Datastore features include:

Atomic transactions. Datastore can execute a set of operations where either all succeed, or none occur.

High availability of reads and writes. Datastore runs in Google data centers, which use redundancy to minimize impact from points of failure.

Massive scalability with high performance. Datastore uses a distributed architecture to automatically manage scaling.

Flexible storage and querying of data. Datastore maps naturally to object-oriented and scripting languages, and is exposed to applications through multiple clients. It also provides a SQL-like query language.

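Strong consistency. Firestore in Datastore mode ensures that all queries are strongly consistent. (The eventual consistency discussed in section 5.1.3 describes the original Cloud Datastore design.)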

ACID transactions guarantee four properties:

Atomicity - Changes to data are performed as if they are a single operation.

Consistency - Data is in a consistent state when a transaction starts and when it ends.

Isolation - The intermediate state of a transaction is invisible to other transactions. As a result, transactions that run concurrently appear to be serialized.

Durability - After a transaction successfully completes, changes to data persist and are not undone, even in the event of a system failure.
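Atomicity in particular is easy to picture with a small sketch. The code below is a toy, in-memory illustration of "all succeed, or none occur"; it is not how Datastore implements transactions, and the helper names and the account example are hypothetical.

```python
import copy


def run_atomically(store, operations):
    """Apply every operation to a copy; swap it in only if all succeed."""
    working = copy.deepcopy(store)
    for operation in operations:
        operation(working)  # any exception aborts the whole batch
    store.clear()
    store.update(working)


def debit(amount):
    def operation(state):
        if state["alice"] < amount:
            raise ValueError("insufficient funds")
        state["alice"] -= amount
    return operation


def credit(amount):
    def operation(state):
        state["bob"] += amount
    return operation


accounts = {"alice": 50, "bob": 0}

# Both operations succeed, so both changes become visible together.
run_atomically(accounts, [debit(30), credit(30)])
after_success = dict(accounts)

# The debit fails, so the credit that ran before it is discarded too.
try:
    run_atomically(accounts, [credit(100), debit(100)])
except ValueError:
    pass
after_failure = dict(accounts)
```

Because the failed batch only ever touched the working copy, the stored data looks exactly as it did before the batch started.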

(Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.)
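As a rough illustration of the idea of sharding, the sketch below assigns each key to one of a fixed number of shards by hashing it. Cloud Datastore handles sharding on its own, and this hashing scheme is purely an assumption for illustration, not a description of how Datastore partitions data.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

# Each record is stored only in the shard its key hashes to.
for key in ["Employee:1", "Employee:2", "Employee:3", "Employee:4"]:
    shards[shard_for(key, NUM_SHARDS)][key] = {"key": key}

total_records = sum(len(shard) for shard in shards)
```

A lookup by key only needs to visit the one shard that `shard_for` returns, which is what keeps each shard small and fast.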

Fast and highly scalable

Focus on building your applications without worrying about provisioning and load anticipation. Cloud Datastore scales seamlessly and automatically with your data, allowing applications to maintain high performance as they receive more traffic.

Diverse data types

Cloud Datastore supports a variety of data types, including integers, floating-point numbers, strings, dates, and binary data, among others.

Cloud Datastore first launched as the default way to store data in Google App Engine.

It was designed to handle large-scale data.

5.1.1. Design goals for Cloud Datastore

A large-scale storage system that makes for a great example: Gmail.

Data locality

A mail database would need to store all email for all accounts, but you wouldn’t need to search across multiple accounts.

The concept of where to put data is called data locality.

Datastore is designed in a way that allows you to choose which documents live near other documents by putting them in the same entity group.

Result-set query scale

It would be frustrating if your inbox got slower as you received more email.

Index emails as they arrive so that when you want to search your inbox, the time it takes is proportional only to the number of matching emails.

A search should take the same amount of time regardless of whether you have 1 GB or 1 PB of email data.

Automatic replication

You have to plan for the fact that sometimes servers die, disks fail, and networks go down.

Keep email data in lots of places so it’s always available.

Data written should be replicated automatically to many physical servers.

Email is never on a single computer with a single hard drive. Instead, each email is distributed across lots of places.

5.1.2. Concepts

How they fit together

Keys

A key is what Cloud Datastore uses to represent a unique identifier for anything that has been stored.

In a relational database, the unique ID is often the first column in a table.

Datastore keys have two differences from table IDs.

Datastore doesn’t have an identical concept of tables.

Instead, keys contain both the type of the data and the unique identifier.

To store employee data in MySQL, you create a table called employees.

One column in that table is id, which is a unique integer.

In Datastore, you insert some data where the key is Employee:1.

The type of the data here (Employee) is referred to as the kind.

If two keys have the same parent, they’re in the same entity group.

Parent keys are how you tell Datastore to put data near other data. (Give them the same parent!)

Keys can refer to multiple kinds in their path or the hierarchy, and the kind (type) of the data is the kind of the bottom-most piece.

Hierarchy

You can store your employee records as children of the company they work for, for example Company:1:Employee:2.

The kind of this key is Employee.

The parent key is Company:1 (whose kind is Company).

This key refers to employee #2, and because of its parent (Company:1), it’s stored near all other employees of the same company; for example, Company:1:Employee:44 will be nearby.

You can also specify key IDs as strings, such as Company:1:Employee:jbond or Company:apple.com:Employee:stevejobs.
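The key notation used in these notes (Kind:id pairs joined by colons) can be modeled with a small helper. This is a sketch of the notation only; the Key class below is hypothetical and is not the Datastore client library's key type.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Key:
    """A hierarchical key: a path of (kind, id) pairs, root first."""

    path: Tuple[Tuple[str, str], ...]

    @classmethod
    def parse(cls, text: str) -> "Key":
        parts = text.split(":")
        if len(parts) < 2 or len(parts) % 2 != 0:
            raise ValueError(f"expected Kind:id pairs, got {text!r}")
        return cls(tuple(zip(parts[0::2], parts[1::2])))

    @property
    def kind(self) -> str:
        # The kind of the key is the kind of the bottom-most piece.
        return self.path[-1][0]

    @property
    def parent(self) -> Optional["Key"]:
        return Key(self.path[:-1]) if len(self.path) > 1 else None

    @property
    def entity_group(self) -> "Key":
        # The top-most piece; for a two-level key this is just the parent.
        return Key(self.path[:1])

    def __str__(self) -> str:
        return ":".join(f"{kind}:{ident}" for kind, ident in self.path)


employee = Key.parse("Company:1:Employee:2")
colleague = Key.parse("Company:1:Employee:44")
named = Key.parse("Company:apple.com:Employee:stevejobs")
```

Here `employee` and `colleague` share the parent Company:1, so they land in the same entity group, while `named` belongs to a different one.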

Entities

The primary storage concept in Cloud Datastore is an entity.

An entity is nothing more than a collection of properties and values combined with a unique identifier called a key.

An entity can have properties of all the basic types, also known as primitives:

Booleans (true or false)
Strings (“James Bond”)
Integers (14)
Floating-point numbers (3.4)
Dates or times (2013-05-14T00:01:00.234Z)
Binary data (0x0401)

{
  "key": "Company:apple.com:Employee:jonyive",
  "name": "Jony Ive",
  "likesDesign": true,
  "pets": 3
}

Datastore also exposes some more advanced types:

Lists, which allow you to have a list of strings
Keys, which point to other entities
Embedded entities, which act as subentities

{
  "key": "Company:apple.com:Employee:jonyive",
  "manager": "Company:apple.com:Employee:stevejobs",
  "groups": ["design", "executives"],
  "team": {
    "name": "Design Executives",
    "email": "design@apple.com"
  }
}

A reference to another key is as close as you can get to the concept of foreign keys.

In the context of relational databases, a foreign key is a field (or collection of fields) in one table that uniquely identifies a row of another table or the same table.

Lists of values typically aren’t supported in relational databases.

In Datastore, if structured data doesn’t need its own entity (the equivalent of its own row in a table), you can embed it directly inside another entity using embedded entities.

Comparison with relational databases

While the Datastore interface has many of the same features as relational databases, as a NoSQL database, it varies in how it describes the relationships between data objects. Here's a high-level comparison of Datastore and relational database concepts:

Concept	Datastore	Firestore	Relational database
Category of object	Kind	Collection group	Table
One object	Entity	Document	Row
Individual data for an object	Property	Field	Column
Unique ID for an object	Key	Document ID	Primary key

What it's good for

Datastore is ideal for applications that rely on highly available structured data at scale. You can use Datastore to store and query all of the following types of data:

Product catalogs that provide real-time inventory and product details for a retailer.
User profiles that deliver a customized experience based on the

Operations

Operations are the things you can do to an entity. The basic operations are:

get—Retrieve an entity by its key.
put—Save or update an entity by its key.
delete—Delete an entity by its key.

All of these operations require the key for the entity.

If you omit the ID portion of the key in a put operation, Datastore will generate one automatically for you.
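The three operations can be sketched with an in-memory stand-in. The ToyDatastore class below is hypothetical and only mimics the behavior described above (including ID generation when the ID is omitted); it is not the Cloud Datastore client API.

```python
import itertools


class ToyDatastore:
    """In-memory stand-in that supports get, put, and delete by key."""

    def __init__(self):
        self._entities = {}
        self._ids = itertools.count(1)

    def put(self, kind, properties, entity_id=None):
        """Save or update an entity; generate an ID if none is given."""
        if entity_id is None:
            entity_id = next(self._ids)
            # Skip generated IDs that are already in use for this kind.
            while f"{kind}:{entity_id}" in self._entities:
                entity_id = next(self._ids)
        key = f"{kind}:{entity_id}"
        self._entities[key] = dict(properties)
        return key

    def get(self, key):
        """Retrieve an entity by its key, or None if it doesn't exist."""
        return self._entities.get(key)

    def delete(self, key):
        """Delete an entity by its key (a no-op if it doesn't exist)."""
        self._entities.pop(key, None)


store = ToyDatastore()

# No ID given, so the store generates one.
bond_key = store.put("Employee", {"name": "James Bond"})

# Same key again: this put is an update, not an insert.
store.put("Employee", {"name": "James Bond", "favoriteColor": "blue"}, entity_id=1)

# String IDs work the same way.
fleming_key = store.put("Employee", {"name": "Ian Fleming"}, entity_id="ifleming")
store.delete(fleming_key)
```

Every call either takes a key or returns one, which reflects the point that all basic operations are keyed.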

Indexes and queries

In a typical database, a query is nothing more than a SQL statement, such as

SELECT * FROM employees.

In Datastore, you can express queries using GQL (a query language much like SQL).

Datastore uses indexes to make a query possible (table 5.3).

Table 5.3. Queries and indexes, relational vs Datastore

Feature	Relational	Datastore
Query	SQL, with joins	GQL, no joins; certain queries impossible
Index	Makes queries faster	Makes advanced query possible

Searching a Gmail inbox quickly relies on creating an index that stays up to date whenever information changes and that you can scan through to find matching emails.

An index is nothing more than a specially ordered and maintained data set that makes querying fast.

Table 5.4. An index over the sender field

Sender	Key
eric@google.com	GmailAccount:me@gmail.com:Email:8495
steve@apple.com	GmailAccount:me@gmail.com:Email:2441

The index pulls out the sender field from emails and allows you to query over all emails with a certain sender value.

When the query finishes, all matching results have been found.
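An index like the one in table 5.4 can be sketched as a mapping from sender to entity keys. This is a simplified in-memory model; the first two keys come from table 5.4, while the third email is a hypothetical addition used only to show a query with more than one match.

```python
from collections import defaultdict

emails = {
    "GmailAccount:me@gmail.com:Email:8495": {"sender": "eric@google.com"},
    "GmailAccount:me@gmail.com:Email:2441": {"sender": "steve@apple.com"},
    # Hypothetical third email, not part of table 5.4.
    "GmailAccount:me@gmail.com:Email:9001": {"sender": "eric@google.com"},
}


def build_sender_index(entities):
    """Pull the sender field out of every email: sender -> list of keys."""
    index = defaultdict(list)
    for key, entity in entities.items():
        index[entity["sender"]].append(key)
    return index


def emails_from(index, entities, sender):
    """Look up matching keys in the index, then fetch each entity by key.

    The work done here depends on the number of matches, not on how many
    emails are stored in total.
    """
    return [entities[key] for key in index.get(sender, [])]


sender_index = build_sender_index(emails)
matches = emails_from(sender_index, emails, "eric@google.com")
```

The query never scans the full set of emails; it reads one index entry and then performs one get per matching key.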

5.1.3. Consistency and replication

A distributed storage system for something like Gmail needs to meet two key requirements: to be always available and to scale with the result set.

One protocol that Cloud Datastore happens to use involves something called a two-phase commit.

You break the changes you want saved into two phases: a preparation phase and a commit phase.

In the preparation phase, you send a request to a set of replicas, describing a change and asking the replicas to get ready to apply it.

Once all replicas confirm that they’ve prepared the change, you send a second request instructing all replicas to apply that change.

In Datastore, this second (commit) phase is done asynchronously, so some of those changes may hang around in the prepared but not yet applied state.

This arrangement leads to eventual consistency when running broad queries, where the entity or the index entry may be out of date.

For a strongly consistent query, Datastore first pushes a replica to execute any pending commits of the resource and then runs the query, resulting in a strongly consistent result.
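The two phases, and the effect of deferring the second one, can be sketched with a few in-memory replicas. This is a deliberately simplified model under stated assumptions (no failures, no networking), and the Replica class and helper functions are hypothetical rather than anything Datastore exposes.

```python
class Replica:
    """One copy of the data, with a queue of prepared-but-unapplied changes."""

    def __init__(self):
        self.data = {}
        self.prepared = []

    def prepare(self, key, value):
        # Phase 1: record the change and promise to apply it later.
        self.prepared.append((key, value))
        return True

    def apply_prepared(self):
        # Phase 2: apply every prepared change, in order.
        for key, value in self.prepared:
            self.data[key] = value
        self.prepared.clear()


def write(replicas, key, value):
    """Return once every replica has prepared; the commit happens later."""
    acks = [replica.prepare(key, value) for replica in replicas]
    if not all(acks):
        raise RuntimeError("a replica could not prepare the change")


def read(replica, key, strong=False):
    """A strong read first forces the replica to apply pending commits."""
    if strong:
        replica.apply_prepared()
    return replica.data.get(key)


replicas = [Replica() for _ in range(3)]
write(replicas, "Employee:1", {"name": "James Bond"})

stale = read(replicas[0], "Employee:1")               # commit not applied yet
fresh = read(replicas[0], "Employee:1", strong=True)  # pending commit forced
```

The ordinary read can miss the change because it is still sitting in the prepared state, while the strong read pays a little extra work up front to see it.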

Maintaining entities and indexes in a distributed system is a much more complicated task.

When saving an entity, Datastore would have two options:

Update the entity and the indexes everywhere synchronously, in which case confirming the operation will take an unreasonably long time.

Update the entity itself and the indexes in the background, keeping request latency much lower.

Datastore chose to update data asynchronously to make sure that no matter how many indexes you add, the time it takes to save an entity is the same.

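When you use the put operation, Datastore does the following under the hood:

Create or update the entity.
Determine which indexes need to change as well.
Tell the replicas to prepare for the change.
Ask the replicas to apply the change when they can.

Later, when you run a strongly consistent query, Datastore does the following:

Ensure all pending changes to the affected entity group are applied.
Execute the query.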

Datastore uses these indexes to make sure your query runs in time that’s proportional to the number of matching results found. A query goes through these steps:

Send the query to Datastore.
Search the indexes for matching keys.
For each matching result, get the entity by its key in an eventually consistent way.
Return the matching entities.

The indexes are updated in the background, so there’s no real guarantee regarding when the indexes will be updated.

This is called eventual consistency, which means that eventually your indexes will be up to date (consistent) with the data you have stored in your entities.

Listing 5.1. Example Employee entity

{
  "key": "Employee:1",
  "name": "James Bond",
  "favoriteColor": "blue"
}

SELECT * FROM Employee WHERE favoriteColor = "blue"

If the indexes haven’t been updated yet (they will eventually), you won’t get this employee back in the result.

Queries are eventually consistent, specifically because the indexes that Datastore uses to find those entities are updated in the background.

Data is written to Datastore in objects known as entities. Each entity has a key that uniquely identifies it. An entity can optionally designate another entity as its parent; the first entity is a child of the parent entity.

Now suppose you change this employee’s favorite color:

Listing 5.2. Employee entity with a different favorite color

{
  "key": "Employee:1",
  "name": "James Bond",
  "favoriteColor": "red"
}

If you run the same query right after this change, you may see different results, depending on whether the entity and index updates have been applied yet.
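The effect of a lagging index can be sketched with a toy store in which the entity is saved immediately but the index update is queued. This is a simplification for illustration (in a real distributed system the entity itself may also lag on some replicas), and the ToyStore class is hypothetical.

```python
class ToyStore:
    """Entities are saved right away; index updates wait in a queue."""

    def __init__(self):
        self.entities = {}
        self.color_index = {}    # favoriteColor -> set of entity keys
        self.pending_index = []  # (key, old entity, new entity)

    def put(self, key, entity):
        old = self.entities.get(key)
        self.entities[key] = dict(entity)
        self.pending_index.append((key, old, dict(entity)))

    def apply_pending(self):
        """Stand-in for the background work that brings the index up to date."""
        for key, old, new in self.pending_index:
            if old is not None and "favoriteColor" in old:
                self.color_index.get(old["favoriteColor"], set()).discard(key)
            if "favoriteColor" in new:
                self.color_index.setdefault(new["favoriteColor"], set()).add(key)
        self.pending_index.clear()

    def query_color(self, color):
        """Search the index for matching keys, then get each entity by key."""
        keys = sorted(self.color_index.get(color, set()))
        return [self.entities[key] for key in keys]


store = ToyStore()
store.put("Employee:1", {"name": "James Bond", "favoriteColor": "blue"})
before_index = store.query_color("blue")      # index not updated yet

store.apply_pending()
store.put("Employee:1", {"name": "James Bond", "favoriteColor": "red"})
stale_match = store.query_color("blue")       # old index entry, new entity

store.apply_pending()
settled_blue = store.query_color("blue")
settled_red = store.query_color("red")
```

The middle query is the surprising one: the index still lists the employee under "blue", so the query returns an entity whose favoriteColor is already "red".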

5.1.4. Consistency with data locality

Data locality is a tool for putting many pieces of data near each other.

Eventual consistency arises because your queries run over indexes rather than your data, and those indexes are eventually updated in the background.

An entity group is defined by keys sharing the same parent key.

If you query over a bunch of entities that all have the same parent key, your query will be strongly consistent.

Two places where Datastore shines are durability and throughput.
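One way to picture why a query within an entity group can be strongly consistent is that the system only has to catch up on pending changes for that one group before answering. The sketch below models that idea in memory; the GroupedStore class and its method names are hypothetical, not the Datastore API.

```python
class GroupedStore:
    """Pending changes are tracked per entity group (the top-level key)."""

    def __init__(self):
        self.applied = {}  # key -> entity, visible to queries
        self.pending = {}  # entity group -> list of (key, entity)

    @staticmethod
    def group_of(key):
        kind, ident = key.split(":")[:2]
        return f"{kind}:{ident}"

    def put(self, key, entity):
        group = self.group_of(key)
        self.pending.setdefault(group, []).append((key, dict(entity)))

    def _apply(self, group):
        for key, entity in self.pending.pop(group, []):
            self.applied[key] = entity

    def global_query(self, kind):
        """Eventually consistent: sees only what has already been applied."""
        return sorted(k for k in self.applied if k.split(":")[-2] == kind)

    def ancestor_query(self, ancestor, kind):
        """Strongly consistent: apply the group's pending changes first."""
        self._apply(self.group_of(ancestor))
        prefix = ancestor + ":"
        return sorted(
            k for k in self.applied
            if k.startswith(prefix) and k.split(":")[-2] == kind
        )


store = GroupedStore()
store.put("Company:1:Employee:2", {"name": "James Bond"})
store.put("Company:1:Employee:44", {"name": "Ian Fleming"})
store.put("Company:2:Employee:7", {"name": "Jony Ive"})

global_before = store.global_query("Employee")
company_one = store.ancestor_query("Company:1", "Employee")
global_after = store.global_query("Employee")
```

Catching up on a single entity group is a bounded amount of work, whereas catching up on every group before a global query would not be.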

5.5.1. Structure

Cloud Datastore excels at managing semistructured data where attributes have types.

There is no single schema across all entities (or documents) of the same kind.

Datastore also allows you to express the locality of your data using hierarchical keys (where one key is prefixed with the key of its parent).

This arrangement reflects the desire to segment data between units of isolation.

This aspect of Datastore, which enables automatic replication of your data, is what allows it to be so highly available as a storage system.

It also means that queries across all the data will be eventually consistent.

5.5.2. Query complexity

As a nonrelational storage system, Cloud Datastore doesn’t support the typical relational aspects (for example, the JOIN operator).

It does allow you to store keys that act as pointers to other stored entities, but it provides no referential integrity and no ability to cascade or limit changes involving referenced entities.

Referential integrity refers to the accuracy and consistency of data within a relationship.

In relationships, data is linked between two or more tables. This is achieved by having the foreign key (in the associated table) reference a primary key value (in the primary, or parent, table).

If you delete an entity in Cloud Datastore, anywhere you pointed to that entity from elsewhere becomes an invalid reference.
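The lack of referential integrity can be illustrated with plain dictionaries. This is a toy model, not Datastore code: the keys follow the notation used in these notes, and the entity contents and helper names are assumptions made for the example.

```python
entities = {
    "Company:apple.com:Employee:stevejobs": {"name": "Steve Jobs"},
    "Company:apple.com:Employee:jonyive": {
        "name": "Jony Ive",
        # A key stored as a property acts as a pointer to another entity.
        "manager": "Company:apple.com:Employee:stevejobs",
    },
}


def resolve(store, entity_key, prop):
    """Follow a key-valued property; returns None if the target is gone."""
    target_key = store[entity_key].get(prop)
    if target_key is None:
        return None
    return store.get(target_key)


def find_dangling(store, prop):
    """List entities whose key-valued property points at nothing."""
    return sorted(
        key for key, entity in store.items()
        if prop in entity and entity[prop] not in store
    )


jony = "Company:apple.com:Employee:jonyive"
manager_before = resolve(entities, jony, "manager")

# Nothing stops this delete, and nothing cascades or cleans up the pointer.
del entities["Company:apple.com:Employee:stevejobs"]

manager_after = resolve(entities, jony, "manager")
dangling = find_dangling(entities, "manager")
```

Any cleanup, such as a periodic scan like `find_dangling`, has to be done by the application, because the storage system will not do it.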

Certain queries require that you have indexes to enable them.

Some limitations are the consequence of the structural requirements that went into designing Cloud Datastore, whereas other limitations enable consistent performance for all queries.

5.5.3. Durability

Because Megastore (the storage system underlying Cloud Datastore) was built on the premise that you can never lose data, everything is automatically replicated and not considered saved until saved in several places.

Datastore handles this entirely on its own, meaning that the only setting for durability is as high as possible.

The side effect of this level of durability is that global queries are only eventually consistent.

To be truly durable, data needs to replicate to several places before being called saved.

5.5.4. Speed (latency)

Compared to many in-memory storage systems, Cloud Datastore won’t be as fast, for the simple reason that even SSDs are slower than RAM.

Compared to a relational database system like PostgreSQL or MySQL, Cloud Datastore will be in the same ballpark.

As your SQL database gets larger or receives more requests at the same time, it’ll likely get slower.

Cloud Datastore’s latency stays the same regardless of the level of concurrency.

Cloud Datastore certainly won’t be blazing fast like in-memory NoSQL storage systems, but it’ll be on par with relational databases.

5.5.5. Throughput

Cloud Datastore can accommodate as much traffic as you care to throw at it.

The pessimistic locking that comes with relational databases like MySQL doesn’t apply.

It’s able to scale up to many concurrent write operations.

The system accommodates this by adding more servers on Google’s side to keep up.

5.5.7. Overall

To-Do List

Table 5.8. To-Do List application storage needs

Aspect	Needs	Good fit?
Structure	Structure is fine, not necessary though.	Sure
Query complexity	We don’t have that many fancy queries.	Definitely
Durability	High—We don’t want to lose stuff.	Definitely
Speed	Not a lot.	Definitely
Throughput	Not a lot.	Sure
Cost	Lower is better for toy projects.	Definitely

For the To-Do List application, Cloud Datastore is a bit of overkill on the scalability side.

Snapagram

Summary

Document storage keeps data organized as heterogeneous (jagged) documents rather than homogeneous rows in a table.
Using document storage effectively may involve duplicating data for easy access (denormalizing).
Document storage is great for storing data that may grow to huge sizes and experience huge amounts of traffic, but it comes at the cost of not being able to do fancy queries (for example, joins that you do in SQL).
Cloud Datastore is a fully managed storage system with automatic replication, result-set query scale, full transactional semantics, and automatic scaling.
Cloud Datastore is a good fit if you need high scalability and have relatively straightforward queries.
Cloud Datastore charges for operations on entities, meaning the more data you interact with, the more you pay.
