UCONN


Chapter 5: Document Storage



  • What’s document storage?

  • What’s Cloud Datastore?

  • Interacting with Cloud Datastore

  • Deciding whether Cloud Datastore is a good fit

  • Key distinctions between hosted and managed services


Document storage is a form of nonrelational storage that is conceptually different from relational databases.

Relational databases have you think in terms of tables containing rows, keeping all of your data in a rectangular grid.

Document databases think in terms of collections and documents.

Documents are arbitrary sets of key-value pairs.

The only thing the documents in a collection must have in common is the document type.

In a document database, you might have an Employees collection, which might contain two documents:



{"id": 1, "name": "James Bond"}

{"id": 2, "name": "Ian Fleming", "favoriteColor": "blue"}


A traditional table of similar data would look like table 5.1, whereas the document collection looks like table 5.2.


Table 5.1. Grid of employee records

ID   Name            Favorite color
1    "James Bond"    Null
2    "Ian Fleming"   "blue"


Table 5.2. Jagged collection of employees

Key   Data
1     {id: 1, name: "James Bond"}
2     {id: 2, name: "Ian Fleming", favoriteColor: "blue"}


Consider a query for every employee whose favorite color is not blue:

SELECT * FROM Employees WHERE favoriteColor != "blue"


In some document storage systems, the answer to this query is an empty set. The reason is that a missing property isn’t the same thing as a property with a null value.

The only documents considered are those that explicitly have a key called favoriteColor. James Bond has no such key, and Ian Fleming’s value is "blue", so neither document matches.
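The difference between a missing property and a null value can be sketched in a few lines of Python. This is a toy model over plain dictionaries, not the Datastore API; the two function names are made up for illustration.

```python
# Toy model: documents are plain dicts. This is not the Datastore API.
employees = [
    {"id": 1, "name": "James Bond"},  # no favoriteColor key at all
    {"id": 2, "name": "Ian Fleming", "favoriteColor": "blue"},
]


def not_equal_strict(docs, prop, value):
    """Only consider documents that explicitly contain `prop`."""
    return [d for d in docs if prop in d and d[prop] != value]


def not_equal_missing_as_null(docs, prop, value):
    """Treat a missing property as if it were present with a null value."""
    return [d for d in docs if d.get(prop) != value]


strict = not_equal_strict(employees, "favoriteColor", "blue")
lenient = not_equal_missing_as_null(employees, "favoriteColor", "blue")
print(strict)   # []
print(lenient)  # [{'id': 1, 'name': 'James Bond'}]
```

The strict version mirrors the behavior described above: a document without the key is never a candidate, so the result is empty.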


To make sure that all queries were consistently fast, the designers had to trade away advanced features like joining related data.

5.1. What’s Cloud Datastore?

Cloud Datastore is a highly scalable NoSQL database for your applications.


Cloud Datastore automatically handles sharding and replication, providing you with a highly available and durable database that scales automatically to handle your application’s load.


Cloud Datastore provides a myriad of capabilities such as ACID transactions, SQL-like queries, indexes, and much more.


Firestore in Datastore mode (Datastore) is a NoSQL document database built for automatic scaling, high performance, and ease of application development. Datastore features include:


Atomic transactions. Datastore can execute a set of operations where either all succeed, or none occur.


High availability of reads and writes. Datastore runs in Google data centers, which use redundancy to minimize impact from points of failure.


Massive scalability with high performance. Datastore uses a distributed architecture to automatically manage scaling.


Flexible storage and querying of data. Datastore maps naturally to object-oriented and scripting languages, and is exposed to applications through multiple clients. It also provides a SQL-like query language.


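Strong consistency. Firestore in Datastore mode ensures that all queries are strongly consistent. The original Cloud Datastore, whose design is described in section 5.1.3, served global queries with eventual consistency.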



The ACID properties of a transaction are:

Atomicity - All changes to data are performed as if they were a single operation: either all of them happen or none do.

Consistency - Data is in a consistent state when a transaction starts and when it ends.

Isolation - The intermediate state of a transaction is invisible to other transactions. As a result, transactions that run concurrently appear to be serialized.


Durability - After a transaction successfully completes, changes to data persist and are not undone, even in the event of a system failure.
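To make atomicity concrete, here is a minimal in-memory sketch. The ToyStore class and the transfer helper are hypothetical names invented for this example; they only imitate all-or-nothing behavior and are not how Datastore implements transactions.

```python
import copy


class ToyStore:
    """In-memory stand-in for a transactional store (not the Datastore API)."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def run_transaction(self, operations):
        """Apply every operation, or none of them if any operation fails."""
        staged = copy.deepcopy(self.data)  # work on a private copy
        for op in operations:
            op(staged)                     # an exception aborts before commit
        self.data = staged                 # one swap publishes all changes


def transfer(src, dst, amount):
    """Build an operation that moves `amount` from `src` to `dst`."""
    def op(data):
        if data[src] < amount:
            raise ValueError("insufficient funds")
        data[src] -= amount
        data[dst] += amount
    return op


store = ToyStore({"alice": 100, "bob": 0})
store.run_transaction([transfer("alice", "bob", 60)])

try:
    # The second transfer fails, so the first one must not stick either.
    store.run_transaction([transfer("alice", "bob", 30),
                           transfer("alice", "bob", 30)])
except ValueError:
    pass

print(store.data)  # {'alice': 40, 'bob': 60}
```

The failed transaction leaves the balances exactly as they were after the first, successful one.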


Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
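One generic way to decide which shard holds a record is to hash its key. The sketch below assumes four shards and a hypothetical shard_for helper; it illustrates the general idea of sharding only and makes no claim about how Datastore partitions data internally.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 4  # assumed value for the example
shards = [dict() for _ in range(NUM_SHARDS)]

for key in ["Employee:1", "Employee:2", "Employee:3"]:
    # Each record lands in exactly one shard, chosen by its key.
    shards[shard_for(key, NUM_SHARDS)][key] = {"stored": True}

# Looking a record up only touches the one shard that owns its key.
owner = shards[shard_for("Employee:2", NUM_SHARDS)]
print("Employee:2" in owner)  # True
```

A stable hash such as SHA-256 is used here because Python's built-in hash() for strings changes between runs.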


Fast and highly scalable.

Focus on building your applications without worrying about provisioning and load anticipation. Cloud Datastore scales seamlessly and automatically with your data, allowing applications to maintain high performance as they receive more traffic.


Diverse data types

Cloud Datastore supports a variety of data types, including integers, floating-point numbers, strings, dates, and binary data, among others.

Cloud Datastore first launched as the default way to store data in Google App Engine.

It was designed to handle large-scale data.





5.1.1. Design goals for Cloud Datastore


A large-scale storage system like Gmail makes for a great example.

Data locality

A mail database would need to store all email for all accounts, but you wouldn’t need to search across multiple accounts.


The concept of where to put data is called data locality.


Datastore is designed in a way that allows you to choose which documents live near other documents by putting them in the same entity group.

Result-set query scale

It would be frustrating if your inbox got slower as you received more email.

To avoid this, you can index emails as they arrive so that when you want to search your inbox, the time a query takes is proportional only to the number of matching emails.

For a given number of matches, a query takes the same amount of time regardless of whether you have 1 GB or 1 PB of email data.

Automatic replication

Sometimes servers die, disks fail, and networks go down.

To cope with this, you need email data in lots of places so it’s always available.

Any data written should be replicated automatically to many physical servers.

An email is never on a single computer with a single hard drive. Instead, each email is distributed across lots of places.




5.1.2. Concepts

Keys

The most fundamental concept is the idea of a key, which is what Cloud Datastore uses to represent a unique identifier for anything that has been stored.

In a relational database, the closest equivalent is the unique ID, which is often the first column in a table.

Datastore keys have two differences from table IDs.

First, Datastore doesn’t have an identical concept of tables, so keys contain both the type of the data (the kind) and the unique identifier.

To store employee data in MySQL, you would create a table called employees with a column called id that’s a unique integer. In Datastore, the key itself carries the kind along with the ID, for example Employee:1.


Hierarchy


The second difference is that keys can be hierarchical, which ties back to data locality. For example, you can store your employee records as children of the company they work for, using a key such as Company:1:Employee:2.

The kind of this key is Employee.

The parent key is Company:1 (whose kind is Company).

This key refers to employee #2, and because of its parent (Company:1), it will be stored near all other employees of the same company; for example, Company:1:Employee:44 will be nearby.

You can also specify keys as strings, such as Company:1:Employee:jbond or Company:apple.com:Employee:stevejobs.
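The colon-separated notation used in these notes can be taken apart with a short sketch. The helper names (parse_key, kind_of, parent_of, same_entity_group) are hypothetical, and the notation itself is shorthand for the notes rather than an actual Datastore key format.

```python
from typing import List, Optional, Tuple

PathElement = Tuple[str, str]  # (kind, identifier)


def parse_key(text: str) -> List[PathElement]:
    """Split 'Kind:id:Kind:id...' into a list of (kind, id) pairs."""
    parts = text.split(":")
    if len(parts) % 2 != 0:
        raise ValueError(f"expected kind:id pairs, got {text!r}")
    return [(parts[i], parts[i + 1]) for i in range(0, len(parts), 2)]


def kind_of(path: List[PathElement]) -> str:
    """The kind of a key is the kind of its last path element."""
    return path[-1][0]


def parent_of(path: List[PathElement]) -> Optional[List[PathElement]]:
    """Everything except the last element, or None for a root key."""
    return path[:-1] or None


def same_entity_group(a: List[PathElement], b: List[PathElement]) -> bool:
    """Two keys share an entity group when they share a root element."""
    return a[0] == b[0]


key = parse_key("Company:1:Employee:2")
print(kind_of(key))    # Employee
print(parent_of(key))  # [('Company', '1')]
print(same_entity_group(key, parse_key("Company:1:Employee:44")))  # True
```

Identifiers are kept as strings here so that numeric IDs and string names such as jbond are handled the same way.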

Entities

The primary storage concept in Cloud Datastore is an entity.

An entity is nothing more than a collection of properties and values combined with a unique identifier called a key.

An entity can have properties of all the basic types, also known as primitives:

  • Booleans (true or false)

  • Strings (“James Bond”)

  • Integers (14)

  • Floating-point numbers (3.4)

  • Dates or times (2013-05-14T00:01:00.234Z)

  • Binary data (0x0401)


{
  "__key__": "Company:apple.com:Employee:jonyive",
  "name": "Jony Ive",
  "likesDesign": true,
  "pets": 3
}

Datastore also exposes some more advanced types:

  • Lists, which allow you to have a list of strings

  • Keys, which point to other entities

  • Embedded entities, which act as subentities

{
  "__key__": "Company:apple.com:Employee:jonyive",
  "manager": "Company:apple.com:Employee:stevejobs",
  "groups": ["design", "executives"],
  "team": {
    "name": "Design Executives",
    "email": "design@apple.com"
  }
}

A reference to another key is as close as you can get to the concept of foreign keys. In the context of relational databases, a foreign key is a field (or collection of fields) in one table that uniquely identifies a row of another table or the same table.

Lists of values typically aren’t supported in relational databases.

In Datastore, if structured data doesn’t need its own row in a table, you can embed it directly inside another entity using embedded entities.



Comparison with relational databases

While the Datastore interface has many features similar to those of relational databases, as a NoSQL database it differs in how it describes the relationships between data objects. Here's a high-level comparison of Datastore, Firestore, and relational database concepts:

Concept                         Datastore   Firestore          Relational database
Category of object              Kind        Collection group   Table
One object                      Entity      Document           Row
Individual data for an object   Property    Field              Column
Unique ID for an object         Key         Document ID        Primary key

Datastore is ideal for applications that rely on highly available structured data at scale.

There are a few things you can do to an entity. The basic operations are

  • get—Retrieve an entity by its key.

  • put—Save or update an entity by its key.

  • delete—Delete an entity by its key.



All of these operations require the key for the entity.

If you omit the ID portion of the key in a put operation, Datastore will generate one automatically for you.
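The three operations, including automatic ID generation on put, can be sketched with an in-memory toy. ToyDatastore and its method signatures are hypothetical and are not the Datastore client library; keys are modeled as simple (kind, id) tuples.

```python
import itertools


class ToyDatastore:
    """In-memory stand-in for get, put, and delete (not the real client)."""

    def __init__(self):
        self._entities = {}
        self._next_id = itertools.count(1)

    def put(self, kind, properties, entity_id=None):
        """Save or update an entity; generate an ID when none is given."""
        if entity_id is None:
            entity_id = next(self._next_id)
        key = (kind, entity_id)
        self._entities[key] = dict(properties)
        return key

    def get(self, key):
        """Retrieve an entity by its key, or None if it does not exist."""
        return self._entities.get(key)

    def delete(self, key):
        """Delete an entity by its key; deleting a missing key is a no-op."""
        self._entities.pop(key, None)


db = ToyDatastore()
bond = db.put("Employee", {"name": "James Bond"})           # ID generated
db.put("Employee", {"name": "James Bond", "pets": 1}, entity_id=bond[1])  # update
fetched = db.get(bond)
db.delete(bond)
gone = db.get(bond)
print(bond, fetched, gone)
```

Every call after the first put needs the full key, which matches the point above that these operations all work by key. The toy does not guard against a generated ID colliding with one chosen by hand.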

Indexes and queries

In a typical database, a query is nothing more than a SQL statement, such as SELECT * FROM employees.

In Datastore, you can write queries using GQL (a query language much like SQL).

The difference is that Datastore uses indexes to make a query possible (table 5.3).

 

Table 5.3. Queries and indexes, relational vs Datastore

Feature   Relational             Datastore
Query     SQL, with joins        GQL, no joins; certain queries impossible
Index     Makes queries faster   Makes advanced queries possible

In the Gmail example, this means creating an index that stays up to date whenever information changes and that you can scan through to find matching emails.

An index is nothing more than a specially ordered and maintained data set that makes querying fast.


Table 5.4. An index over the sender field

Sender            Key
eric@google.com   GmailAccount:me@gmail.com:Email:8495
steve@apple.com   GmailAccount:me@gmail.com:Email:2441



This index pulls out the sender field from emails and allows you to query over all emails with a certain sender value.

Because the index is ordered, matching entries sit next to each other, so when the query finishes scanning them, all matching results have been found.
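An ordered index like table 5.4 can be imitated with a sorted list and a binary search. This is a sketch under assumptions: build_index and lookup are hypothetical helpers, and the third email (Email:9001) is invented for the example so that one sender has more than one match.

```python
import bisect

# Keys 8495 and 2441 come from table 5.4; Email:9001 is a made-up extra row.
emails = {
    "GmailAccount:me@gmail.com:Email:8495": {"sender": "eric@google.com"},
    "GmailAccount:me@gmail.com:Email:2441": {"sender": "steve@apple.com"},
    "GmailAccount:me@gmail.com:Email:9001": {"sender": "eric@google.com"},
}


def build_index(entities, prop):
    """Return (value, key) pairs sorted by value, then by key."""
    return sorted((e[prop], key) for key, e in entities.items() if prop in e)


def lookup(index, value):
    """Jump to the first entry for `value`, then scan only the matches."""
    i = bisect.bisect_left(index, (value,))
    keys = []
    while i < len(index) and index[i][0] == value:
        keys.append(index[i][1])
        i += 1
    return keys


sender_index = build_index(emails, "sender")
print(lookup(sender_index, "eric@google.com"))
print(lookup(sender_index, "nobody@example.com"))  # []
```

After the binary search, the scan visits only matching entries, which is why the work grows with the size of the result set rather than the size of the mailbox.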

5.1.3. Consistency and replication

A distributed storage system for something like Gmail needs to meet two key requirements: to be always available and to scale with the result set.


One protocol that Cloud Datastore happens to use involves something called a two-phase commit.


Datastore chose to update data asynchronously to make sure that no matter how many indexes you add, the time it takes to save an entity is the same.


When you save an entity, Datastore does the following:

  • Create or update the entity.

  • Determine which indexes need to change as well.

  • Tell the replicas to prepare for the change.

  • Ask the replicas to apply the change when they can.

Later, when you run a strongly consistent query, Datastore does the following:

  • Ensure all pending changes to the affected entity group are applied.

  • Execute the query.



Datastore uses these indexes to make sure your query runs in time that’s proportional to the number of matching results found. A query over the indexes goes through these steps:

  • Send the query to Datastore.

  • Search the indexes for matching keys.

  • For each matching result, get the entity by its key in an eventually consistent way.

  • Return the matching entities.




The indexes are updated in the background, so there’s no real guarantee regarding when the indexes will be updated.


This is called eventual consistency, which means that eventually your indexes will be up to date (consistent) with the data you have stored in your entities.

Listing 5.1. Example Employee entity

{
  "__key__": "Employee:1",
  "name": "James Bond",
  "favoriteColor": "blue"
}

Suppose you run this query right after saving the entity:

SELECT * FROM Employee WHERE favoriteColor = "blue"

If the indexes haven’t been updated yet (they will eventually), you won’t get this employee back in the result.

Queries like this one are eventually consistent, specifically because the indexes that Datastore uses to find those entities are updated in the background.

Now suppose you change this employee’s favorite color.



Data is written to Datastore in objects known as entities. Each entity has a key that uniquely identifies it. An entity can optionally designate another entity as its parent; the first entity is a child of the parent entity. 

Listing 5.2. Employee entity with a different favorite color

{
  "__key__": "Employee:1",
  "name": "James Bond",
  "favoriteColor": "red"
}



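If you run the query again right after this change, you may see different results depending on whether the indexes have caught up yet.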
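The walkthrough with Employee:1 can be replayed in a toy model where entity writes are visible at once but index updates are deferred. ToyEventualStore is a hypothetical class written for this example; it simplifies heavily (for instance, the lookup by key always returns the latest entity) and is not how Datastore is implemented.

```python
class ToyEventualStore:
    """Entities update immediately; the favoriteColor index lags behind."""

    def __init__(self):
        self.entities = {}     # key -> entity dict
        self.color_index = {}  # favoriteColor -> set of keys
        self._pending = []     # deferred index updates: (key, old, new)

    def put(self, key, entity):
        old = self.entities.get(key)
        self.entities[key] = entity
        self._pending.append((key, old, entity))

    def apply_pending_index_updates(self):
        """Stand-in for the background work that brings indexes up to date."""
        for key, old, new in self._pending:
            if old is not None and "favoriteColor" in old:
                self.color_index.get(old["favoriteColor"], set()).discard(key)
            if "favoriteColor" in new:
                self.color_index.setdefault(new["favoriteColor"], set()).add(key)
        self._pending.clear()

    def query_by_color(self, color):
        """Find keys through the index, then fetch each entity by key."""
        keys = sorted(self.color_index.get(color, set()))
        return [self.entities[k] for k in keys]


store = ToyEventualStore()
store.put("Employee:1", {"name": "James Bond", "favoriteColor": "blue"})
before_index = store.query_by_color("blue")  # index is stale: no match yet
store.apply_pending_index_updates()
after_index = store.query_by_color("blue")   # now the employee is found

store.put("Employee:1", {"name": "James Bond", "favoriteColor": "red"})
stale_blue = store.query_by_color("blue")    # stale index still lists the key
stale_red = store.query_by_color("red")      # new value not indexed yet
store.apply_pending_index_updates()
fresh_red = store.query_by_color("red")
print(before_index, stale_blue, stale_red, fresh_red)
```

Note the odd intermediate state: while the index is stale, a query for "blue" returns an entity whose favoriteColor is already "red", and a query for "red" returns nothing.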



5.5.3. Durability


Because Megastore, the storage system underlying Cloud Datastore, was built on the premise that you can never lose data, everything is automatically replicated and not considered saved until saved in several places.


Datastore handles this entirely on its own, meaning that the only setting for durability is as high as possible.


This is the reason global queries are only eventually consistent: data needs to replicate to several places before being called saved.
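The idea that a write only counts once several copies exist can be sketched as follows. Replica, replicated_put, and the required_acks threshold are hypothetical and chosen for illustration; this is far simpler than the commit protocol Datastore uses, and a failed write here may still leave partial copies behind.

```python
class Replica:
    """A toy storage server that may be down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.data = {}

    def store(self, key, value):
        """Return True if the value was stored, False if the server is down."""
        if not self.healthy:
            return False
        self.data[key] = value
        return True


def replicated_put(replicas, key, value, required_acks):
    """Report success only when enough replicas confirm the write."""
    acks = sum(1 for replica in replicas if replica.store(key, value))
    return acks >= required_acks


replicas = [Replica("a"), Replica("b"), Replica("c", healthy=False)]
saved = replicated_put(replicas, "Employee:1", {"name": "James Bond"}, required_acks=2)

replicas[1].healthy = False  # a second server fails
saved_again = replicated_put(replicas, "Employee:2", {"name": "Ian Fleming"}, required_acks=2)
print(saved, saved_again)  # True False
```

Waiting for several confirmations is what makes the write durable, and it is also why a write takes longer than storing a single copy.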

5.5.4. Speed (latency)

Compared to many in-memory storage systems, Cloud Datastore won’t be as fast, for the simple reason that even SSDs are slower than RAM.

Compared to relational database systems like PostgreSQL or MySQL, Cloud Datastore will be in the same ballpark.

As your SQL database gets larger or receives more requests at the same time, it’ll likely get slower.

Cloud Datastore’s latency, by contrast, stays the same regardless of the level of concurrency.

In short, Cloud Datastore certainly won’t be blazing fast like in-memory NoSQL storage systems, but it’ll be on par with relational databases.


5.5.5. Throughput

Cloud Datastore can accommodate as much traffic as you care to throw at it.

The pessimistic locking that comes with relational databases like MySQL doesn’t apply.

As a result, Datastore is able to scale up to many concurrent write operations.

Handling more traffic is a matter of adding more servers on Google’s side to keep up.

Summary


  • Document storage keeps data organized as heterogeneous (jagged) documents rather than homogeneous rows in a table.

  • Using document storage effectively may involve duplicating data for easy access (denormalizing).

  • Document storage is great for storing data that may grow to huge sizes and experience huge amounts of traffic, but it comes at the cost of not being able to do fancy queries (for example, joins that you do in SQL).

  • Cloud Datastore is a fully managed storage system with automatic replication, result-set query scale, full transactional semantics, and automatic scaling.

  • Cloud Datastore is a good fit if you need high scalability and have relatively straightforward queries.

  • Cloud Datastore charges for operations on entities, meaning the more data you interact with, the more you pay.
















