Google IO Bootcamp codelab : Hands on with the datastore

- Ikai Lan


This is an introduction to Google App Engine's datastore. We will be building a Link sharing site. During the course of this code lab, we learn about working with the datastore by writing code. We will cover the following topics:

The completed application will look something like this: ShareSite


To participate in this codelab, you will need the following:


Our link sharing site will allow users to do the following:


Clone or download the repository. You need to have ant installed.

Our data model

The basic data model is fairly simple. We have to record a few pieces of data for each Link:

Part 1: Defining basic persistence methods

Let's go through each of these steps now.

Step 1: Conversion to and from a POJO

Take a look at the toEntity() and fromEntity() methods in These are already defined. Note the use of indexed and unindexed properties: we need indexed properties on every property we intend to query against. Note the implementations below:

In our toEntity() method, a few things are happening:

We'll need to convert the Entity class to Link classes. Below is the implementation of the fromEntity() method:

Fortunately, this is fairly straightforward. getProperty() can be used on any property. It does not distinguish between indexed and unindexed properties. We have to cast the property to the proper type.

Step 2: implement put() and get() in LinkDAO

Let's define the put() method:

Here is the corresponding get() method:

These are basic persistence methods that convert the Link POJO to a low level Entity instance and save the instance.

Part 2: CommentDAO

Let's add comments to our model. Comments belong to a Link. It is not possible for a Comment to associated with more than one Link. One Link may have many Comments. This seems like a very good place to make use of Entity Groups, a hierarchical organization of related data that can be treated as a single unit of data. We'll set each Link to be an Entity Group root and set the ancestor of each Comment to the Link object it references.

CommentDAO's put() method is already defined. It looks pretty much exactly like LinkDAO's put() method, so there's no value in us repeating that exercise. Instead, we'll define 2 methods: createCommentForLink() and getForLink().

The job of createCommentForLink() is to accept a Link, create a Comment that contains the Link as the parent entity, and return the Comment without saving it. While that may seem simple, we'll have to take some extra steps to make it work the way we want to:

In this line:

We ask the datastore to set aside some IDs in the global namespace that contain the parentLink's key as an ancestor. These IDs are guaranteed to be unique and will never be auto-reassigned again, though it is possible for a developer to manually reuse the assigned IDs. The first parameter we pass is the parent Key. The second is the enity Kind. The third parameter is the number of IDs to allocate.

Since we only have one Key returned in this KeyRange, we can fetch it like so:

The last part of this code is standard Java interaction with the Comment POJO.

The code in should now work correctly. Note that we are not transactionally incrementing the numComments field of Link. This has the potential to introduce a race condition if two Comments are saved at the same time. Our code doesn't currently make this transactional, mostly for the sake of simplicity. We would probably want to address this in a future update.

We'll want to implement CommentDAO's getForLink() method. This uses a pretty standard Query interface:

We're really only adding two lines of code here:

The second line is the important one. We're telling the datastore to do an ancestor query by setting an Entity root. Ancestor queries are queries that span a single Entity Group. The unit of atomicity for strong consistency in High Replication datastore, the default option as of 1.5.0, is an entity group, thus it is important for us to organize data that functions as a single unit into entity groups when possible. If we created comments without setting an entity root and queried shortly after an update using a non-ancestor query, the changes may not be visible right away. With entity groups and ancestor queries, we are guaranteed to see changes after they are committed.

Part 3: LinkDAO

To complete the example, we'll need to revisit LinkDAO and finish building it out. We'll add a few more methods. These methods fall into one of two categories. We have the query methods:

We also have the methods for dealing with the view counter:

We'll implement one of the query methods first: getNewest():

The other methods are similar and we won't cover them here. It may even make sense to extract out a method in a future refactoring.

The two view counter methods are the interesting ones. We want to track how often a Link is viewed and we want to be able to sort on the property. A naive implementation would be to simply update the viewCount property on a Link every time a user browses to the page. This will work, but it'll run into a scaling wall very quickly as more users use the site and the view rate of the page exceeds our single entity write write. What if our site starts to attract hundreds or even thousands of visitors a second? With the naive counter implementation, we would either lose data or cause a large amount of write contention, resulting in errors being exposed to end users. Two possible solutions here are:

  1. Increment a counter in Memcache. Periodically flush this counter to the datastore.
  2. Use a sharded counter implementation.

Solution 1 is probably more than adequate for this particular use case, as the risk of Memcache losing our data wouldn't be the end of the world, especially for data tracking view counts. In most cases, simply being in the general ballpark is good enough.

We'll implement solution 2 because it helps us to introduce the principle of sharding. Sharding is the act of taking some data and splitting it across multiple logical entities. By employing sharding, we can linearly scale the total capacity of a single logical entity by adding more capacity. Effectively, we remove a bottleneck. Sharding is an example of an implementation detail - users do not care how their data is persisted, so as long as it is correct and they have access to it. Scalability is not possible without data sharding.

Let's revisit our requirements. We want to be able to:

  1. Sort links by most views
  2. Have a pretty accurate view count stored with each Link so that we can take advantage of the datastore's indexing when we query and sort by the number of views

Our implementation will look like this:

  1. When a user views a link detail page, we write to a sharded counter. Each Link is associated with N number of entities representing a distributed count of the total number of views this Link has recorded. To minimize contention, this counter will be a root entity.
  2. When a user views a link detail page, we create a new task queue task, scheduling it for the next discrete minute and name it as such. When we name a task, it cannot be enqueued twice in a one week period. The Task Queue API will throw an exception. The job of this task queue task is to collect all the sharded counters, total up the count and update the Link entity with the view count.

We're going to make a compromise here: the view counts will not be entirely up to date, since they will be updated once a minute by a background task. This is perhaps the simplest example of eventual consistency. It takes a minute (or more) for the internal state of the Link entity to become “correct”, but it does eventually become correct. This is a compromise we can live with, as it allows us to greatly increase our write throughput.

We'll implement incrementTotalViewCount() first:

We're going to reference the shards by Key. An alternative implementation would use queries and query for all the counter shards, updating the one we want, but doing it this way gives us:

  1. Strong consistency in high replication datastore (gets by key are strongly consistent)
  2. A little bit of speed. Not a lot - just a little. We also only retrieve the data we want
  3. Transactionality. We would not have been able to transactionally read data across multiple entity groups and update it within a single transaction

In these two lines, we've programmatically constructed the key based on the randomly generated shardNumber value, a value between 0 and the number of total counter shards this Link has. By using an unindexed property instead of a fixed value globally, this gives us the flexibility to assign more counter shards for a given entity if it turns out to require an even higher write rate.

Now let's implement the actual incrementing of the counter value:

We transactionally get the counter shard by key. If it doesn't exist, we create a new Entity with the Key we created and set the value to 1.

Now we need to implement the flushCounterShards() method. This is the method that our background task will use to update the viewCount property on the Link entity:

Since we know the total number of view counter shards a link will have, we can build a List of keys to do a batch get.

We iterate through these Keys, retrieving the counter property and adding it to our running total. We update our Link with this total and save it.

There's one last part to our flush implementation: the deferred task that will run the code. Fortunately, this is fairly simple. Open and complete the implementation of the run() method:

This code is being scheduled to run in

The task name ensures that we will not enqueue two tasks with the same name. In this case, we are concatenating the Link ID with the next discrete minute, represented in milliseconds as a unique identifier.

Run the unit tests using

        ant test    

All the tests should now be passing.


The self-study in this lab is to implement two features: - most popular in the last 7 days - most popular today