Google I/O 2009 – App Engine: Scalability, Fault Tolerance..


>>KOCHER: All right. Hello, everyone. Welcome to App Engine Nitty Gritty. We put up one of those Google Moderator things, so if you want to jump over there, there's a tinyurl for App Engine Nitty Gritty; you're welcome to submit questions, and we'll take them from the mics as well. While people are trickling in here, probably grabbing food, I'm just going to tell you a little bit about what we're covering today. One thing: for folks who went to Brett's session yesterday afternoon on scaling apps on App Engine, there's going to be no overlap at all with that one. We'll also be covering things about scalability, but none of the same things; we definitely recommend his talk if you didn't catch it. So, we're going to talk about some things
that are unique about building on App Engine: how, when you're on a system where you expect some errors, you can build something stable and reliable on top of it, knowing that some of those parts can and will at times fail. We'll also talk about some of the lessons we learned and recommendations about how to make your app scale well, and we'll also talk about integrating external services.
In our case, that was Amazon EC2. We'll talk about why we did that and why you might want to; it could be a different service, but the point is how you might have things that don't fit onto App Engine and yet are important to your app. We are from FrontSeat.org, which is a civic software company up in Seattle. We were founded around the idea that software is getting cheaper and cheaper to build, and because of that we can start applying software to civic issues, things where in the past it might have been prohibitively expensive, but with the falling costs there really are some new opportunities out there. I'm Jesse. This is Dave and Josh, and I'm the lead developer
at Front Seat. These guys each run their own consulting companies and have been working
with us for a long time, and they’ve been very involved in all of our work with App
Engine. Before we dive in, I'm going to give you a couple of examples of the kind of thing I mean by civic software. This is the site we did just after the election last fall, when Obama had said, "I want to appoint a chief technology officer for the country," but there was no definition of what that role was, so we put up a site. It went up really quickly; it took us about a day to launch. It let people submit, discuss, and vote on ideas for what that position should be about and what its priorities should be; you can see the top result here was about a combination of net neutrality and accessibility. Earlier, last summer, we did a site where students who go to school in a state other than their home state could say, "I'm from this state; I'm at school in this other state," and the website would tell them where their presidential vote would count more, and then help them register to vote and get an absentee ballot.
So we have a whole bunch of projects; I'm going to skip over these other ones. If you want to know more about the kind of stuff we're doing, you can check out frontseat.org, which links to all of these things, everything from utility bill designs that promote conservation to satire about the payday loan industry. The project we're going to talk about today is called Walk Score. Before we dive into that, the context for all of this is that we're a very small team, we have a lot of projects, and we need to minimize our overhead and our maintenance; that's the context for our use of App Engine. We wanted to bring a civic software service to scale on a budget with a very small staff, with no systems staff to sit around and monitor it all the time. So, on to Walk Score. Walk
Score is a website that lets you measure the walkability of any neighborhood in terms of
access to amenities. So, it’s, you know, how much can you do without getting in a car,
and I'm going to give you a really quick demo of the original Walk Score website. It started out as a Google Maps mashup. I can search for any address here; I'm going to search for the Moscone Center address. You can see, if our network cooperates (my monitor also seems to be a little confused about what size it is, so sorry that this is off-center), as these searches complete, it pulls up a bunch of amenities in the neighborhood and calculates the score, up here at the top, from zero to a hundred, of how walkable that neighborhood is. There's a ton of stuff around here; you can do a lot without getting in a car, so you get a very high score, 98 out
of 100. And on this side here you can see things like the restaurants, the coffee shops,
parks and schools, all these different amenities. We've got integration with Yelp, so if you look at the Yerba Buena area, it'll search for reviews and pop up a little picture there, and you can click through the reviews. We also have Street View, so maybe you want to go get coffee and see, you know, what's the closest thing that's not Starbucks. Jump over here, and I think that's it. Yeah, there's that coffee shop. So, the Walk Score indicates,
you know, how much you can do without getting in a car, and it’s also an indicator of the
vibrancy of the neighborhood. And, this part that I just showed you, this is great for
checking out a specific address, but what if you want to get to know a city at a larger scale? We did some work generating these heat maps, which show you, in green, the most walkable areas, fading out to red for the least walkable. This is the map for Seattle; those who know the city a little will recognize downtown, Capitol Hill, and the University District, and then you have these little pockets down in Columbia City and West Seattle where you have a walkable neighborhood center. This is a great way to get quick insight into the shape of a city in terms of where people actually hang out on the streets and where the street life is. So why are we doing all this work on walkability? Well, it turns out that walkability encapsulates
a whole bunch of great things into one concept that's very easy for people to understand. Here are those benefits very quickly. Climate: when people walk more, they drive less and emit fewer greenhouse gases. Health: people weigh about seven pounds less on average in walkable neighborhoods. These neighborhoods tend to have strong social capital, good transit options, and very few auto-related deaths. There are also a lot of economic benefits: home values tend to be higher, appreciate faster, or, in this climate, fall more slowly. Transportation costs are a big item, about 18% of household income, and much less in walkable neighborhoods. And walkable neighborhoods tend to have very strong local economies where local businesses can thrive. It's also something people rank very highly in choosing where to live: they say walkability, or access to amenities, is one of the top two things, higher than property taxes or schools. So, what we
tried to do is create demand for walkable neighborhoods by educating people about the
benefits and then, fill that demand by providing transparency about the walkability of every
property. And, we want to, you know, help people find these places and also create more
of them. This kind of transparency matters most when people are choosing a place to live, and that means we need to be on real estate sites. The first thing we did for that was build this widget that's very easy for people to embed; it's basically a miniature version of what I showed you before, the rectangular part that can be embedded into a real estate site, and we've seen a lot of adoption of it. We've got about twice as much traffic to that tile as we do to our
website, so it's been really successful. But they often put us down in the kind of "about the neighborhood" section, which is not really the ideal placement. What we really want is this: we want to be right in the primary information about a property: three bedrooms, two bathrooms, Walk Score: 86. And we want "Search by Walk Score." To do those things, we need to build an API that gives scores back to a real estate site very quickly: they just pass us a latitude and longitude, and we give them back a score. They don't want to wait for all those local searches to happen and all the restaurants to load.
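To make that contract concrete, here is a minimal sketch of what such a call might look like from the customer's side. The endpoint, parameter names, and response format are illustrative stand-ins, not the actual Walk Score API:

```python
# Hypothetical client-side sketch of a lat/lng -> score API call.
# The URL, parameters, and response format are illustrative only.
import urllib2

def fetch_score(lat, lng, api_key):
    url = ('http://api.example.com/score?lat=%.6f&lon=%.6f&key=%s'
           % (lat, lng, api_key))
    # The response is assumed to be a tiny payload: either a score
    # like "86", or a marker meaning "not calculated yet, try later".
    return urllib2.urlopen(url).read().strip()

print fetch_score(47.6097, -122.3331, 'YOUR_KEY')
```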
So we're going to jump into the tech stuff now. We'll talk a little bit about why we brought EC2 into the picture and why we didn't just build everything for the API on App Engine. We'll also offer some guidance on making apps scale, and cover some issues that we ran into and thoughts about working within the App Engine environment. I'm going to hand it over to Josh to take us in.
>>LIVNI: Thanks, Jesse. All right. So, talking about the API: this is pretty much our ideal workflow for the entire thing. All we really want is for a request to come in with a latitude/longitude location, go to this magic box, and have it spit out a score response; that's pretty much it. Of course, things get a little more complicated than that, but before we get to the complications, we had to figure out where we were going to host this. We could have run our own servers, so when we were considering where to put it, these were some of the considerations, unfortunately just the negative ones at first glance of App Engine. Some of the things we thought about were vendor lock-
in: if we write code for App Engine, is it portable? It's similar to Django, but we'd have to rewrite some things if we ever had to move. Cost: when we started this and launched it, there was no pricing announced, so we didn't know what we were in for; we assumed it would be a competitive cost to host on App Engine. And it's a beta product, so we didn't know: is it going to potentially go down for six or eight hours and make us look bad? Are we going to have various other problems? After all of this, we decided that Google engineers are probably going to keep things up better than we would with a series of EC2 or other virtual machines, that the cost would be good, and that we weren't super worried about the portability of the code, so we went with App Engine. The next step was to figure out what I considered the core functionality, and for us the core functionality is just this: the ability to return responses really fast, always on. So the approach was to separate out this core functionality from the other pieces, the ones we considered secondary functionality, and to ask: if we put the secondary functionality on App Engine, might it conflict with, or cause to fail, our core functionality of returning responses? Should we put it somewhere else? How do we integrate these other pieces? I'm going to spend a little
time talking about some of these secondary pieces, some that we decided not to put on App Engine and some that we did. App Engine, of course, is really, really good for simple operations: a request comes in, does something basic such as looking up the score, and returns it. It's not so good for certain other things that it's just not designed for. Some
of these things Jesse mentioned, such as the rankings. We have a fairly complex procedure to figure out the rankings: it's not just counting up points in a polygon for different neighborhoods and cities. We bring in all kinds of demographic data and weight Walk Score by population, and App Engine is not really set up for those kinds of geo capabilities. Some folks I know have done some work with this, but we do all of it offline in a PostGIS database. The other piece is: what if we just want to look at the API usage? Who's using what in a day or a week or a month, and where are
the queries coming from? This is Seattle API usage over a given time period, and we want to know a little bit about it. Are people coming in and doing a kind of survey or an academic study, where they'll request a couple of hundred thousand points in a specific area? Or is it people just looking at houses all over the place, and when? This helps us decide where to pre-calculate points. We want to make sure we can respond really fast, so we seed the API with some obvious things: for the top thousand cities, say, we'll pre-calculate all of the possible Walk Scores there. But where else do we go next? By looking at the usage, we could see things like: oh, people query for houses mostly within two miles of a city center. And when we're doing the rankings, are we going to rank over urban areas, or over the metropolitan statistical areas from the census? So, we did this analysis offline and then pre-seeded the cache with places
of interest. All of this pre-calculation brings a complication into that really, really simple workflow, which is that we have to actually calculate a score before we can give it out. So the API gets a little more complicated: if we don't have a score, we return saying "we don't have a score," and then we have to go do something about that. To explain a little of why we decided to offload the scoring portion outside of App Engine, I'll hand over to Dave.
>>PECK: Okay. So, Josh showed us a number
of the reasons that we couldn't build the Walk Score API strictly on top of App Engine, and I'm going to talk through sort of the central reason we had to make use of Amazon EC2, which is that, as Josh hinted, the Walk Score calculation takes time. Calculating the Walk Score is not CPU intensive, but it is rather I/O intensive: at a minimum, Walk Score requires us to make 17 URL fetches to Google local search, and potentially to talk to other services: census services, geocoding, et cetera. Now, if you're an individual user and you go to our website and type in an address, your browser and our JavaScript will do most of the work for you. But if you're a real estate company, what you really want is programmatic access; you want us to do the work of calculating Walk Score on your behalf, and that's why
we built this API. From a customer's perspective, the request/response cycle, as Josh mentioned, is: give us the latitude and longitude, and get back a score if we've already calculated it; otherwise, we queue it up. We've actually built a reliable message queue abstraction on top of the App Engine data store to hold the latitudes and longitudes that we haven't yet calculated. Of course, at some point we have to turn around and actually service that queue, and so we looked at a few options when we were building our API.
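As a rough idea of what such an abstraction can look like, here is a minimal sketch of a data-store-backed queue; the model and helper names are our own inventions for illustration, not the actual Walk Score code:

```python
from google.appengine.ext import db

class QueueItem(db.Model):
    """One latitude/longitude awaiting score calculation."""
    lat = db.FloatProperty(required=True)
    lng = db.FloatProperty(required=True)
    enqueued_at = db.DateTimeProperty(auto_now_add=True)

def enqueue(lat, lng):
    # Using the point itself as the key name makes enqueueing
    # idempotent: the same point is only ever stored once.
    QueueItem.get_or_insert('%.6f:%.6f' % (lat, lng), lat=lat, lng=lng)

def peek(batch_size=50):
    # Oldest items first; the calculator deletes them once scored.
    return QueueItem.all().order('enqueued_at').fetch(batch_size)
```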
The first option we looked at was App Engine cron jobs. As you know, App Engine cron jobs allow you to regularly ping an App Engine URL handler, but that URL handler is subject to many of the same restrictions as standard URL handlers on App Engine. In our case, making 17 URL fetches is probably not going to happen within the deadline that App Engine allows for a request/response cycle, so unfortunately cron jobs didn't look like a particularly good answer. The second thing we looked at was whether App Engine supported background tasks, and obviously, if you were at Brett's talk earlier, you'll know that the new task queue API is going online in a few weeks. We're very excited about that, and we believe we can move a lot of our Walk Score calculation functionality onto App Engine. But six months ago, when we started building this API, background tasks weren't available on App Engine, so they weren't really an option for us. So, where
did we turn? Well, we turned to Amazon EC2. For those who don't know EC2, it's Amazon's API for spinning up virtual servers; you have full control of the machine and can choose whatever operating system image you want. For us, what it gave us was a place to do background processing, with the ability to do an arbitrary number of URL fetches, and of course arbitrary I/O in general, since you own the machine. And the design of our calculator is very parallel: we have lots of processes running, working on different latitudes and longitudes, making different network requests, so we needed a place to run lots of processes in parallel. In other words, the Walk Score API is built with both App Engine and EC2, and when you start to build a service with multiple cloud computing environments, there are a few important considerations to keep in mind. So, I just wanted to show this perspective.
Here we've got the customer's code on the left, our App Engine code in the middle, and our Amazon EC2 code on the right. Here's a customer request where we've already calculated the score: they give us the latitude and longitude, we check that it's already calculated, and we package up a response and send it back to them. And here's the other type of customer request, for a point we haven't actually seen before. The details of what we do on the App Engine side aren't really important, and as Josh mentioned, it's not very complex. But the important thing to see here is that during a customer request/response cycle, Amazon EC2 is never touched. Amazon EC2 is simply our queue-servicing code, and customer requests never get to Amazon. That's a really important design point that I'd urge you to think about if you're architecting an API across multiple cloud services. For us, what it means is that our uptime and our scalability are not a function of both App Engine's and EC2's uptime; rather, our downtime is strictly tied to App Engine alone. If Amazon EC2 goes offline, all it means is that it'll take a little longer for us to process the requests in our queue. So, we've architected this API and it's been
running for the last six months and I want to talk a little bit about the behavior that
we've seen during that time. This is sort of the bottom-line thing I'd like to stress: of course, App Engine is a beta service, so some amount of unpredictable performance is predictable. We saw a number of concerning things along the way, which have really smoothed out over this beta period. The number one thing we struggled with while running our API in production is high data store contention. By contention, what I mean is that when we make a fetch or a put to the data store, we see a timeout; so contention is the percentage of timeouts we saw in a given set of requests.
In particular, where we saw contention was in accessing our queue data structure. Obviously, customer requests are coming in at a rather rapid rate, adding new latitudes and longitudes that we haven't calculated yet to the queue; at the same time, our calculators are trying to pull them off, compute them, push the results back to App Engine, and remove those entries from the queue once they're calculated. A typical day for us today has about a half-percent failure rate on reading from the queue. But just two days ago, at around 1:00 a.m. I think, we saw App Engine contention rise to about 50% or 60% for about six hours, a really big surprise, and something we had to architect for on the Amazon EC2 side where we're servicing our queue: we buffer up a lot of latitudes and longitudes there, so that if we're unable to get data from App Engine for a while, we can still continue to calculate Walk Scores. So, another thing
that can happen, especially during the beta period, is that the data store goes offline or simply goes into read-only mode. For those of you who have been running applications over the last six months, you'll know that this has happened from time to time. What you normally see in this case, when you make a data store request, is a "capability disabled" exception.
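In the Python runtime, that maps (to the best of our knowledge) to CapabilityDisabledError; a minimal sketch of degrading gracefully might look like this:

```python
# Sketch: tolerate the data store being in read-only mode.
from google.appengine.runtime import apiproxy_errors

def safe_put(entity):
    try:
        entity.put()
        return True
    except apiproxy_errors.CapabilityDisabledError:
        # Data store writes are disabled (e.g. read-only maintenance);
        # degrade gracefully: skip the write or remember it for later.
        return False
```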
Another thing that we've seen a lot is that App Engine response time increases, and we measured this on the EC2 side: as our calculator pulls new points from our App Engine queue, those requests normally take about half a second to a second, but every once in a while the response time goes up to several seconds; in other words, latency increases. App Engine is a bit of a black box, so it's hard to always understand why this is happening: with no changes to your code and no changes to your underlying data models, you might still see it. This is a very rare occurrence, but it has happened once or twice in the last half year. So, with those
performance issues in mind, what can you do to make sure that your application scales as far as you need it to? What I'm going to talk about here are sort of the steps you can follow to make your application scale as far as you'd like. These steps are distilled from our experience building the Walk Score API, but also from building other APIs on top of App Engine that we're working on. They may not apply to your situation; they're just good rules of thumb to keep in mind as you're building your application.
Scalability on any system is really about stair steps: you do a certain amount of work and you get to the next step; getting to the step above that might require substantially more work on your part. At the bottom rung for App Engine, when you're just getting your feet wet, here are some of the things your code might have. First, inconsistent model design: either your models are very large and you only access one or two properties at any given time, or your models are very small and distributed and you end up accessing a cluster of different models in a single request. In either case, you probably want to rethink the design of your models and shape them a little differently. This is a characteristic I've seen in just-starting-out App Engine applications, especially from people who've worked in the SQL world before and are just moving over.
Next, uneven or no memcache use: it's a lot of fun and a lot easier to write your application without thinking about memcache at first. Unfortunately, pretty much all good App Engine applications are going to make some, or very heavy, use of memcache early in their life cycle. The final thing: for those who don't know, every time you make a data store fetch or put, you're effectively making an RPC request somewhere in Google's data center. You're able to batch these requests, but it can be difficult to design your code that way, so a lot of early-stage code that I've seen doesn't do it. So, with
this sort of naive style of App Engine coding, we've been able to see something like five queries a second handled, which is actually rather amazing. I should point out that this number is from our experience; depending on the type of application you're building, you may see something different, but it's approximately what we saw when we were starting out with the Walk Score API. Five queries a second is over 10 million requests a month. That's a really rather large application, and it speaks to the power of App Engine as a platform to get you scaling very fast right out the door. So, where do
you go from there once you want to get past that five or so queries a second? Well, the very first thing I'd urge you to think about is starting to use memcache, and the easiest way to use memcache is just to slather it on everywhere you read data from your data store. The basic behavior, of course, is: if you're going to read an entity from the data store, check whether you've cached it first. If you have, great, you're done; if you haven't, read it and add it to the cache. And be sure, when you're writing back or updating that entity in the data store, to either clear the memcache entry or update it. If you go and look at the Google App Engine cookbook, somebody uploaded, I think about a month or two ago, a really great shim for the App Engine data store that just causes memcache-on-read to happen everywhere; as a first step, that might be something to look into.
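A minimal sketch of that cache-on-read pattern, assuming entities are looked up by key name (the helper names are ours, not the cookbook shim's):

```python
from google.appengine.api import memcache

def cached_get(model_class, key_name):
    """Read-through cache: try memcache first, then the data store."""
    entity = memcache.get(key_name)
    if entity is not None:
        return entity
    entity = model_class.get_by_key_name(key_name)
    if entity is not None:
        memcache.set(key_name, entity)
    return entity

def cached_put(entity, key_name):
    """Write-through: keep memcache consistent with the data store."""
    entity.put()
    memcache.set(key_name, entity)  # or memcache.delete(key_name)
```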
Another thing to think about is batching your data store requests where it's easy. "Easy" for us was anywhere we had non-nested loops: anywhere we had a loop in which we were putting single entities, we inverted it, so now we call one put with lots of entities. I should mention that batching is a little bit tricky, because you can only fetch up to a thousand entities at a time from the data store, and puts are limited based on the size of the entities you're putting in; a good rule of thumb is 50 at a time. So you may need multiple batches depending on the size of the operations you're performing.
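In code, the inversion looks roughly like this; the chunk size follows the rule of thumb above and would need tuning to your entity sizes:

```python
from google.appengine.ext import db

# Before: one RPC per entity.
#   for e in entities:
#       e.put()

# After: one RPC per chunk of entities.
def batch_put(entities, chunk_size=50):
    for i in range(0, len(entities), chunk_size):
        db.put(entities[i:i + chunk_size])

# Reads batch the same way: db.get(list_of_keys) fetches many
# entities in one RPC, subject to the 1,000-entity fetch limit.
```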
With that work, we saw pretty much a doubling in our ability to handle load, which is, again, really rather impressive for not that much engineering work on our part. But we wanted to go a lot further; of course, the running Walk Score API sees substantially more than 10 queries a second on any given day. So what did we have to do next? The next tips are really a grab bag of things you might want to think about as you really scale out your application. The first is that I'd urge you to think long and hard about how your data store is accessed and what types of usage patterns you see.
Memcache is really great for two types of usage patterns in particular. One is repeated requests for the same data, and the other is predictable sequential access to data. For repeated requests for the same data, if you know your customers are going to hit that same entity again and again, "cache on read," like we discussed in the previous step, is really a great strategy. But consider predictable sequential access: for example, in the case of the Walk Score queue, we know we're going to pull those queue items out in order, because our calculator code is going to request them in a certain order. Our calculator talks to App Engine and requests 50 items at a time from the queue, but on the App Engine side, we actually pull a full thousand items from our queue and put that entire list in memcache, which obviously cuts down substantially on the number of data store accesses we need to make. So, think about memcache usage patterns.
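A sketch of that pattern, reusing the hypothetical QueueItem model from earlier (and ignoring concurrent readers for simplicity):

```python
from google.appengine.api import memcache

PREFETCH = 1000  # items pulled from the data store per refill
PAGE = 50        # items handed to the calculator per request

def next_queue_page():
    """Serve queue pages from a memcached buffer, refilled in bulk."""
    items = memcache.get('queue_buffer') or []
    if len(items) < PAGE:
        # One big data store query instead of twenty small ones.
        items = QueueItem.all().order('enqueued_at').fetch(PREFETCH)
    page, rest = items[:PAGE], items[PAGE:]
    memcache.set('queue_buffer', rest)
    return page
```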
Batch all of your data store calls; this went a long way for us in scaling outward, and for us what that meant was unraveling nested loops. Sometimes they were loops that crossed method boundaries, so we had to flatten a lot of our code, but it really helped us a lot. As important as using memcache carefully is, sometimes using memcache is just not the right thing to do, depending on your access pattern: it may not provide a very meaningful barrier between your users and the data store, and memcache is a limited resource. You obviously don't want to populate memcache with entities that aren't useful to you and lose the entities that are. And when you add items to memcache, you can, of course, control how long they stay there; sometimes that's a great way to keep memcache pressure low.
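Concretely, memcache lets you pass an expiry when you set a value, so short-lived data ages out on its own; a one-line sketch:

```python
from google.appengine.api import memcache

stats = {'reads': 120}  # some transient value worth caching briefly
# Expire after 5 minutes so it stops competing for cache space.
memcache.set('recent_queue_stats', stats, time=300)
```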
Then, I would urge you to load test your application. We actually built a load testing harness on Amazon EC2; we're able to hit our running Walk Score instances with over a hundred queries a second, and that's real load, talking to real URL handlers that do real work with the data store. There are third-party products to help you with load testing too. App Engine's behavior under load tends to be surprising once you get pretty high, past 10 queries a second.
And the last thing, of course: monitor your performance. The front line of defense for monitoring performance is the App Engine Dashboard, which is a great place to go, along with your logs, where you can look at individual requests that, for example, gave you a data store timeout. The system status site, if you're seeing a lot of latency, is also a good place to look at the overall behavior of App Engine. We actually built our own performance dashboard for the Walk Score API, which we're happy to show you after this talk if you'd like; it monitors specific things about the behavior of our
EC2 calculator code's communication with App Engine. And just sort of one last technique, one that I think everybody getting started with App Engine coding should know, which is: if at first you don't succeed, try, try again. What we have here is basically an attempt to write to the data store; if it times out, we turn around right away and write again. Just a few milliseconds later, you may not get a timeout exception. This is something we do everywhere in our code base now, on both the read and write sides, and it's actually extremely helpful.
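We can't reproduce the slide's exact code, but the pattern is simple; a sketch of a retrying put, using the data store's Timeout exception:

```python
import time
from google.appengine.ext import db

def put_with_retry(entity, attempts=3):
    """If a data store write times out, try again almost immediately."""
    for attempt in range(attempts):
        try:
            return entity.put()
        except db.Timeout:
            if attempt == attempts - 1:
                raise            # give up after the last attempt
            time.sleep(0.005)    # a few milliseconds is often enough
```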
So, I've talked a little bit about general principles for scaling out App Engine applications. What I want to do now is turn it back over to Josh, who's going to talk about one specific instance where we had some difficult scalability issues to tackle and the sort of unorthodox technique we used to solve them. So, thanks.
>>LIVNI: Thanks, Dave. So, yeah, I started
out talking about core functionality versus secondary functionality. Obviously, your applications are going to have different pieces of secondary functionality that may or may not fit in different places. A really common use case on App Engine is counting stuff, and I'm going to talk about a specific issue we had counting our users' requests. We have a basic quota system keyed on an API key: we want to know who's doing what, so certain folks might get, you know, a couple of hundred thousand requests a day, and others might get maybe a couple million a day. But we want to make sure people can't just run away and abuse the system, and, again, to understand who's doing what and where. So, counting every request that comes in and matching it to users should be pretty straightforward, and the answer that's usually given for this is: just use sharded counters. Every time a request comes in, write to one of the shards and you're good to go. But if requests are coming in at, you know, a hundred requests a second, that's a lot of shard writes, and you're going to hit a lot of data store contention.
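For reference, the standard sharded counter recipe looks roughly like this (a sketch along the lines of the usual App Engine example, not our production code):

```python
import random
from google.appengine.ext import db

NUM_SHARDS = 20

class CounterShard(db.Model):
    name = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(name):
    """Transactionally bump one random shard, spreading write load."""
    key_name = '%s-%d' % (name, random.randint(0, NUM_SHARDS - 1))
    def txn():
        shard = CounterShard.get_by_key_name(key_name)
        if shard is None:
            shard = CounterShard(key_name=key_name, name=name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_total(name):
    return sum(s.count
               for s in CounterShard.all().filter('name =', name))
```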
So, we found that these sharded counters didn't scale as well as we had hoped. In our case (this was from about six months ago, and things are a bit different now), around 30 to 40 queries a second our response times started creeping up to two, four, six-plus seconds, error rates climbed along with them, and then all of a sudden everything was 503s. This is a really big problem, because core functionality should not be hurt by this icing-on-the-cake feature of counting who's doing what. We were finding that just because we wanted a quota system, nobody could even get a score, because all the requests were bombing out entirely. So, the solution that we came
because all the requests are just bombing out entirely. So, the solution that we came
up with, how many of you guys were in the talk just before this, at Brett’s talk next
door? So a few guys, so, one of the things I thought really interesting example of the
new task queue was the backend sort of writing to a cache. We implemented something–similar
concepts. Dave talked a little about using memcache cleverly on read and this is using
memcache on write which in the Google group discussions in other places, people are going
to say “Oh, don’t do that, you know, memcache unreliable. Things might go away.” But in
my opinion, there are certain cases and this is one of them for us where you don’t absolutely
need to have exact accuracy. If you’re writing a banking application and counting people’s
pennies, you know, don’t do this. But for a lot of cases, if you’re just getting a general
idea of a quota, how many things are, you know, generally around at a given time. And
again, knowing things might fail, you have other processes to come in place and check
things happened accurately, do things twice and so forth. Using memcache on write, it
can be a really interesting idea. So this diagram, basically, it shows that request
comes in with the API rather than writing your sharded counter, data store counter,
we write to the sharded memcache counters. When the memcache counters fill up to 100,
to 100 whatever, only then do we write that bit of information to the data store. And
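A minimal sketch of what we mean, with the flush threshold and the persistence helper (persist_count) as illustrative stand-ins:

```python
import random
from google.appengine.api import memcache

SHARDS = 20
FLUSH_AT = 100  # write through to the data store every ~100 counts

def count_request(api_key):
    shard = 'quota:%s:%d' % (api_key, random.randint(0, SHARDS - 1))
    memcache.add(shard, 0)    # no-op if the shard already exists
    n = memcache.incr(shard)  # atomic increment
    if n == FLUSH_AT:
        # Clear the shard first, then persist. Counts arriving in
        # this window are dropped, so any error is a slight
        # undercount, never an overcount (the trade-off we wanted).
        memcache.delete(shard)
        persist_count(api_key, FLUSH_AT)  # hypothetical: a data store
                                          # counter or task queue write
```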
The task queue makes this a little nicer: as in the talk just before this one, you can put a task queue between the memcache counters and the data store, and it's a lot faster to write to the upcoming task queue than we were certainly expecting when we started writing this. But you still have the same issue, whether you're writing to a task queue or to the data store: there's a small time period during which a request can come in and increment your memcache shard after you've read the count but before you've written it out and cleared the shard. So you just have to keep track of what you're writing to the data store versus the counts that came in before you got confirmation; the same goes for the task queue. So, as I mentioned, the advantage is really nice: instead of, say, a hundred data store hits a second, you get a couple of orders of magnitude fewer, and that just scales up really, really nicely. But the disadvantage (I don't know where my disadvantage slide went) is that
there’s a possibility you could lose a little bit. And so for our use case, we were okay
if it was off a little bit, as long as we didn't overcount. We didn't want to accidentally charge someone quota they didn't use, so we were just really, really careful that, in case of a loss of data, it's always a slight undercount. And again, memcache has proven really, really reliable; there were a couple of two- or three-hour knockouts where we just lost counters, but aside from that, as a general rule, you can pretty much rely on it. So, after we implemented the memcache counters, things went like this, and again,
this graph’s from a while ago, these days, it’s smoother and things stay well under a
second for the response time, but it’s really a huge difference. What it means is we can
actually count stuff now at a very, very high rate per second, and not have to worry about
hitting these issues where our core functionality is compromised. So the summary here is: it's not necessarily a bad idea to think about using memcache on write. Again, the possibility of lossiness is always there, but in many cases that you might come across, it can save you not only data store contention but also some money, because you're charged for CPU time on the data store, and staying out of the data store can add up to a few bucks a day over time. The other piece is to decide early on: is this functionality a good fit for App Engine? Certain things we could have offloaded; for example, we could have parsed the quota logs offline every couple of hours and come back. We decided, "This is a good piece to fit on App Engine; we ought to be able to count stuff." Other things we decided, no, this piece, although integrated with App Engine and the API, belongs on our own box. So just deciding which pieces fit on App Engine is important. And the final one is, of course, when you're putting those pieces onto App Engine, make sure you write them in such a way that, should strange and weird things happen, like all of a sudden you get popular, hit 80 queries a second, and data store contention strikes, the features you wanted as an option don't compromise the really core functionality of your application. And so with that, we'll turn it back to Jesse to talk about some of the overall results.
>>KOCHER: So, you've seen some of the stair steps to scalability that Dave went through; where has all that gotten us? Well, the three of us, a very small team, have built an App Engine API that handles over 1.5 million requests a day, with peak rates up to about 80 requests per second, quite a bit higher than where we started. And we've come to think of App Engine as really
being a black box. It’s a place where things may shift in, you know, unexpected ways internally
and you may not always know what’s going on in there. But it’s also very well documented
and not just the main documentation online, but also the dynamic support through the IRC channel, the Google Groups, and the issue tracker; the App Engine team really does frequent all of these places. The degree of communication is pretty amazing, and it makes working inside of a black box considerably less daunting. So, our amended statement would be: it's a
very good black box. I think looking back, we would make the same choice that we made,
you know, not knowing a lot of the things that we’ve learned along the way, and I can
say that having watched App Engine progress over these last few months, the work that
we have done would be considerably easier starting now than it was starting back then.
So we hope that this talk has helped you understand some of the risks and rewards of using App
Engine, and some things about how you get apps to scale well. We've put up our contact info for a second here, so if you want to get in touch with us, you can jot that down. We've got some time for questions here, and we'll also be out at the demo spot for most of the rest of the afternoon, until about 4 o'clock, so if you
want to catch us there, please do. We'll move into Q and A now, and we're happy to take questions on anything. We'll take them at the mic, and I'll also flip over to the Moderator site in a few minutes. But before I do that, I want to put up a few things that we didn't put into our talk but definitely have more to say about, so if any of those interest you, feel free to ask about them. And then, Google sent us a note earlier today asking us to direct you all to haveasec.com/io; if you go there, you put in the time slot and the session, and you can give feedback, so we'd appreciate any feedback
you have for us. And with that, we will move to questions, so go ahead.
>>Two-part question about your counters. Why did you still shard them when you put them on memcache?
>>KOCHER: There actually can be contention with memcache when you're hitting it very intensively.
>>At how many requests per second do you hit that, or…?
>>LIVNI: Well, there are a couple of reasons we sharded the memcache counters originally. One is that possibility of contention, and the other is that it's cheap to make lots of memcache shards. One of the issues I mentioned in that slide is that as you're writing a shard out to the data store, you might not get a response back for half a second or four seconds, and during all of that time you have a lot more requests coming in. In order to keep the count as accurate as possible, we wanted to minimize that very, very small timeframe between when we clear the memcache shard, saying "Okay, we got a confirmation back," and when we've written it out. We don't want 30 new requests coming in in that timeframe, because there is still a little bit of latency even writing to memcache, though it's much, much faster than writing to the data store. By making a lot of shards, at essentially no overhead for the system, we can spread out the requests arriving in those couple of milliseconds, so it's less likely we'll lose something. Does that make sense?
>>Okay. The second part is: before this morning,
I, you know, I didn’t really know how many memcache instances Google would have for my
App. But now, it sounds like I have to assume if I write to a key, it will always be the
same one memcache because, otherwise, these counters wouldn’t work. You know, if I have
one instance of my Java app running in Australia…
>>PECK: That's correct; memcache is a distributed API, and you don't have any knowledge of what CPU your code is executing on inside the App Engine data center.
>>Right. But now, it sounds like I have to guarantee that when I write to memcache, all
my 500 App instances write to the same memcache instance.
>>PECK: Yeah, you’re effectively writing to the same store. That’s correct.
>>Do you think that will be future proof?
>>PECK: I don't know how they propagate that data across data centers, or how long it takes. Not that you can specify this when you're writing an App Engine application right now, but if some of your code were running in Australia and some of it here, I don't know how long that would take to propagate. But yes, it looks like you see a consistent view of the memcache world regardless of where your code is.
>>I'm just surprised that that's future proof, because, you know, once you get to Facebook scale, maybe you need multiple memcache instances.
>>PECK: Yeah. Yeah.
>>For the same write.
>>LIVNI: Yeah, I mean, we've had good luck with it, but I'd say that's a really good question for the folks who actually built it; they'd have better answers than we would.
>>KOCHER: Also, memcache has a certain size, and the more you put in there, the more you increase pressure on your memcache and the sooner things will expire from it. So there's really a balancing act in figuring out which things to put in there, and the size of your entities matters when you're putting things in. In choosing that, you're making a lot of decisions that will affect how available that data is when you go back to look for it later.
>>LIVNI: There's actually one small
thing I should mention on that, which is that when things get evicted from memcache, I believe it's first created, first out, not least recently updated, out. And so, when we refresh our memcache key, I believe we delete the key and then recreate it, rather than just updating it. That way it's got a fresher date and is less likely to get evicted should more contention happen in memcache.
>>I have a couple of questions. First, it's
about your API. You had a very simple diagram, but for the queuing that you do when you don't have the latitude and longitude: does your API just return "queued up"? I mean, for queries for a latitude and longitude that you haven't calculated, do you just return "queued up"?
>>PECK: Yeah, that's right. Our API tells our customer to try back later, basically. If you think about the typical use case of embedding
in a real estate site, what that means is the customer doesn't display the Walk Score at that time. But the next time one of their users comes and looks at that same property, we'll probably have that Walk Score calculated and ready for them.
>>Okay. And the other question is about this metric of queries per second that you used when you were describing the techniques to speed things up, like using memcache and everything.
>>PECK: Yeah.
>>These queries per second, I'm assuming, are data store queries, right?
>>PECK: Oh no, when I was showing those queries-per-second numbers, what I was showing is the number of successful URL requests that someone outside of the App Engine data center can make to our application and have handled correctly. And those numbers were rough; they won't apply to your application exactly, but that's approximately what we saw in the development of the Walk Score API.
>>LIVNI: They're also a little bit better
now.
>>PECK: Yeah, they really are.
>>LIVNI: I think if you wrote the same app that we did early on, you would get a much higher queries-per-second rate today than eight months ago.
>>PECK: Yeah, I mean, the bottom-line message is that even with naive code, you can actually scale pretty far, and that's just pretty impressive.
>>I have two questions. The first one is: are you now running your entire production front end off of App Engine?
>>PECK: Yes, we do.
>>So, www.walkscore.com is actually…
>>KOCHER: No, we're only doing the API there. We are considering moving other parts, either the tile or potentially even the whole website, over. You could mention the other stuff quickly.
>>PECK: Yeah, just yesterday, actually. The Walk
Score website is written in PHP, and as you know there's a JVM now on App Engine, and there's a company here called Caucho which makes a PHP implementation for the JVM. Just yesterday we were able to port our entire PHP website over to App Engine, and we've only started testing it, but it looks like it's extremely performant, which is impressive; it's a lot faster than running Apache on a private server at some random ISP. The API itself, however, is of course all Python App Engine.
>>And the second question is: have you run into any of the big limits of App Engine in terms of storage and access, which you mentioned in your last slide? Have there been any big things where you ended up being billed for a lot more than you expected?
>>KOCHER: We haven't. As early, high-traffic users, they gave us access to go beyond those initial free quotas before billing was enabled. Now that billing is enabled, there's a lot more flexibility; you can use a lot, so we haven't really run into any of those limits. I think the things that are interesting are the per-request limits on how much CPU time and how much wall-clock time you can use. Some of our EC2-to-App-Engine communication uses way more CPU time than App Engine is really happy with, but because those calls are much less frequent than the public calls, which come in at a much higher rate, we can manage them, and we use some backing-off techniques: if EC2 has trouble communicating, it can say, "Oh, we're going to stop talking to App Engine for a little while and stop hitting it with these CPU-intensive requests."
>>PECK: Do you want to take this one?
>>KOCHER: Sure.
>>PECK: So, a question from the Moderator site: considering that calculating Walk Scores requires considerable time, would we consider using web hooks? Actually, I was in Brett's talk just previous, and the answer is yes, I think we'd love to use web hooks. We haven't used them yet, but it might be something for us to look at. And certainly, for those who were there, I think you probably guessed that a lot of our calculator code fits nicely with the task queue API that he described.
>>KOCHER: Yes, go ahead.
>>Could you please elaborate a little bit on how you implemented the queries to App Engine and the computationally intensive part on Amazon EC2?
>>PECK: Sure, I can dig a little more into the design of the calculator. The calculator is, of course, running completely on EC2, and it's actually running on a single machine, but as lots and lots of processes.
>>KOCHER: Do you want to jump to this slide here?
>>PECK: Oh, actually I wanted to jump to
that one. So, because of the unpredictable performance reading from the queue we've implemented on the App Engine side, the key thing is that we wanted to decouple the computation of the Walk Score from the I/O requests we make. What we have is a master process that spawns a ton of slaves. Some of those slaves are responsible for talking to App Engine and requesting new latitudes and longitudes that we need to calculate, and we keep a rather large buffer of those latitudes and longitudes on the Amazon EC2 side, so that if we can't talk to App Engine for a while, we can continue to calculate scores. So that's one major thing: we basically decoupled our I/O from our computation
on the EC2 side, and we buffered all our I/O. And one last point, which Jesse alluded to, is that we dynamically respond to changing I/O conditions: if contention is very high for more than, say, 10 or 15 requests in a row, we actually back off and stop talking to App Engine for a while. For various reasons based on the internal design of the data store, that can alleviate contention; we come back 20 minutes later and, in the meantime, just work through the data
that's buffered on the EC2 side. So, that's sort of the big picture of the design of the calculator.
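As a rough sketch of that buffer-plus-backoff idea (pull_from_app_engine is a stand-in for the HTTP call to our queue handler, and the thresholds are illustrative):

```python
import time
import urllib2

BUFFER_LOW = 200        # refill the in-RAM buffer below this size
MAX_FAILURES = 15       # back off after this many failures in a row
BACKOFF_SECONDS = 1200  # leave App Engine alone for ~20 minutes

work_buffer = []        # latitudes/longitudes held in RAM

def pull_loop():
    """One slave's pull loop: fetch work, back off when contention hits."""
    failures = 0
    while True:
        if len(work_buffer) < BUFFER_LOW:
            try:
                work_buffer.extend(pull_from_app_engine(50))
                failures = 0
            except urllib2.URLError:
                failures += 1
                if failures >= MAX_FAILURES:
                    # High contention: stop talking to App Engine for
                    # a while and let calculators drain the buffer.
                    time.sleep(BACKOFF_SECONDS)
                    failures = 0
        time.sleep(1)
```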
>>All right. So, it was roughly one database, and you used rsync or something like that between…
>>PECK: Oh, so actually App Engine is our data master; all of our data is there. All of the points that we're working on on the EC2 side are simply held in RAM. If that process crashes, it's not the end of the world; we basically might redo a few points. Actually, the EC2 code is also Python, so if we exit out of a process, we just pickle out all the current data.
>>Okay. Thanks.
>>KOCHER: Any other questions?
>>PECK: Any other questions?
>>KOCHER: That was the only…
>>LIVNI: I think that was it.
>>PECK: Oh, thank you.
>>LIVNI: Thanks, guys.
>>KOCHER: Yup. Thanks.
