At Financial Engines, we have been decomposing our monolith and moving toward a microservices-based architecture. This architecture allows us to scale and adopt a polyglot programming model, where we can choose the best technology for each problem rather than just the technology at hand.
To continue to scale and meet our partners’ requirements, we also needed to expose our APIs externally. As part of that effort, we went through an API gateway evaluation and implementation journey, learned many things along the way, and want to share them with you.
Having built a monolithic system and gone through the pain of trying to externalize it, we wanted to use an API-first development model. This model typically involves starting with an API-based design and exposing internal and external microservices via APIs. It reduces tight coupling between subsystems and allows each one to evolve independently.
In our case, our APIs are consumed by a variety of internal and external clients, such as internal microservices, Single Page Applications (SPAs), internal mobile apps, and partners.
Selecting the right API gateway product is very important because it becomes a central component of your infrastructure. There are many enterprise API management solutions available, with varying levels of features and functionality. Selection involves understanding your current and likely future requirements, then choosing the product best suited to those needs.
When we started the API gateway evaluation 2 years ago, we were looking for the following features:
- Our ecosystem fit such as an ability to interface with our Identity and Access Management (IAM) solution
- Mature gateway features
- Ease of scale
- Minimal latency overhead
- Out-of-box analytics and support for custom analytics
- Cost of ownership
- Good developer community
Of the products we evaluated, Apigee Edge was the most feature complete. In addition to having all of the features listed above, it had the added benefit of being a hosted service, so we wouldn’t need to maintain it ourselves. We also evaluated AWS API Gateway, but it was quite new at the time and we knew it would take a few more years to mature. Fast forward two years: AWS API Gateway has added many compelling features but is still maturing. That said, we do recommend AWS API Gateway for organizations whose infrastructure is already in the AWS cloud, since it really simplifies the overall architecture there.
Once we had selected Apigee, the next step was to hook it up to our IAM solution. Apigee supported the OAuth2 framework out of the box. We developed a wrapper around Apigee’s OAuth2 implementation, which let us support additional grant types and integrate with our IAM solution for standard grant types such as the password grant. One extension grant type we added issues tokens in exchange for a SAML assertion.
Apigee supports attaching additional information to OAuth tokens, so after authentication it is easy to extract that information and pass it to the back-end resource servers as JSON Web Tokens (JWTs). We also used OAuth2 scopes to restrict access to our APIs based on the client. This feature was quick to implement thanks to Apigee’s intuitive user interface.
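On the resource-server side, a scope check against the forwarded token can be sketched as follows. This is a minimal standard-library illustration using a shared-secret HS256 JWT; the secret, claim names, and scope values are our own placeholders, not Apigee’s actual token format:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-shared-secret"  # placeholder; real deployments use proper key management


def _b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")


def encode_jwt(claims: dict) -> str:
    """Build a minimal HS256 JWT, standing in for what a gateway might forward."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = _b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    return (header + b"." + payload + b"." + sig).decode()


def has_scope(token: str, required: str) -> bool:
    """Verify the signature, then check the space-delimited 'scope' claim."""
    header, payload, sig = token.encode().split(b".")
    expected = _b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload + b"=" * (-len(payload) % 4)))
    return required in claims.get("scope", "").split()


token = encode_jwt({"sub": "client-123", "scope": "accounts.read holdings.read"})
assert has_scope(token, "accounts.read")
assert not has_scope(token, "accounts.write")
```

The key idea is that the gateway does the heavy authentication work once and the backends only verify a signature and check claims.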
OAuth2 has become the de facto industry standard for authorization. The OpenID Connect protocol builds on top of OAuth2 to provide authentication support. In our case, there were many situations where the standard OAuth2 grant types were not applicable, so we added extension grant types. We suggest using a combination of OpenID Connect, OAuth2, and extension grant types to cover your authentication (AuthN) and authorization (AuthZ) needs.
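The SAML-for-token exchange mentioned above maps naturally onto the SAML 2.0 bearer extension grant defined in RFC 7522. A sketch of the token-endpoint request body a client might build (the assertion content is a placeholder, and parameters like client_id are omitted for brevity):

```python
import base64
from urllib.parse import urlencode


def saml_bearer_token_request(saml_assertion_xml: str) -> str:
    """Form-encoded body for an OAuth2 token request using the SAML 2.0
    bearer extension grant (RFC 7522); the assertion is base64url-encoded."""
    assertion = base64.urlsafe_b64encode(saml_assertion_xml.encode()).decode().rstrip("=")
    return urlencode({
        "grant_type": "urn:ietf:params:oauth:grant-type:saml2-bearer",
        "assertion": assertion,
    })


body = saml_bearer_token_request("<saml:Assertion>...</saml:Assertion>")
assert "saml2-bearer" in body
```

A custom grant type in the gateway wrapper would validate the assertion against the IAM solution before issuing the OAuth2 token.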
Test Driven Development (TDD) and Test Automation are key elements of any software implementation.
Starting from day one, we built unit tests, service tests, integration tests, and smoke tests, which gave the API gateway team the confidence to move quickly. Many times a failing test saved us from a potential incident and demonstrated the value of thorough automated testing. What about performance tests? As you add functionality to your API gateway layer, it can add more latency than you expect. Defining SLAs and continuously monitoring your performance will greatly reduce production issues.
We also had health checks for each service, which helped us make sure the API gateway and services were hitting our high-availability goals.
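A performance check along these lines can be as simple as asserting a latency percentile against an SLA budget. The budget and sample numbers below are illustrative, not our actual SLAs:

```python
import statistics


def meets_sla(latencies_ms: list, p95_budget_ms: float) -> bool:
    """Check a batch of sampled request latencies against a p95 SLA budget.
    statistics.quantiles(n=20) returns 19 cut points; index 18 is the 95th
    percentile boundary."""
    p95 = statistics.quantiles(latencies_ms, n=20)[18]
    return p95 <= p95_budget_ms


samples = [42, 45, 44, 48, 51, 47, 43, 46, 49, 50,
           44, 45, 46, 47, 48, 43, 44, 45, 46, 250]
assert not meets_sla(samples, p95_budget_ms=100)   # one 250 ms outlier blows the budget
assert meets_sla(samples[:-1], p95_budget_ms=100)  # without it, well under budget
```

Running a check like this in the CI pipeline catches latency regressions before they reach production.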
Having Continuous Integration and Continuous Delivery (CI/CD) pipelines helps speed up adding microservices to an API gateway.
Once our API gateway infrastructure was ready, the next step was to onboard our microservices. We used a pass-through mechanism, so we only needed a single proxy in Apigee that intercepted each request and routed it to the appropriate microservice based on the base path.
The benefit of this was that we didn’t have to duplicate the proxy code for each of our back-end services, which helped us stay lean. However, we learned that which strategy works really depends on your enterprise requirements. The base-path approach let us implement quickly, but we should have slowed down and done more thorough URL resource modeling; spending more time in the design phase would have produced a better REST resource model. This leads to the next point: get your resource modeling right.
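The base-path routing idea can be sketched in a few lines; the paths and backend hosts here are hypothetical, not our actual services:

```python
# Hypothetical base-path -> backend mapping.
ROUTES = {
    "/accounts": "https://accounts-svc.internal",
    "/holdings": "https://holdings-svc.internal",
    "/advice":   "https://advice-svc.internal",
}


def route(request_path: str):
    """Pick a backend by longest matching base path, the way a single
    pass-through proxy might. Returns None when nothing matches."""
    for base in sorted(ROUTES, key=len, reverse=True):
        if request_path == base or request_path.startswith(base + "/"):
            return ROUTES[base]
    return None


assert route("/accounts/123") == "https://accounts-svc.internal"
assert route("/unknown") is None
```

Matching the longest base path first avoids a short prefix like `/a` shadowing a more specific route.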
Defining your APIs correctly is critical, since it is very difficult to change them later. This is even more important with mobile and external consumers. For example, once you release a mobile app, it is hard to make breaking changes to the APIs, because forcing users to upgrade is generally considered bad practice.
When we first started exposing our APIs, we used standard REST resource modeling with nouns for resources and HTTP verbs for actions. We also exposed each service’s APIs directly via the API gateway. An effort is now in progress to rationalize the APIs along domain-driven lines, so that end clients always see domain-specific endpoints regardless of which microservice serves the response.
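As a toy illustration of noun-based resource modeling, a linter that flags verb-like path segments might look like this; the verb list and example paths are our own, not a standard:

```python
# Verbs that should not appear as path segments in a noun-based REST design;
# the action belongs in the HTTP method (GET/POST/PUT/DELETE), not the URL.
VERBS = {"get", "create", "update", "delete", "fetch", "list"}


def lint_path(path: str) -> list:
    """Return the path segments that look like verbs."""
    return [seg for seg in path.strip("/").split("/") if seg.lower() in VERBS]


assert lint_path("/accounts/123/holdings") == []       # nouns only: fine
assert lint_path("/accounts/create") == ["create"]     # use POST /accounts instead
```

A check like this is no substitute for deliberate resource modeling, but it catches the most common RPC-style habits early.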
As you advance in your API gateway journey, you will find that one microservice needs to call another. This raises many questions. How does one service discover other services? What is the authentication model between services? If services call each other via the API gateway, how are credentials passed? How can you reduce latency between internal microservices by not going through the API gateway at all?
We answered these questions on a service-by-service basis. For services that were already exposed externally, we allowed other services to call them through the API gateway, which had the benefit of a single auth mechanism for all clients. For internal services not exposed through the gateway, we have our own custom auth implementation, which we will revisit when it requires our attention.
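For service-to-service calls through a gateway, the standard OAuth2 client_credentials grant is a common fit. A sketch of the token request a calling service might build (the client ID, secret, and scope are placeholders):

```python
import base64
from urllib.parse import urlencode


def client_credentials_request(client_id: str, client_secret: str, scope: str):
    """Headers and body for an OAuth2 client_credentials token request,
    the grant commonly used for machine-to-machine calls."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urlencode({"grant_type": "client_credentials", "scope": scope})
    return headers, body


headers, body = client_credentials_request("holdings-svc", "s3cret", "accounts.read")
assert "grant_type=client_credentials" in body
```

The calling service caches the resulting token and presents it as a bearer token on each gateway call until it expires.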
Redundancy is very important for achieving High Availability (HA); without it, you have a single point of failure. In software engineering there is a saying that a single point of failure will fail when you least expect it. We learned this the hard way: the web server responsible for mutual auth from the cloud API gateway was not HA from day one, and this resulted in a production outage. We learned from our mistake and do not intend to repeat it.
Distributed tracing becomes critical for troubleshooting issues in distributed systems.
Knowing the importance of distributed tracing, we implemented logging in each component and aggregated the logs to our centralized Splunk server. We used an Apigee-generated UUID for each request to trace it through the services, and we also return that UUID in the response so API consumers can use it in their own troubleshooting. A single ID lets everyone coordinate when trouble arises. We also had session-tracking requirements, which we implemented with an additional session header.
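The UUID propagation can be sketched like this; the header name is a placeholder, and Apigee’s actual correlation header may differ:

```python
import uuid

TRACE_HEADER = "X-Request-Id"  # placeholder header name


def ensure_trace_id(headers: dict) -> dict:
    """Reuse the caller's trace ID if present, otherwise mint one, so every
    hop (and the response) carries the same UUID for log correlation."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, str(uuid.uuid4()))
    return out


incoming = ensure_trace_id({})          # edge request: a new UUID is minted
downstream = ensure_trace_id(incoming)  # internal hop: the same UUID is reused
assert downstream[TRACE_HEADER] == incoming[TRACE_HEADER]
```

Every component logs this ID with each log line, so one Splunk search over the ID reconstructs the whole request path.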
Analytics is very important to any system. Apigee provides many reports out of the box, such as traffic by developer app, traffic by device type, overhead added by the API gateway, total and average response time, total errors generated by the back-end server, and back-end processing time per URI. We also created custom reports on top of our custom logging, which gave us high-level stats aggregated by API success/failure and a breakdown of successes and failures for each client, SPA, and service. This also let us set alerts on any deviations. The most common fields we log are environment, HTTP method, URI, total_response_time, client_app_name, is_gateway_error, is_backend_error, and http_response_code.
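A custom log line carrying those fields might be assembled like this; the error-classification rules are illustrative, not our production logic:

```python
import json
import time


def access_log_line(method, uri, status, started_at, client_app, env="prod"):
    """One JSON log line with the fields we aggregate in Splunk; the field
    names mirror the ones listed above."""
    return json.dumps({
        "environment": env,
        "http_method": method,
        "uri": uri,
        "http_response_code": status,
        "total_response_time": round((time.time() - started_at) * 1000, 1),
        "client_app_name": client_app,
        # Illustrative split: gateway-level rejections vs. backend failures.
        "is_gateway_error": status in (429, 503),
        "is_backend_error": 500 <= status < 600 and status != 503,
    })


line = access_log_line("GET", "/accounts/123", 200, time.time(), "partner-portal")
assert json.loads(line)["http_method"] == "GET"
```

Emitting one structured JSON object per request makes the Splunk aggregations and alert thresholds straightforward to define.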
Finally, whenever in doubt, start simple and iterate on the solution. Avoid the trap of building for hypothetical situations if you really don’t know if the situation will occur. This mantra really worked for us.
To summarize, here are our suggestions:
- Use API First Development Model
- Be Careful In API Gateway Selection
- Define Your Authentication And Authorization Strategy
- Implement Automation Testing Strategy
- Make Onboarding Microservices Seamless
- Get Your Resource Modeling Right
- Define How Service To Service Auth Would Work
- Develop Redundancy From Day One
- Have A Good Distributed Tracing Strategy
- Use Analytics
There are many other topics we have not touched on in this post, such as handling SPAs and refresh tokens; we will cover those in another blog post. We hope that some of the things we’ve learned will help you evolve your own API gateway platform.