Final consistency scheme for data in ServiceComb
This article is issued by the &&ServiceComb community authorized by the HUAWEI micro service engine technical team.
Data consistency is an important issue to be considered in building a business system, and we used to rely on databases to ensure consistency of the data. But it is a challenging problem to achieve data consistency in the micro service architecture and in a distributed environment.ServiceCombAs an open source micro service framework, it is committed to solving the problems in the process of micro service development. We recently launchedServiceComb-SagaThe project solves the problem of the final consistency of data in the distributed environment. This article will show you why data consistency is so important? What is Saga?
Data consistency of single application
Imagine if we run a large enterprise with an airline, a car rental company, and a chain hotel. We provide one-stop travel planning services for our customers, so that customers only need to provide destinations, and we help customers book air tickets, rent cars and book hotels. From the business point of view, we have to ensure that the three service reservations are completed to meet a successful tour, otherwise we can't do it.
Our monomer application to meet this requirement is very simple. We only need to put the three service requests in the same database transaction. The database will help us to ensure all the success or all rollback.
When this function is online, the company is very satisfied and the customer is very happy.
Data consistency in a micro service scene
In the past few years, our travel planning service is very successful, the enterprise is thriving, and the number of users has doubled. Subordinate airlines, car rental companies and chain hotels have launched more services to meet customer needs. Our application and development teams are also growing. Now our single application has become so complex that no one knows how the entire application works. What's worse is that the new feature line now needs all the R & D teams to work out for weeks and nights. As the market share has gone from bad to worse, the company is not satisfied with the R & D department at the top of the company.
After several rounds of discussion, we finally decided to a large single application of four: ticket booking service, car rental, hotel reservation service, and payment services. The service uses its own database and communicates through the HTTP protocol. The team responsible for the services will be launched in accordance with the market demand in accordance with its own development rhythm. Now we are faced with new challenges: how to ensure that the first three services have been booked up to meet a successful tour schedule, or can not be established business rules? Now services have their own boundaries, and the selection of database is not the same. The solution to ensure data consistency through the database is not feasible.
Luckily we found a wonderful article on the InternetpaperThe data consistency solution proposed in this paper Saga exactly meets our business requirements.
Saga is a long-lived transaction, can be decomposed into sub transactions can be interleaved operation set. Each of these subtransactions is a real transaction that maintains the consistency of the database.
In our business scenario, a travel planning transaction is a Saga, which contains four sub transactions: airline reservation, car rental, hotel reservation, and payment.
Chris RichardsonIn his articlePattern: SagaSaga has a description of it. Caitie McCaffrey is also in herspeechHow about Microsoft?Halo 4How to use saga to solve the problem of data consistency in the game.
The operating principle of Saga
Transactions in Saga are interrelated and should be executed as (non atomic) units. Any unfulfilled Saga is not satisfied, and if it happens, it must be compensated. To fix the part that is not fully executed, each saga sub transaction T1 should provide the corresponding compensation transaction C1
We define the following transactions and their corresponding transaction compensation according to the above rules:
When each saga sub transaction is T1, T2,... Tn has the corresponding compensation definition C1, C2,... Cn-1, then the saga system can guarantee 
- Subtransaction sequence T1, T2,... Tn is finished (best case)
- Or sequence T1, T2,... Tj, Cj,... C2, C1, 0 < J < n, completed
In other words, through the above defined transaction / compensation, Saga guarantees that the following business rules are met:
- All the reservations are executed successfully, and if any one fails, it will be canceled.
- If the final payment fails, all the reservations will be cancelled
The recovery mode of Saga
Original paperTwo types of Saga recovery methods are described in this paper.
- Restores all completed transactions, if any subtransaction fails
- Retry the failed transaction forward and assume that each sub transaction will eventually succeed
Obviously, there is no need to provide compensation transactions for forward recovery. If your business transactions are always successful or the compensation transaction is hard to define or impossible, forward recovery is more in line with your needs.
The theory of compensating transactions never fail, however, in the distributed world, the server may be down, the network may fail, even the data center may also be blackout. What can we do in this case? The last measure is to provide back measures, such as manual intervention.
Conditions for using Saga
Saga looks very promising to meet our needs. All living affairs can do? There are some restrictions here:
- Saga allows only two levels of nesting, top-level Saga and simple child transaction 
- In the outer layer, all atomicity can not be satisfied. That is, sagas may see some of the other sagas results, 
- Each subtransaction should be an independent atomic behavior 
In our business scenario, flight reservation, car rental, hotel reservation and payment are natural independent behaviors, and every transaction can use the database of corresponding services to ensure atomic operation.
We do not need atomicity at the planning level. One user can reserve the last ticket and then be cancelled due to a lack of credit card balance. At the same time, another user may start to see that there are no more tickets. Then, because the former reservation is cancelled, the last ticket is released, and the last seat is grabbed and the itinerary is completed.
Compensation also needs to be considered:
- The compensation transaction withdraws the behavior of the transaction Ti from the semantic point of view, but it may not be able to return the database to the state when the Ti is executed. (for example, if a transaction triggers a missile launch, it may not be able to revoke this operation)
But this is not a problem for our business. In fact, unrevocable behavior may also be compensated. For example, a transaction that sends an e-mail can be compensated by sending another email to explain the problem.
Now we have a solution to the problem of data consistency through Saga. It allows us to perform all the transactions successfully, or to compensate for the successful transactions when any transaction fails. Although Saga does not provideACIDIt is guaranteed, but it still applies to many scenarios where data is finally conformance. So how do we design a Saga system?
Saga ensures that all subtransactions can be completed or compensated, but the Saga system itself may also collapse. When Saga crashes, it may be in the following states:
Saga received a transaction request, but it has not started yet. The micro service state corresponding to the factor transaction has not been modified by the Saga, and we do not need to do anything.
Some subtransactions have been completed. After restarting, the Saga must resume the transaction that was last completed.
- The subtransaction has begun, but it has not been completed. Because the remote service may have completed the transaction, or the transaction failed, even the service request timed out, Saga could only restart the sub transactions that had not been confirmed before. This means that the subtransaction must be idempotent.
- The subtransaction failed, and its compensation transaction has not begun. The Saga must execute the corresponding compensation transaction after the restart.
- The compensation transaction has begun but has not been completed. The solution is the same as the last one. This means that the compensation transaction must also be idempotent.
- All subtransactions or compensation transactions have been completed, the same as the first case.
In order to recover to the above state, we must track every step of the subtransaction and the compensation transaction. We decide to meet the above requirements through an event and save the following events in a persistent storage named saga log:
- Saga started event saves the entire saga request, including multiple transaction / compensation requests
- Transaction started event saves corresponding transaction requests
- Transaction ended event saves the corresponding transaction request and its reply
- Transaction aborted event saves the reason for the corresponding transaction request and failure
- Transaction compensated event saves the corresponding compensation request and its reply
- The Saga ended event marks the end of the saga transaction request and does not need to save any content
By persisting these events in Saga log, we can restore the saga to any of the above states.
Due to persistent Saga only need to do event, and the event content stored in the form of JSON, Saga log to achieve a very flexible, database (SQL or NoSQL), persistent message queue, even ordinary files can be used for event storage, of course, some can quickly help the recovery saga.
The data structure of the Saga request
Under our business scenario, flight reservation, car rental and hotel reservation are not dependent on each other, and can be processed in parallel. But for our customers, it is more friendly to pay once after all the successful booking. Then the transaction relationships between the four services can be expressed in the following diagram:
The data structure of the travel planning request is implemented asDirected acyclic graphIt's just right. The root of the graph is the saga startup task, and the leaf is the end task of the saga.
As mentioned above, flight reservations, car rentals and hotel reservations can be handled in parallel. But doing this can cause another problem: what if a flight reservation fails and a car rental is being handled? We can't wait for the car rental service to respond, because we don't know how long it needs to wait.
The best way is to send a car rental request again and get a response so that we can continue the compensation operation. But if the car rental service never responds, we may need to take back measures, such as manual intervention.
The timeout reservation request may eventually be received by the car rental service, when the service has already processed the same reservation and cancellation request.
Therefore, the implementation of the service must ensure that the corresponding transaction request received again is invalid after the compensation request is executed. Caitie McCaffrey called it an exchangeable compensation request (commutative compensating compensating) in her speech, Distributed Sagas: A Protocol for Coordinating Microservices.
ACID and Saga
ACID is a consistency model with the following attributes:
- Atomicity (Atomicity)
- Conformance (Consistency)
- Isolation (Isolation)
- Persistence (Durability)
Saga does not provide a ACID guarantee, because atomicity and isolation can't be met. The original paper is described as follows:
Full atomicity is not provided. That is, sagas may view the partial
Through saga log, Saga can guarantee consistency and persistence.
Finally, our Saga architecture is as follows:
- Saga Execution Component parses the request JSON and builds the request diagram
- TaskRunner ensures the execution order of the request with the task queue
- TaskConsumer handles the Saga task, writes the event to the saga log, and sends the request to the remote service
In the previous article, I talked about how Saga is designed under ServiceComb. However, there are other data consistency solutions in the industry, such as the two phase submission (2PC) and the Try-Confirm / Cancel (TCC). What's special about saga compared to that?
The two phase submits Two-Phase Commit (2PC)
The two stage commit protocol is a distributed algorithm, which is used to coordinate all processes involved in the distributed atomic transaction, so as to ensure that they complete the commit or abort transaction.
2PC contains two stages:
The voting phase coordinator initiates a vote request to all services, and the service answers yes or no. If any service is returned to no to reject or timeout, the coordinator sends the stop message at the next stage.
At the decision stage, if all services reply to yes, the coordinator sends commit messages to the service, and then the service tells the transaction to be completed or failed. If any service submission fails, the coordinator will start additional steps to abort the transaction.
Before the end of the voting phase and the end of the decision stage, the service is in a state of uncertainty because they are not sure whether the transaction continues. When the service is in an uncertain state and is out of connection with the coordinator, it can only choose to wait for the recovery of the coordinator, or consult other decisions to identify the coordinator in the determined service. In the worst case, a n in a service that is in an uncertain state to other N-1 services will produce O (N2) messages.
In addition, 2PC is a blocking protocol. The service needs to wait for the coordinator's decision after the vote, and the service will block and lock the resource. Because of its blocking mechanism and the worst time complexity, 2PC can't adapt to the need to expand with the increase in the number of services involved in the transaction.
More details about 2PC implementation can be referred to 2 and 3.
TCC is also a compensatory transaction model that supports the two phase of the business model.
- In an attempt phase, the service is placed in a pending state. For example, when receiving the trial request, the flight reservation service will reserve a seat for the customer, insert the customer reservation record in the database, and set the record as a reserved state. If any service fails or timeout, the coordinator will send a cancellation request at the next stage.
- The service is set to a confirmation state at the confirmation stage. It is confirmed that the request will confirm the reservation of the customer, and the service can charge the ticket to the customer. The customer reservation record in the database will also be updated to the confirmation state. If any service cannot be confirmed or timed out, the coordinator will retry the confirmation request until it is successful, or take a rollback measure after retrying a certain number, such as manual intervention.
Compared with saga, the advantage of TCC lies in trying to turn service into the state of processing rather than the final state, which makes it easy to design corresponding canceling operations.
For example, the attempt request of e-mail service can mark the message as ready to send, and send mail only after confirmation, and the corresponding cancellation request is only to mark the message as obsolete. But if saga is used, the transaction will send an email and its corresponding compensation transaction may need to send another email to explain.
The drawback of TCC is that its two phase protocol needs to design additional service to be processed, and additional interfaces to handle trial requests. In addition, the time that TCC processes transaction requests may be two times that of saga, because TCC needs to communicate with each service for two times, and its confirmation stage can only start after receiving all services' response to trial requests.
More details about TCC can be referred toTransactions for the REST of Us.
Event driven architecture
Like TCC, in an event driven architecture, each service transaction involved need to support the long pending additional. The service that receives the transaction request will insert a new record in its database, set the record state to be processed and send a new event to the next service in the transaction sequence.
Because in the insert records after service could collapse, we cannot determine whether a new event has been sent, so each service needs extra event table to track the current long transaction in which step.
Once the last long service in the affairs of the sub transaction, it will notify it in the transaction before a service. The service that receives the completion of the event sets the state of its record in the database to complete.
If carefully compared, the event driven architecture is like a non centralized event based TCC implementation. The architecture is like a non centralized event based saga implementation if the service record is directly set to the final state by removing the pending state. De centralization can achieve service autonomy, but it also creates a closer coupling between services. It is assumed that new business requirements are added to the new process D between service B and C. In an event driven architecture, the service B and C must change the code to adapt to the new process D.
Saga is on the contrary, all of these are coupled in the saga system, when adding a new process in the long transaction, the existing service does not require any changes.
More details can be referred toEvent-Driven Data Management for Microservices.
Centralized and non centralized implementation
This Saga series of articles discussed centralized saga design. But saga can also be implemented with a non centralized solution. So what's the difference between a non - centralized version?
The non centralized saga has no full-time coordinator. The service that starts the next service call is the current coordinator. For example,
- Service A receives request for transaction requests that require data consistency between A, B and C.
- A completes its subtransactions and passes the request to the next service in the transaction, service B.
- B completes its subtransactions and passes the request to C, and so on.
- If the C processing request fails, B is responsible for starting the compensation transaction and requiring A to roll back.
Compared with the centralized, the non centralized implementation has the advantage of service autonomy. But each service needs to include a data consistency protocol and provide the additional persistence facilities it needs.
We tend to be more autonomous business services, but also related to complexity of the service for many applications, such as data consistency, service monitoring and message passing, will these thorny problems of centralized processing, business services can be released from the complexity of the application, focusing on the processing of complex business, so we use a centralized saga design.
In addition, as the number of services related to the long transaction growth, the relationship between services is becoming more and more difficult to understand, will soon appear below the Death Star shape.
At the same time, the problem in the long transaction location has become more complex, because the service log across the cluster nodes.
This article compares saga with other data consistency solutions. Saga is more extensible than the two phase submission. In the case of transactional compensation, compared to TCC, saga has almost no change in business logic and has a higher performance. The centralized saga design decouples the service and data consistency logic and its persistence facilities, and makes it easier to troubleshoot issues in the transaction.
Original Paper on Sagas by By Hector Garcia-Molina & Kenneth Salem
Original Paper on Sagas by By Hector Garcia-Molina & Kenneth Salem
Gifford, David K and James E Donahue, Coordinating Independent Atomic Actions.
ServiceComb Saga Project:Https://github.com/ServiceComb/saga
ServiceComb official website