
Cloud Computing Core Technology Architecture Forum (Part 1): Building a Highly Available, Highly Scalable, Easy-to-Operate Cloud Architecture

Published 2015-06-08 17:56 | Source: CSDN | Author: Zhou Jianding

Abstract: Six experts, including IBM's Yang Haiming, Sina's Cong Lei, QingCloud's Yang Jintao, Intel's Hu Xiao, and Qiniu's Li Daobing, shared how to build a highly available, scalable, easy-to-operate cloud architecture that lets enterprises consume cloud computing services on demand. Their talks covered architecture analysis, platform practice, technology research and development, and practical applications.

The "Cloud Computing Core Technology Architecture Forum (Part 1)" was held on the afternoon of June 4. Six speakers took the stage: Yang Haiming, chief architect for the smart cloud platform and OpenStack at IBM China; Cong Lei, head of Sina cloud computing and founder of SAE; Yang Yi, senior architect at Baidu Open Cloud; Yang Jintao, system engineer at QingCloud; Hu Xiao, cloud computing and big data solutions architect at Intel; and Li Daobing, chief architect at Qiniu Cloud Storage. From the angles of architecture analysis, platform practice, technology research and development, and practical application, they shared how to build a highly available, scalable, easy-to-operate cloud architecture that lets enterprises consume cloud computing services on demand.

IBM China smart cloud platform and OpenStack chief architect Yang Haiming: open cloud architecture + enterprise enhancement

The first speaker was Yang Haiming, chief architect for the smart cloud platform and OpenStack at IBM China. He began by describing three ways to build an enterprise cloud platform. Building it only from open source software such as OpenStack and KVM may make early deployment and launch faster, but later maintenance is a nightmare. Traditional commercial solutions respond relatively slowly to changing business needs. A more reasonable way is therefore the middle path: an open architecture combined with enterprise-specific enhancements.

IBM's cloud computing platform fully supports open standards, including OpenStack, Docker, and Cloud Foundry. Yang Haiming introduced the five directions IBM is pursuing: building an open cloud on POWER and OpenPOWER; providing an open, high-performance cloud platform on Power Linux; open interoperability between OpenStack clouds; open container technology; and building an ecosystem around Bluemix.

IBM China smart cloud platform and OpenStack chief architect Yang Haiming

Yang Haiming said the Power architecture supports FPGA and GPU acceleration to improve the performance of the cloud platform and of the databases and applications that run on it, and fully supports Linux platforms such as Ubuntu, Red Hat, and SUSE. PowerKVM uses enhanced core-splitting technology to schedule several VMs on the same core at the same time, making deep use of POWER8 multithreading performance. For open interoperability between OpenStack clouds, IBM works in three directions: open, modular design; multi-vendor collaboration; and rapid innovation. By supporting the OpenStack Foundation's official interoperability test certification, IBM aims to address portability within the OpenStack ecosystem and to enable more flexible resource allocation and billing standards. Its OpenStack multi-cloud management integrates not only the OpenStack distributions from IBM, HP, and Mirantis but also VMware. Through integrated management of SDC, SDS, SDN, and both physical and virtual resources, it dynamically combines and centrally manages resources from private environments, MSPs, and public clouds, which leads to a hybrid cloud.

Dynamic hybrid cloud based on OpenStack

In the Kilo release, IBM's enhancements to OpenStack for enterprise environments include:

  • Bare metal (Ironic) enhancements and integration with professional cluster management tools
  • File system (Manila) support and integration with unified storage systems
  • Integration with enterprise service scheduling
  • Enhanced installation and deployment inside the enterprise
  • Enhanced enterprise resource management (Power, mainframe, VMware)

Yang Haiming believes that Docker, supported alongside physical machines and VMs, provides good technical backing for innovation on PowerLinux clouds:

  • Similar to virtual machines, but with very little performance overhead
  • Higher resource utilization
  • Better performance: creating a new container takes milliseconds
  • Containers can be managed in the same way as virtual machines

Yang Haiming closed by stressing two points. First, the cloud is an open environment, and it is not enough for one layer to be open: every layer needs to be open. Second, an open architecture alone can realize very good ideas, but those ideas still have to be carried into real production. An enterprise's real production environment will never exactly match what the open source software assumes, so your own professional expertise is actually more important.

Sina cloud computing head and SAE founder Cong Lei: the core operations and maintenance system of cloud computing

Cong Lei, head of Sina cloud computing and founder of SAE, believes that operations and maintenance (O&M) is the core of cloud computing, but that operating a cloud platform for enterprises in China is very difficult, for the following reasons:

  • Users are unpredictable
  • Business behavior is unpredictable
  • Services are diverse
  • Resources are shared
  • Early overhyping left users with a distorted understanding of the cloud

Cong Lei said that O&M is responsible for guaranteeing service reliability (SLA), quality of service (performance), and cost optimization (cost). He then described SAE's practical experience at each of these three levels.

Sina cloud computing head and SAE founder Cong Lei

Guaranteeing the service SLA comes down to two points: every resource on the platform (software, hardware, and people) must be manageable, and it must be monitorable. Managing people means establishing a sound on-call system, a reward and penalty system, permission allocation and tracking, and O&M training. Cong Lei believes that "every failure on a cloud computing platform comes from a change". This covers hardware changes (bringing machines and equipment online, taking them offline, repair and replacement, relocation) and software changes (bringing services online or offline, configuration changes, capacity expansion).

All hardware onboarding and decommissioning, including warranty repair, must strictly follow the defined process. When a machine is taken offline, it is very important to first isolate it from the outside world, and to carry out the actual decommissioning process only if no problem appears. Software changes need a very strict process for judging whether the change is correct. It is not enough to refresh a few pages and apps: the state of the whole platform has to be checked, every upgrade must be reversible within ten seconds, and every responsible leader must sign off. In addition, a fault management system should be established, covering fault handling, fault escalation, and fault review.

Resource monitoring is divided into three layers: platform, service, and business. Platform monitoring covers the system, network, memory, login permissions, and so on. SAE has hundreds of monitoring pages built on Zabbix and uses Zabbix to monitor the platform's hardware resources. For every business, an alarm is raised when the number of 200 or 500 responses changes sharply. All services are also monitored through their APIs: since every service exposes an API, a scheduled job calls each service's API and judges whether the call succeeded or failed. Service state is monitored as well, including PID, log health, and GC fluctuation. It is not enough to check whether the PID exists. You also need to know whether the service's log is still being refreshed, and to determine whether the process is blocked or hung. The number and duration of operations across the whole service are watched too, so that a service does not suddenly go down unnoticed.
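The point about checking more than the PID can be illustrated with a minimal Python sketch. It is not SAE's implementation: the function names, the 60-second freshness threshold, and the three state labels are assumptions made for illustration, and the PID check relies on POSIX signal semantics.

```python
import os
import time
from typing import Optional


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def log_is_fresh(log_path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the log file was modified within the last max_age_s seconds."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(log_path) <= max_age_s
    except OSError:
        return False  # a missing or unreadable log counts as not fresh


def service_state(pid: int, log_path: str, max_age_s: float = 60.0) -> str:
    """Classify a service as 'down', 'stalled' (alive but silent), or 'ok'."""
    if not pid_exists(pid):
        return "down"
    if not log_is_fresh(log_path, max_age_s):
        return "stalled"  # hypothetical label for a blocked or hung process
    return "ok"
```

A scheduler could call `service_state` for each service and raise an alarm on anything other than `"ok"`. A process that is alive but has stopped writing its log is reported as `"stalled"` rather than healthy.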

Monitoring the platform's hardware resources through Zabbix

Cong Lei pointed out a common monitoring mistake: monitoring APIs, logs, and system parameters, but not monitoring at the business layer. He said monitoring should be done from the actual user's point of view, which means user monitoring, request path tracing, and life cycle monitoring. SAE regularly simulates the entire user journey: registration, login, certification review, creating the first app, destroying the app, and finally exiting. A robot walks through this process on a schedule and raises an alarm as soon as a problem appears.
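A scheduled life cycle check of this kind can be sketched as a list of named steps that are run in order, stopping at the first failure. This is a minimal illustration under stated assumptions: `run_journey` and the step names are hypothetical, and each step would in practice wrap a real request against the platform.

```python
from typing import Callable, List, Optional, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_journey(steps: List[Step]) -> Optional[str]:
    """Run the steps in order; return the name of the first failing step, or None."""
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # an exception in a step counts as a failure
        if not ok:
            return name
    return None


if __name__ == "__main__":
    # Placeholder steps; real ones would register, log in, create an app, and so on.
    journey: List[Step] = [
        ("register", lambda: True),
        ("login", lambda: True),
        ("create_app", lambda: True),
        ("destroy_app", lambda: True),
        ("logout", lambda: True),
    ]
    failed = run_journey(journey)
    print("journey ok" if failed is None else f"ALARM: step failed: {failed}")
```

Returning the name of the failing step lets the alarm say where in the user journey the problem occurred, not just that something is wrong.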

Alarming needs attention in several respects: solving both missed alarms and excessive alarms; distinguishing SLA alarms from fault alarms; grading alarms by channel (mail > message > SMS > phone); and building an alarm logic chain (something SAE is still working on).
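One way to read the graded channels is as an escalation ladder. The sketch below assumes that a higher severity starts further up the ladder and that an unacknowledged alarm climbs one level per fixed interval. The severity scale, the 10-minute step, and the function name are assumptions for illustration; the talk did not describe SAE's actual rules.

```python
from typing import List

# Channels ordered from least to most intrusive, as listed in the talk.
CHANNELS: List[str] = ["mail", "message", "sms", "phone"]


def pick_channel(severity: int, minutes_unacked: float = 0.0,
                 step_minutes: float = 10.0) -> str:
    """Choose a channel from the severity (0..3) and how long the alarm has gone unacknowledged."""
    if step_minutes <= 0:
        raise ValueError("step_minutes must be positive")
    level = severity + int(minutes_unacked // step_minutes)
    level = max(0, min(level, len(CHANNELS) - 1))  # clamp to the ladder
    return CHANNELS[level]
```

With these assumptions, a low-severity alarm starts as mail and reaches SMS after 20 unacknowledged minutes, while a top-severity alarm goes straight to a phone call.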

For service quality, the focus is on SLA rankings, service quality analysis, periodic stress testing to the limit, and incentives tied to work orders. Cost optimization means reducing the unit cost of service, through virtualization, custom hardware models, and mixed deployment of different businesses, and making every developer aware of the cost and efficiency of their business so that resources are not misused.

Finally, Cong Lei shared four principles for small-team operations:

  • Hold to principles: dare to say no
  • Treat insiders and outsiders alike: give no one special support
  • Be accountable to the user, not to the boss
  • Do not split roles: practice full-stack operations

Baidu Open Cloud senior architect Yang Yi: virtual network practice at Baidu Open Cloud

Baidu Open Cloud senior architect Yang Yi shared the technical practice behind the virtual network, one of the basic components of Baidu Open Cloud's services. He said that compute and storage services largely have established standards and solutions, but network services are not yet mature. Today's NFV and SDN solve only part of the cloud networking problem (virtual private clouds, for example, place higher demands on the network) and cannot yet deliver a complete network as a service.

Yang Yi described in detail, from the angles of technology selection and engineering practice, the choices and thinking behind networking in the course of building Baidu Open Cloud's services.

Baidu Open Cloud senior architect Yang Yi

Technology selection is divided into control plane and data plane technology.

Control plane:

  • Neutron: the industry's common API, with a pluggable plugin architecture
  • SDN & OpenFlow: centralized control and software definition, used as a complement to make the network more flexible and robust

Data plane:

  • KVM and Open vSwitch, currently the most mature scheme in the community, were chosen to implement single-host network virtualization and virtual access. Physical switches were not used for access mainly because software development is more controllable. Later features and development will of course make full use of hardware offload to improve performance.
  • VXLAN is used to implement the overlay and provide a large layer-2 network. The benefit is that the existing physical network structure does not have to change, while tenants get good isolation and scalability. Compared with NVGRE and STT, VXLAN is L2-in-UDP and makes better use of multipath hashing.
  • DPDK was finally selected to implement some high-performance network devices on x86, partly because the team had already accumulated experience with it, and partly because it is almost an industry standard.
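The multipath advantage of L2-in-UDP comes from the outer UDP source port: the VXLAN specification (RFC 7348) recommends deriving it from a hash of the inner packet's fields, within the dynamic port range 49152 to 65535, so that physical switches hashing on the outer headers spread different inner flows across different paths. The Python sketch below illustrates the idea. The choice of CRC32 and of these five inner fields is an assumption for illustration, not what Open vSwitch or any NIC actually computes.

```python
import zlib

VXLAN_DST_PORT = 4789       # IANA-assigned VXLAN destination port
PORT_RANGE_START = 49152    # start of the dynamic/private port range
PORT_RANGE_SIZE = 65536 - PORT_RANGE_START


def vxlan_source_port(src_ip: str, dst_ip: str, proto: int,
                      src_port: int, dst_port: int) -> int:
    """Derive the outer UDP source port from a hash of the inner flow."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    # Illustrative hash only; real datapaths use their own flow hash.
    return PORT_RANGE_START + zlib.crc32(key) % PORT_RANGE_SIZE
```

All packets of one inner flow get the same outer source port, so they stay in order on one path, while different flows between the same pair of hosts usually get different ports and can take different equal-cost paths.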

Baidu Open Cloud network service control plane

On engineering practice, Yang Yi covered three aspects: performance optimization, stability, and scalability.

The basic method of performance optimization is to establish a baseline first, including test cases and measurable metrics. This is a key step. What follows is continuous iteration: run the baseline test, collect data and analyze the results, apply an optimization, and measure the metrics again. Performance optimization is endless, so the target should be set according to actual needs.

For the virtual network specifically, looking along the whole network path:

  • At the VM (guest) level, virtio/vhost is optimized: zero copy, polling, and multi-core multi-queue are used to improve the performance of the virtual network adapter.
  • At the overlay (physical host) level, VXLAN's characteristics are taken into account: multi-queue support for UDP is enabled on the NIC, overlay packets that exceed the Ethernet MTU are handled, and a lot of work has gone into segmentation, especially offload support for segmenting VXLAN traffic.
  • DPDK is used to implement some high-performance network middlebox components, such as the vRouter and vSwitch. Processing packets in userspace performs much better than in the kernel, and it is also more flexible and easier to develop, with no need to worry about compatibility with the standard protocol stack.

In terms of stability, Baidu generally uses the SLA as the indicator of service availability. Yang Yi believes the SLA is the biggest challenge for cloud services. Availability is calculated as: Availability = (MTBF / (MTBF + MTTR)) × 100%, where MTBF (mean time between failures) is the average time the service keeps running between two interruptions, and MTTR (mean time to recovery) is the average time it takes to recover after the service is interrupted.

From this formula, availability can be improved in two ways: by reducing the number of failures, and by shortening the recovery time after a failure. Common solutions include redundancy across multiple groups of hardware and removing single points of failure (SPOF) in software.
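The availability formula can be worked through with a few lines of Python. The numbers in the usage note are made up for illustration and are not Baidu's figures; `yearly_downtime_hours` is a helper added here to turn a percentage into something more tangible.

```python
def availability_percent(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR) * 100."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0


def yearly_downtime_hours(availability_pct: float) -> float:
    """Expected downtime per 365-day year at a given availability."""
    return (1.0 - availability_pct / 100.0) * 365 * 24
```

For example, a service that runs 999 hours between failures and takes 1 hour to recover has an availability of about 99.9%, or roughly 8.76 hours of downtime per year. Halving the recovery time or doubling the time between failures both raise the figure, which is the two-sided conclusion drawn in the text.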

In Baidu Open Cloud's actual engineering practice, on the hardware side the physical machines' uplinks use dual-NIC bonding together with switch virtualization across two switches to ensure high availability. On the software side, Baidu did not adopt the L3-HA/DVR schemes provided by the community. After evaluation it found that these schemes are, first, rather complex: they introduce many new components and add a lot of operational difficulty. Second, at larger scale, for example as the number of vRouters grows, many bugs appear. The stability improvements Baidu eventually made include:

  1. A hot upgrade mechanism, so that a normal upgrade does not interrupt service, which makes rapid iteration possible;
  2. An OpenFlow and SDN controller with a combined push-pull model. When a compute node cannot forward packets correctly because its flow table was not configured properly, it goes to the controller and pulls the flow table. This solves the problem of unreliable communication over the MQ;
  3. ECMP is used to build a bvrouter cluster, which completely removes the vRouter single point of failure and also solves the vRouter convergence ratio problem.
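The push-pull idea in item 2 can be sketched in Python as follows: the controller normally pushes flow entries to compute nodes, and a node that misses an entry (for instance because a pushed message was lost) pulls it from the controller instead of dropping traffic. The class and method names are hypothetical and the flow table is reduced to a destination-to-port mapping; this is not Baidu's controller or the OpenFlow wire protocol.

```python
from typing import Dict, Optional


class Controller:
    """Holds the authoritative destination -> output port rules."""

    def __init__(self) -> None:
        self.rules: Dict[str, int] = {}

    def set_rule(self, dst: str, port: int) -> None:
        self.rules[dst] = port

    def lookup(self, dst: str) -> Optional[int]:
        return self.rules.get(dst)


class ComputeNode:
    """Forwards from a local flow table, pulling from the controller on a miss."""

    def __init__(self, controller: Controller) -> None:
        self.controller = controller
        self.flow_table: Dict[str, int] = {}
        self.pulls = 0  # number of times the node had to ask the controller

    def push(self, dst: str, port: int) -> None:
        """Called when the controller pushes an entry (may be lost in transit)."""
        self.flow_table[dst] = port

    def forward(self, dst: str) -> Optional[int]:
        port = self.flow_table.get(dst)
        if port is None:
            self.pulls += 1
            port = self.controller.lookup(dst)  # pull on a table miss
            if port is not None:
                self.flow_table[dst] = port     # cache for later packets
        return port
```

With this arrangement a lost push costs one extra round trip on the first packet rather than a forwarding failure, which is why the pull path compensates for an unreliable message queue.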

After actually deploying OpenStack, Baidu found it hard to scale:

  • The API server itself is stateless, so in principle scalability can be improved simply by adding API servers. In practice there are many bugs in locking and parallel processing between API servers.
  • Its data is stored in the DB, and reading and writing the DB is itself a problem.
  • The MQ may also become a bottleneck. The problem becomes prominent once the cluster grows beyond 500 nodes.
  • The Cells feature provided by the community is rather complex to configure.

For scalability, Baidu follows these design principles:

  • Control plane: process locally as far as possible; simplify state.
  • Data plane: remove components on the path that may become bottlenecks, such as DHCP and the vRouter; eliminate the convergence ratio; turn broadcast into unicast.

Finally, Yang Yi shared one lesson: there is no need to fight hard for the scalability of a single cluster, because the nature of a public cloud means it can be partitioned by user.
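Partitioning by user can be pictured as a small directory that pins each user to one cluster, so capacity grows by adding clusters rather than by stretching one. The sketch below is an illustration only: the least-loaded placement rule and all names are assumptions, and the talk did not say how Baidu assigns users.

```python
from typing import Dict, List


class UserPartitioner:
    """Pins each user to a home cluster; new users go to the least-loaded cluster."""

    def __init__(self, clusters: List[str]) -> None:
        if not clusters:
            raise ValueError("at least one cluster is required")
        self.load: Dict[str, int] = {name: 0 for name in clusters}
        self.home: Dict[str, str] = {}

    def add_cluster(self, name: str) -> None:
        """Scale out by adding a new, empty cluster."""
        self.load.setdefault(name, 0)

    def cluster_for(self, user_id: str) -> str:
        """Return the user's home cluster, assigning one on first sight."""
        if user_id not in self.home:
            target = min(self.load, key=lambda name: self.load[name])
            self.home[user_id] = target
            self.load[target] += 1
        return self.home[user_id]
```

Existing users never move when a cluster is added, so each cluster can stay below the size at which the database and message queue become bottlenecks.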

QingCloud system engineer Yang Jintao: storage practice at QingCloud

QingCloud system engineer Yang Jintao introduced the design ideas behind QingCloud's storage system, the form the architecture finally took, and the resulting products, including block storage, shared storage, and object storage.

QingCloud system engineer Yang Jintao

At the physical device level, QingCloud storage distinguishes SSD, SAS, and SATA, and offers users high-performance block storage, high-capacity block storage, and ultra-high-performance block storage. Shared storage exposes block devices to users in the manner of traditional enterprise storage; the mode currently exposed to users is "Storage". Object storage is built on top of block devices: it can be placed directly on physical resources or run on virtual machines. Block storage, shared storage, and object storage are integrated, that is to say, the block storage underneath can be DAS, shared storage, or object storage.

QingCloud storage system architecture diagram

Yang Jintao noted that the three best-known open source distributed storage projects today are GlusterFS, Ceph, and Sheepdog. They provide block storage in a distributed manner, but they still do not solve the network bandwidth bottleneck of traditional centralized storage. In a public cloud, every compute resource has to reach storage through the network, so the bandwidth demand is hard to meet, and even if it is met, deployment for each compute resource is very troublesome. QingCloud's block storage design therefore had to be software defined, had to meet both performance and capacity requirements, and had to remain compatible with traditional enterprise operations. The final solution is QingCloud's own design built around integration, including integration of the access path so that the storage system is accessed at the nearest point, while the storage system remains logically a distributed system.

Shared storage is designed to meet the operational needs of traditional enterprises. It currently offers the "Storage" mode; the next step is NAS, such as Samba and CIFS.

For object storage, the design has to solve a general problem, because it is not known in advance what kind of data users will store or how large it will be. QingCloud's first approach is multi-region routing. The other is to impose no limits: no restriction on object type, number, or maximum capacity. In the overall structure, a request first reaches the load balancing layer, where it is either redirected or parsed according to the user's request. The gateways above can be extended arbitrarily. Inside a region, the LB and LI layer at the top can be scaled horizontally at will. Below it, to solve the indexing problem, there is a distributed index cluster with first-level and second-level indexes. To prevent a large number of small files from producing more index data than actual stored data, small files need to be merged. The precondition is that the merged data belongs to the same storage space of the same user: files are not merged across users or across storage spaces.

QingCloud object storage architecture
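The small-file merging rule can be sketched as a packer that appends small objects into one blob per (user, storage space) pair and keeps only an offset and length per object in the index. This is a minimal in-memory illustration with hypothetical names, not QingCloud's on-disk format.

```python
from typing import Dict, Tuple

SpaceKey = Tuple[str, str]        # (user, storage space)
ObjectKey = Tuple[str, str, str]  # (user, storage space, object name)


class SmallFilePacker:
    """Packs small objects into one blob per user storage space."""

    def __init__(self) -> None:
        self.blobs: Dict[SpaceKey, bytearray] = {}
        self.index: Dict[ObjectKey, Tuple[int, int]] = {}  # -> (offset, length)

    def put(self, user: str, space: str, name: str, data: bytes) -> None:
        # Objects are only ever merged with others in the same user's space.
        blob = self.blobs.setdefault((user, space), bytearray())
        self.index[(user, space, name)] = (len(blob), len(data))
        blob.extend(data)

    def get(self, user: str, space: str, name: str) -> bytes:
        offset, length = self.index[(user, space, name)]
        return bytes(self.blobs[(user, space)][offset:offset + length])
```

Because the blob key includes both the user and the storage space, data from different users or different spaces never shares a blob, which matches the precondition stated above.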

Finally, Yang Jintao introduced the roadmap for QingCloud storage, which covers the following eight items:

  • CDN
  • Image processing and audio/video processing
  • Integration among QingCloud's own storage services
  • Integration with other QingCloud services
  • Integration with customers' storage environments
  • Client software
  • Cheaper storage hardware
  • Integration with open source projects

Overall, the design ideas behind QingCloud storage are a unified hardware architecture, an integrated distributed storage system, and multidimensional product design.

Intel cloud computing and Big Data Solutions Architect Hu Xiao: Intel video cloud

Intel cloud computing and big data solutions architect Hu Xiao introduced Intel's technologies for Internet video applications and video cloud computing, to help the audience understand how video-related cloud computing can be done better today. The talk covered online video playback, cloud graphics and imaging, and visual content analysis.

Intel cloud computing and Big Data Solutions Architect Hu Xiao

For online video playback, Intel provides the Xeon E3 v4, which is optimized for online video and cloud workstation workloads, including hardware acceleration for video transcoding. Hu Xiao said that Intel CPUs move forward on a two-year cadence, and that the Xeon E3 v4 integrates a high-end graphics engine that raises transcoding capability by a factor of 1.4 over the previous generation. Tight interaction between the CPU and the graphics engine speeds up the whole pipeline, removes the data exchange bottleneck in online video processing, and makes transcoding more efficient.

Through GPU virtualization, Intel supports cloud graphics at three levels (performance, functionality, and sharing) in order to deliver an efficient, high-performance video cloud infrastructure. There are three typical approaches to GPU virtualization.

  • API forwarding: graphics API calls are sent to a backend, processed there, and the results returned. The biggest limitations are that functionality is very restricted and compatibility is troublesome, because a wide variety of APIs has to be supported.
  • Graphics driver: when a virtual machine is initialized, it is specified which virtual machine gets exclusive use of a GPU. This can support different customer tiers in the cloud: non-VIP virtual machines share a GPU, while VIP customers can have a GPU to themselves.
  • Full GPU virtualization: the graphics driver runs in the virtual machines. Control flow and I/O control are forwarded to the hypervisor layer, and the rest can be passed straight to the GPU for processing. Performance is better, and the different protocols used in the VM are not affected.

Intel media Cloud Architecture

Existing public cloud or private cloud software stacks can be combined with Intel's graphics virtualization. Taking OpenStack as an example, Nova can be used with compute nodes running the Xen or KVM hypervisor. Applications are isolated in virtual machines, and the virtual machines need to run the software provided by Intel.

Video cloud technology based on centralized storage makes large-scale video analysis possible. For visual content analysis, Intel has built the IDLF deep learning framework, along with high-performance codec support for large-scale, complex video analysis. The IDLF project helps build networks and supports a variety of compute architectures, including CPU and GPU instruction sets. Its code is optimized at the platform level for high performance, and a single software framework can be used to implement many types of neural network. To meet real-time image and video analysis requirements, a streaming framework for real-time video analysis is needed, with graphics encoding and decoding capability used at the different stages, so that streaming analysis can carry video services. For example, the matching step of face recognition is normally done in back-end big data analysis; if real-time requirements are added, a hybrid CPU and GPU architecture is necessary.

Qiniu Cloud Storage chief architect Li Daobing: cloud storage boosts the development of the short video industry

Qiniu Cloud Storage chief architect Li Daobing shared how cloud storage addresses the pain points of the short video industry. Short video products are growing explosively, and the main pain points are data upload, data storage, data processing clusters, and data distribution. With the wide adoption of SSDs, database processing capacity has risen by 2 to 3 orders of magnitude, so architectural bottlenecks now tend to appear at the data storage level instead.

Qiniu Cloud Storage chief architect Li Daobing

Li Daobing said that cloud storage is now mature enough to meet most data needs. Many cloud storage services offer audio and video processing, and because the storage service has processing capacity of its own, it can absorb peak loads. For example, cloud storage can produce a transcode designed specifically for content review: Qiniu plays the video at double speed, at a resolution of 150x150, with the sound removed and the bit rate lowered, so that 10 seconds of video takes only about 40 KB. It can also schedule intelligently across CDNs to route around a CDN that is unavailable, and use multiple domain names and direct IP download to avoid domain name hijacking.

He also said that data will become more and more important in the future. For example, collected web pages are turned into a search service that meets users' needs, and collected user search and access behavior is turned into an advertising engine that meets customers' needs.

For more, follow the live coverage of the Seventh China Cloud Computing Conference, and @CSDN cloud computing on Sina Weibo.
