
Infrastructure Construction Practice of a Big Data Platform Based on Cloud Computing

The current trend in big data platform construction is toward the cloud and openness. Such a platform should provide PaaS services for all kinds of big data workloads, and should let those services be combined simply and flexibly and customized to meet changing needs. How do we provide a big data platform that is flexible and agile, yet also stable and high-performance? How do we make effective use of the characteristics of cloud computing when building one?

QingCloud system engineer Zhou Xiaosi shared his experience on this subject: building big data infrastructure platforms on cloud computing, and the architectural characteristics involved.

What follows is the original text of the sharing.

----

Good evening, everyone. I am Zhou Xiaosi; my English name is Ray, and friends in the community call me Siye. I am currently responsible for big data platform development at QingCloud. Today I will share with you the construction of big data infrastructure platforms on the cloud. The "big data" I mention below refers to the infrastructure layer, such as Spark and Hadoop, rather than the upper-layer applications.

I will cover four aspects: cloud computing and big data, the challenges of building a big data platform on the cloud, the big data infrastructure platforms themselves, and data formats.

1. Cloud Computing and Big Data

I believe most of you have dealt mainly with big data on physical machines. I originally did not want to keep talking about this topic, because in our view the development direction of big data is clearly the cloud and open source; it is only logical. Yet in actual implementation we still encounter some resistance. This is because quite a few people still think about big data in physical-machine terms and distrust cloud computing, blaming the cloud at the slightest sign of trouble, which is obviously wrong. What people doubt about big data on the cloud is mostly stability and performance. The good news is that more and more people have realized and recognized this direction of development, so I believe it is no longer a topic of debate.

Let us start from big data itself. When preparing a big data project, we first determine the requirements, and then comes platform selection. Platform selection is the hardest and most important step, and the one that confuses people most; the customers I have met almost all have misconceptions at this stage to varying degrees. That is entirely understandable, because there are simply too many things, and too many new things, to keep track of.

In fact, platform selection depends entirely on your needs: are you doing real-time or offline computation? Are you processing structured or unstructured data? Does your application have transactional requirements? Once these needs are identified, you just find the platform that fits, which requires understanding the characteristics of each platform. We know there is no platform that solves all problems: Spark is not strong at storage, and in many scenarios it is used together with Hadoop / HBase / object storage, let alone being able to replace a data warehouse.

Choosing a platform or tool is not about chasing fashion; what fits is what is right. Not everything has to be solved with Hadoop or Spark. For example, Redis provides HyperLogLog, an excellent data structure for counting distinct events: it uses at most 12 KB of memory regardless of the number of distinct elements, with an error below 1%. Using it to count things like daily unique visitors (UV) is a very good choice.
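To make the idea behind Redis's PFADD/PFCOUNT concrete, here is a toy HyperLogLog in plain Python. This is a sketch of the algorithm only, not Redis's implementation (Redis uses 16384 registers plus sparse encoding and bias correction to reach the sub-1% error; with 1024 registers the expected error here is around 3%):

```python
import hashlib

M = 1024                           # number of registers (2^10)
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard bias-correction constant

def _hash64(item: str) -> int:
    # deterministic 64-bit hash of the item
    return int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")

class HyperLogLog:
    """Toy HyperLogLog: estimates distinct count in fixed memory."""

    def __init__(self):
        self.registers = [0] * M

    def add(self, item: str) -> None:
        h = _hash64(item)
        idx = h & (M - 1)                # low 10 bits choose a register
        rest = h >> 10                   # remaining 54 bits
        rank = 55 - rest.bit_length()    # 1-based position of first 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # harmonic mean of per-register estimates
        return ALPHA * M * M / sum(2.0 ** -r for r in self.registers)
```

Note that adding the same element twice changes nothing, which is exactly why duplicates do not inflate a UV count.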

Each platform has its own specific usage scenarios. We should not only understand them; often we will even run a PoC or benchmark test for each candidate platform, and this is where cloud computing shows its advantage: you can run these tests quickly and at low cost.

Of course, these are not the only advantages of cloud computing. The big data era is full of uncertainty: not many people can say how much their data volume will grow in half a year. The elasticity of cloud computing solves this problem very well: add as many resources as you need, when you need them, and release surplus resources to other business, scaling up and down at will. All of this can be done in a few minutes with a few mouse clicks. You can even operate these platforms by calling an API; for example, when my program receives data, I start a Spark cluster to process it, and after processing I can shut the cluster down. You can also accomplish this with a timer or auto-scaling, which saves a great deal of cost.

Cloud computing is not only elastic and agile, but also very flexible: you can combine components at will to form different solutions. One thing we are working on now, for example, is switching the compute engine on top of the same data. We know that in big data, computation moves to where the data is: wherever the data sits, the computation goes there. Some users are more familiar with MapReduce, so they may start with Hadoop; after a while they want Spark. At that point we should not make users copy the data into a Spark cluster; instead we replace the MapReduce layer on top with Spark, while the data stays in the original HDFS. All of this lets us put our time and energy into the business level, instead of endlessly shuffling the big data platform around.

2. The Challenges of Building a Big Data Platform on the Cloud

As we can see, big data on the cloud can give us an unparalleled experience, but the key to big data on the cloud is not these conveniences; it is stability and performance, which are precisely the main points of doubt about big data on the cloud. Both depend on the capability of the IaaS layer. What is tested is whether your virtualization technology holds up under pressure without a kernel panic; we have never run into that problem, so I will not dwell on it.

Performance genuinely deserves more discussion. It breaks down into disk I/O performance and network performance. For disk performance, on a single node with the same configuration, a virtual machine will not match a physical machine, because virtualization always has some overhead. However, if your virtualization technology is good enough, the loss can be reduced to a very low level, and cloud computing solves large problems by scaling out, not by scaling up: where one node is not enough, add another. We have also now thought of a better way to address this and further improve disk performance.

As for network performance, in the physical world, especially across multiple nodes, performance will be poor if the network is not configured carefully. Our recently released SDN 2.0 solves this problem for big data: all communication between hosts goes over directly connected networks, regardless of which nodes are involved, and inter-node bandwidth can reach 8 Gb, close to the upper limit of the physical NIC. Moreover, the cost of 25 Gb NICs is getting closer and closer to that of 10 Gb NICs, so the network should not be a problem; the premise, of course, is that your SDN technology is good enough.

Let me add one more point about disk I/O. We know the default HDFS replication factor is 3, but on the cloud things are different: if you deploy your own Hadoop at a cloud service provider, you should not set the replication factor to 3. The design intent of Hadoop's third replica was rack-level fault tolerance: the third replica is placed on another rack. When you deploy in someone else's cloud you certainly do not know the rack information, so the third replica is meaningless. Moreover, any IaaS provider replicates resources at its own level, so the safety of your data is already guaranteed. You should therefore remove the third replica: removing it saves 1/3 of the space, and performance improves.

But you cannot reduce HDFS to a single replica just because IaaS keeps copies. The reason is that you still need HA at the business level: the IaaS replica only guarantees that data is not lost, and a physical-machine failover takes a few minutes. If HDFS has only one replica, the business will be affected during that switchover, so two replicas are a must. Even this is not the best solution, because two replicas at the business layer plus at least two at the IaaS layer add up to at least four, still a gap compared with the physical-machine scheme's three. So it is best to remove the bottom-layer replica, implement the physical-machine-world three-replica scheme on the cloud, and then add rack awareness; that is the same as a physical-machine deployment, but delivered to you as a cloud service. An IaaS provider is able to do this work, because the rack information is available to it.
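For the two-replica setup described above, the change is a single standard Hadoop property in `hdfs-site.xml` (a minimal sketch; check your distribution's defaults and any per-file overrides):

```xml
<!-- hdfs-site.xml: reduce replication from the default 3 to 2,
     since the IaaS layer already replicates the underlying volumes -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```

Note this only affects files written after the change; existing files keep their replication factor unless changed with `hdfs dfs -setrep`.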

3. Big Data Infrastructure Platforms


Next, let us look at the big data platforms and their characteristics. The data life cycle runs through several stages: collection, transmission, storage, analysis/computation, and presentation. The picture above lays out the currently popular tools and platforms at each of these stages.


First, computation, with Spark, Storm, MapReduce, and so on. They differ mainly in real-time versus offline computation, which in turn affects their throughput. MapReduce is the veteran big data compute engine; each map and reduce stage exchanges data through the hard disk, so it demands a lot of disk I/O and is slow, which makes it suitable for offline computation. This also makes anything built on MapReduce relatively slow, such as Hive.
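To make the map/shuffle/reduce data flow concrete, here is a toy, single-process word count in plain Python (purely illustrative; a real MapReduce job materializes the intermediate pairs to disk between stages, which is exactly where the I/O cost mentioned above comes from):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word, like a Mapper would
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # shuffle: sort and group pairs by key; a real framework writes
    # these intermediate pairs to disk, hence MapReduce's disk I/O cost
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    # reduce: sum the counts for each word
    return {word: sum(c for _, c in vals) for word, vals in grouped}

def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))
```

In a real cluster each phase runs in parallel across nodes, with the shuffle moving data between them over disk and network.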

Storm offers better real-time latency but relatively low throughput, so it suits real-time analytical computation on small data blocks. Twitter claims its Heron has lower latency and higher throughput than Storm and said it would be open-sourced at the end of last year, but I have not seen much news since; let us wait patiently.

Spark Streaming is better suited to near-real-time analysis and computation on larger data blocks. Spark is a computing system based on distributed memory; the official line is that it is up to 100 times faster than Hadoop MapReduce. In fact, Spark's core is the RDD computation model and globally optimized scheduling based on a DAG (directed acyclic graph), whereas MapReduce is a model focused on local computation; as a direct result, Spark is about 10 times faster than MapReduce even when running on disk. Spark is a platform well worth studying, and I believe you all know how good it is.

For SQL analysis there are now two main schools: MPP-based data warehouses and SQL-on-Hadoop.

Right now the latter seems to have the upper hand, one main reason being that the former used to require specific hardware. But MPP data warehouses still have a big market in traditional industries and are very popular there, because they offer things Hadoop does not, such as genuine support for standard SQL and for distributed transactions, and they integrate very well with the BI tools those industries already have. In addition, MPP data warehouses are moving closer to Hadoop: they support ordinary x86 servers and can use Hadoop storage underneath, as Apache HAWQ does. At the end of March QingCloud will provide an MPP data warehouse service, developed in cooperation with the author of HAWQ and R&D staff from Greenplum.

SQL-on-Hadoop options keep multiplying: Spark SQL, Hive, Phoenix, Kylin, and so on. Hive converts SQL into MapReduce jobs, so it is relatively slow; if you have requirements on execution speed you can try Spark SQL, which is also very easy to learn, and whose performance improved greatly in version 1.6.0. Anyone interested should try it.

There are also MPP-style analysis engines on Hadoop, such as Impala and Presto, which I will not introduce one by one. Note that some of these projects are still in the Apache incubator; be careful about using them in production. An amusing thing here is that everyone benchmarks against Hive and concludes they are many times faster than Hive, which is a given. We would rather see how these newer SQL engines compare with one another; stop always picking on Hive, or the little brother will be bullied.


Storage mainly means Hadoop/HDFS, HBase, object storage, and MPP data warehouses. Hadoop suits write-once, read-many scenarios with large files; it cannot handle masses of small files, which easily overwhelm the NameNode (if you must write small files, you can search online for the usual tips). HBase suits random read/write scenarios. It is a NoSQL distributed database: a sparse, distributed, persistent, multidimensional sorted map; understand each of those words and you understand what HBase is. Its underlying storage is still HDFS, but in analytical scenarios such as scanning data its performance is much worse than Hadoop's, by more than a factor of 8. HBase is strong at random read/write, Hadoop at analysis; there is now an Apache incubator project called Kudu that takes the "golden mean", aiming at both random read/write and analytical performance. One point to emphasize about HBase is data-model design: besides rowkey design, whose importance everyone knows, do not model with a traditional relational-database mindset; in the big data field, denormalize as much as possible.
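As a concrete illustration of rowkey design (a hypothetical monitoring table, not an example from the talk): two common patterns are salting the key so sequential writes spread across regions, and reversing the timestamp so newer rows sort first in a scan:

```python
import hashlib

NUM_SALT_BUCKETS = 16    # assumption: table pre-split into 16 regions
MAX_TS = 2 ** 63 - 1     # used to reverse the timestamp

def make_rowkey(device_id: str, epoch_ms: int) -> bytes:
    # salt: a bucket number derived from the device id, so writes for
    # many devices spread across regions instead of hammering one
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    # reversed timestamp: newer events sort lexicographically first,
    # so "latest N readings" is a cheap prefix scan
    reversed_ts = MAX_TS - epoch_ms
    return b"%02d|%s|%019d" % (salt, device_id.encode(), reversed_ts)
```

Rows for one device share a prefix, so a prefix scan returns its readings newest-first; this is also an example of denormalizing for the access pattern rather than normalizing the schema.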

For transmission, the mainstream today is Kafka and Flume. The latter has encryption built in; with Kafka, if you need encryption, you have to do it at the business layer. Kafka is a distributed, multi-replica, high-throughput, low-latency messaging system. It is the best choice for active stream-data processing, such as log analysis.


The image above is a screenshot of real-time Kafka monitoring from a real customer. You can see that the inflow and outflow curves overlap completely, which shows that its latency is very low (on the order of milliseconds).

But we must not abuse Kafka. I once met someone who wanted to use Kafka for distributed transactional business such as payments, but Kafka does not claim exactly-once message delivery; what it can guarantee is at-least-once. So it is not suitable for distributed transactional business by itself; the business layer needs to do some extra work.
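One common piece of that "extra work" is making the consumer idempotent, so that an at-least-once redelivery does not apply the same business operation twice. A minimal sketch (the message shape and field names are hypothetical; a real system would keep the seen-ID set in a persistent store, not in memory):

```python
class IdempotentConsumer:
    """Deduplicates at-least-once deliveries by a unique message ID."""

    def __init__(self):
        self.seen_ids = set()   # in production: a persistent store
        self.balance = 0        # toy business state

    def handle(self, message: dict) -> bool:
        # assumption: the producer attaches a unique "id" to each message
        msg_id = message["id"]
        if msg_id in self.seen_ids:
            return False        # duplicate delivery: ignore it
        self.seen_ids.add(msg_id)
        self.balance += message["amount"]
        return True
```

With this in place, Kafka may deliver a message twice, but the business effect is applied exactly once.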

4. Data Formats

The last point I want to emphasize is data formats. The importance of choosing the right data format for big data cannot be overstated. Choose wrongly and you waste a great deal of storage space; with big data volumes you cannot afford to double your space, and performance will drop dramatically because of the wrong format, or some computations will not be possible at all.

Remember two properties of a data format: splittable, and block-compressible. Splittable means that if a large file is cut in the middle, a parser can still interpret the two halves independently. XML, for example, cannot be split this way: it has open tags and close tags, and if you slice it in the middle, the XML parser is lost. CSV is different: each record is a single line, and every line stands on its own.

Block-compressible means each split block can be decompressed on its own. Remember that computation moves to where the data is, so each compute node analyzes its local block of data to achieve parallel computation. But with some compression formats, such as a gzip-compressed CSV file, a block can only be decompressed by starting from the first block, so true parallel computing is impossible.
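A quick way to see why a plain gzip stream is not block-compressible: decompression only works from the beginning of the stream, and fails if a would-be second worker starts in the middle (a small self-contained demonstration):

```python
import gzip
import zlib

# Compress ~100 "CSV" lines into one gzip stream
data = "".join(f"row-{i},value-{i}\n" for i in range(100)).encode()
compressed = gzip.compress(data)

# From the start of the stream, decompression succeeds
assert gzip.decompress(compressed) == data

# From the middle of the stream there is no valid gzip header, so a
# second worker cannot independently decompress its "split"
middle = compressed[len(compressed) // 2:]
try:
    gzip.decompress(middle)
    splittable = True
except (OSError, zlib.error, EOFError):
    splittable = False
```

Formats designed for big data (e.g. block-compressed SequenceFiles, or bzip2's independent blocks) avoid this by compressing in self-contained chunks with recognizable boundaries.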

5. Summary

Finally, to recap the points above: the cloud is the development direction for big data, and cloud computing is the best deployment environment for big data infrastructure platforms; a big data solution's platform should be chosen according to your needs; and the choice of data format is very important: as a rule, remember to choose a format that is splittable and block-compressible.