Subscription programmers magazine RSS CSDN home> Programmers magazine

Spark: big data, things that vanishes in a flash"

Published in15:42 2013-07-08| Time reading| source"Programmer"| ZeroArticle comments| authorWu Gansha

Abstract:Spark is originated in the United States, University of California at Berkeley, AMPLab cluster computing platform. It is based in memory computing, starting from the iterative multi batch processing, data warehouse eclectic, flow processing and chart calculation and so on many kinds of computing paradigm is rare all-around player.
Spark has formally applied to join the Apache incubator, from the laboratory of brainwave "sparks" growth of big data technology platform for the emergence of new cutting edge. This paper mainly describes the design idea of Spark. Spark as the name suggests, to show the big data not common "things that vanishes in a flash". The specific characteristics are summarized as "light, fast, spirit and skill".

  • light: Spark 0.6 core code has 20 thousand lines, Hadoop 1 for the 90 thousand line, 2 for the 220 thousand line. On the one hand, thanks to the Scala language of the concise and rich expression; on the other hand, Spark is a good use of Hadoop and Mesos (Berkeley Another entry incubator project, the main cluster of dynamic resource management infrastructure. Although very light, but not in the design of fault tolerance on the discount. Creator Matei claimed: "does not make mistakes when exception handling." Speech Fault tolerance is a part of the infrastructure.
  • fastThe spark of small data set can achieve sub second delay, which for Hadoop MapReduce (hereinafter referred to as the MapReduce) is unimaginable (due to heartbeat interval mechanism, the only task start is tens of seconds of delay). For large data sets, the typical iterative machine Learning, ad hoc queries (ad-hoc query, graph computing, spark version is based on the fast implementation of MapReduce, hive and pregel 10 times to 100 times. Memory computing, data locality (locality) and transmission optimization and scheduling optimization of the top power, with the beginning of the design of light weight uphold the concept of relationship.
  • spiritSpark provides flexibility at different levels. At the implementation level, it is perfect interpretation of the scala trait dynamic mixed (mixin) strategies (such as replaceable cluster scheduler, serialization Library); in the source (primitive) layer, which allows new data operator expansion A variety of operator (), new data sources (such as HDFS support dynamodb), new language bindings (Java and python); in the paradigm layer, spark support memory calculation and iterative multi batch processing, ad hoc queries and stream processing and chart calculation etc. Paradigm.
  • QiaoOn occasion: clever and leveraging. Spark by the potential of Hadoop, and Hadoop seamless combination; then Shark (Spark on the data warehouse to achieve) by the potential of Hive; figure calculated by Using Pregel and API's PowerGraph and PowerGraph's point division idea. Everything, all with the help of the Scala (widely known as the future of Java instead of The potential: Spark programming Look'n'Feel is the original Scala, whether it is grammar or API. In the implementation, and leveraging smart. To support interactive programming, Scala just Shell small do modify the Spark (in contrast, Microsoft to support JavaScript Console on MapReduce interactive programming, not only to thinking barrier across the Java and JavaScript in the implementation but also the fuss).

Say a lot of benefits, or to point out that Spark is not perfect. It has inherent limitations, can not well support fine-grained, asynchronous data processing; also has the reason of the day after tomorrow, even great genes, after all is still in its infancy, there are a lot of space for scalable performance, stability, and paradigm in.

Computing paradigm and abstraction

Spark first is a coarse-grained data parallel (parallel data) computing paradigm.

Data parallel and task parallel (parallel task) the difference is reflected in the following two aspects.

  • The subject of calculation is a collection of data, rather than the individual data. The length of the set depends on the implementation, such as SIMD (single instruction multiple data) vector instruction is generally 4 to 64, SIMT GPU (single instruction multiple threads) general Is 32, SPMD (single program multiple data) can be more wide. Spark processing is big data, so the use of a very coarse grain size, called Distributed Resilient Datasets (RDD).
  • All the data in the set are the same operator sequence. Data parallel programming is good, easy to obtain high parallelism (with data size related, rather than parallel with the logic of the program), but also easy to map in the end Parallel or distributed hardware. The traditional array/vector programming language, intrinsics CUDA/OpenCL, SSE/AVX, Ct (for C++ Throughput), all belong to this category. The difference is that Spark's view is the entire cluster, rather than a single node or parallel processor.

Data parallel paradigm determines the Spark can not be perfect to support fine-grained, asynchronous update operation. Figure calculation has such operations, so at this time Spark is not as good as GraphLab (a large-scale graph computing framework); there are some applications, Requires fine-grained log update and data checkpoints, and it is also not as good as RAMCloud (Standford's memory storage and computing research project) and Percolator (Google incremental computing technology). And that, in turn, the spark to carefully cultivated it at the application field, trying to size sweep the deck of a Dryad (Microsoft's early big data platform) but not very successfully.

RDD Spark, using the Scala collection type of programming style. It also uses the functional semantics (semantics functional): one is the closure, the two is the RDD can not be modified. Logically, every RDD operator generates a new RDD, which has no side effects, so the operator is called deterministic. Have the operator is idempotent, error occurred when the operator sequence can be re executed.

Spark is the calculation of the data stream, but also with a working set (set working) of the data stream. Stream processing is a data flow model, MapReduce is also, the difference is that MapReduce needs to maintain a working set in a number of iterations. The abstract work set is very common, such as Iterative machine learning, interactive data mining and graph computing. To ensure fault tolerance, MapReduce uses a stable storage (such as HDFS) to carry the work set, the price is slow. HaLoop using cycle Sensitive scheduler, to ensure that the previous iterative reduce output and the iterative map input data set on the same physical machine, which can reduce the network overhead, but is unable to avoid disk I / O bottleneck.

Spark's breakthrough is that, in the premise of ensuring fault tolerance, using memory to load the working set. The memory access speed is faster than the number of disks, which can greatly enhance the performance of the. The key is to achieve fault tolerance, traditionally there are two ways: day Records and inspection points. Taking into account the data redundancy and network communication overhead, Spark uses log data update. Fine grained log update is not cheap, and in front of said, Spark is not good at. Spark record is coarse grained RDD update, so the overhead can be ignored. In view of functional semantics and idempotent properties of Spark, tolerant by replaying the log update, there will be no side effects.

Programming model

A code: operator textFile from HDFS read log file and return to "file" (RDD); filter operator to sieve out with "error", assigned to the "errors" (RDD); cache operator to cache it down to prepare for future use; the count operator returns the number of rows of "errors". RDD looks like the Scala collection type There is not much difference, but their data and operational models are quite different.

Figure 1 shows the RDD data model, and this will be used in the four to four types of operator mapping operator. Spark programs work in two space: RDD Spark space and Scala primary data space. In the original data space, data for the scalar quantity, i.e. basic types of Scala, orange squares said), collection types (blue dotted line (box) and persistent storage (red cylinder).

Figure two switching of 1 spaces, four different types of RDD operators

Input operator (orange arrow) will Scala collections data type or storage in RDD space inhalation and turn into RDD (solid blue box). Input operator input can be divided into two categories: a class for Scala collection types, such as parallelize; another in data storage, such as in the preceding example textFile. The output of the input operator is the RDD space of the Spark.

Because of the function semantics, RDD transforms (transformation) operator (blue arrow) to generate a new RDD. Both the input and output of the transform operator are RDD. RDD will be divided into a lot of partitions. (partition) distribution to multiple nodes in the cluster, figure 1 represents the blue square partition. Note that partitioning is a logical concept, and the old and new partitions before and after the transformation may be physically the same block of memory or memory. Storage. It is very important to optimize the memory requirements in order to prevent the functional type of invariance resulting in an unlimited expansion of the memory. Some RDD is the intermediate result of the calculation, and its partition does not necessarily correspond to the corresponding memory or storage, if required. (such as in the future use), you can call the cache operator (CACHE operator in the example, the gray arrow) will partition the physical (materialize) down (gray box).

A part of the transformation operator as a simple element of the RDD elements, is divided into the following categories:

  • Input and output one to one (element-wise) of the operator, and the results of the partition of the structure of the RDD is the same, mainly map, flatMap (map after flattening for one dimensional RDD);
  • Input and output one to one, but the results of the RDD partition structure has changed, such as the Union (two RDD), coalesce (partition);
  • Select some elements from the input operator, such as filter, distinct (removal of redundant elements), subtract (the RDD has, it RDD no elements to stay) and sample (sampling).

Another part of the transform operator for the Key-Value collection, and divided into:

  • For a single element-wise to do RDD operations, such as mapValues (to maintain the source RDD partition method, which is different from map);
  • For a single RDD rearrangement, such as partitionBy, sort (to achieve consistency of the partition, the local optimization of the data is very important, followed by);
  • RDD based on a single key for restructuring and reduce, such as groupByKey, reduceByKey;
  • RDD based on the two join for key and restructuring, such as join, cogroup.

After the three types of operations are involved in the rearrangement, known as the shuffle class operation.

From RDD to RDD transform operator sequence, has been in the RDD space. Here is a very important design is evaluation lazy: the calculation does not actually happen, just keep recording to the metadata. The metadata structure is DAG (directed acyclic graph), each of which "vertex" is RDD (including the production of RDD Operator), from the parent RDD to the sub RDD has the "edge", which indicates the dependence between RDD. Spark took a cool name to DAG Lineage (lineage) metadata. This Lineage is also in the front of the fault tolerant design of the log update.

Lineage has been growing, until the event of action (action) operator (the green arrow in Figure 1), then it is to evaluate, the cumulative sum of all the operator one-time execution. The input of the action operator is RDD (and all RDD dependent on the Lineage), and the output is executed after the RDD. Into the original data, may be Scala scalar, collection type of data or storage. When the output of an operator is the above type, the operator must be an operator, and the effect is to return the original data from the RDD space. Space.

Action operators include the following: generate scalar, such as count (returns a count of the number of elements in the RDD), reduce, fold/aggregate (see Scala name operator document); return several scalar, such as take return the first few elements; generated Scala collection types, such as collect (the RDD in all the elements into the Scala collection type), lookup (search for all the values corresponding to key); write storage, such as the textFile corresponding to the previous saveAsText-File. There's a check. Check operator checkpoint. When Lineage is particularly long (which often occurs in the calculation of the graph), when an error occurs to re execute the entire sequence for a long time, you can take the initiative to call the current checkpoint data is written to the stable storage, as a checkpoint.

Here are two design points. The first is evaluation lazy. Familiar with the compilation of all know, the compiler can see the greater the scope, the more the opportunity to optimize. Although there is no Spark compiler, but the scheduler actually do to DAG linear. Optimization of hybrid. Especially when the spark above can be calculated with a variety of mixed paradigm, the scheduler can break of different code paradigm boundaries are global scheduling and optimization. The following example of the SQL Shark code And Spark machine learning code mixed together. After the translation of each part of the code layer RDD, the integration into a large DAG, so you can get more opportunities for global optimization.

The other point is that once the operator generates the primary data, it is necessary to exit the RDD space. Because the current Spark is only able to track the calculation of RDD, the calculation of the original data is not visible to it (unless later Spark will provide the native data type operation of the heavy load, wrapper or conversion implicit. This partially invisible code may introduce dependencies between the front and back RDD, such as the following code:

The third row filter for errors.count () dependence cnt-1 the native data operations generated by, but the scheduler can not see the operation, it will.

Because the Spark does not provide flow control, in the calculation of logic needed conditional branches, must also go back to Scala space. Because Scala language support for custom control flow is very strong, do not rule out the future Spark will also support.

Spark also has two very useful functions. One is the broadcast (broadcast) variable. Some data, such as lookup, may be used repeatedly in multiple tasks; the data is much smaller than RDD, and no Should be divided between nodes like RDD. The solution is to provide a new language structure, broadcast variable, to modify the data. Spark runtime to the broadcast of the content of the modified broadcast to each node, and To survive, the future and without sending. Compared to distributed's cache Hadoop, the broadcast content can be shared across jobs. Spark Mosharaf from P2P who submitted the old master Ion Stoica, using the BitTorrent (yes, that is, to download the movie that BT) simplified implementation. Interested readers can refer to the paper SIGCOMM'11 Orchestra. Another feature is Accumulator (from MapReduce counter): allow the Spark code to add a number of global variables to do Bookkeeping, such as the record of the current operating indicators.

Operation and scheduling

Figure 2 shows the running scene of the Spark program. It consists of two phases: the first phase: the first phase of the transformation operator sequence, incremental construction of DAG map; the second stage by the operator touch, DAG DAGScheduler map into the task and the task set. Spark supports local single node running (development and debugging useful) or cluster operation. For the latter, the client runs on Master nodes, through the manager Cluster to partition a good partition of the task set to the worker/slave node to send to the cluster.

Figure 2 Spark program running process

Spark traditional and Mesos "Jiao", also can support EC2 Amazon and YARN. The base bottom scheduler is a trait, it can be mixed with the actual implementation is different. For example, on the Mesos two scheduler, a all each node Resources are distributed to Spark, and the other allows the Spark job to be scheduled and shared with other jobs. Worker node on the task thread (thread task) to really run the task of DAGScheduler generated; and block Manager (block) Manager (block) is responsible for the manager master master communication (the perfect use of the Actor Scala mode), to provide data blocks for the task.

The most interesting part is DAGScheduler. Following its detailed working process. RDD data structure is very important in a domain is dependent on the parent RDD. As shown in Figure 3, there are two types of dependencies: narrow (Narrow) dependence and wide (Wide) dependence.

Figure 3 narrow dependence and wide dependence

Narrow dependency refers to the parent RDD of each partition is used by a sub RDD partition, which is shown as a parent RDD partition corresponds to a sub RDD partition, and the partition of the two parent RDD corresponds to a sub RDD Partition. In Figure 3, map/filter and union belong to the first category, and the input is divided into second categories (join) of the co-partitioned belong to the category.

The partition of a wide dependency RDD is dependent on all partitions of the parent shuffle, which is because of the RDD class operation, as shown in Figure 3 in the groupByKey and the join.

Narrow dependence on optimization is very favorable. Logically, each RDD operator is a fork/join (this join not the above join operator, but refers to the synchronization of multiple parallel tasks barrier): The calculation fork to each partition, after the end of join, and then fork/join under a RDD operator. If directly translated to physical implementation, it is not economical: one is every RDD (even if the Is the middle of the results) need to be materialized into memory or storage, time-consuming space; the two is the join as a global barrier, is very expensive, will be the slowest of the node drag dead. If the partition of sub RDD to The partition of the parent RDD is narrow, it can be the implementation of the classic fusion optimization, the two fork/join into one; if the continuous transformation of the operator sequence are narrow, you can put a lot of Fork/join and for one, not only reduces the amount of global barrier, but also does not need to materialize a lot of intermediate results RDD, which will greatly enhance the performance. Spark call this assembly line (pipeline) optimization.

Transformation operator sequence a touch shuffle class operation, the wide dependence occurs, the optimization of the termination of the pipeline. In the specific implementation, DAGScheduler from the current operator to go back to the dependence graph, a touch of wide, it generates a stage to accommodate the traversal of the operator sequence. In this stage, it is safe to practice. Pipeline optimization. And then, from that wide dependence began to continue to go back, to generate the next stage.

To study two problems: first, how to divide the partition; two, the cluster partition in which node. This exactly corresponds to the other two domains in the RDD structure: the partition partition (partitioner) and the preferred location (preferred). Locations).

Partitioning partitioning is critical for shuffle operations, which determines the dependency between the parent RDD and child RDD of the operation. As mentioned above, the same join operator, if the cooperative division of the words, the two father Between RDD, the father of RDD and sub RDD can form a consistent partition arrangement, that is, with a key guarantee to be mapped to the same partition, so you can form a narrow dependence. On the contrary, if there is no synergistic division, resulting in wide Depend on.

The so-called collaborative division, is to specify the partition to produce a consistent partition of the partition. Pregel and HaLoop put this as a part of the system; while Spark provides two types of partitions by default: HashPartitioner and RangePartitioner, allowing the program to be specified by the partitionBy operator. Note that HashPartitioner can play a role in the requirement that the hashCode key is valid, that is, the same content of key produces the same hashCode. For this String is set up, but the array is not set up (because the array of hashCode is determined by its identity, rather than content, generation). In this case, the Spark allows the user to customize the ArrayHashPartitioner.

The second problem is the partition of the node, which is related to the data locality: good local, network communication is less. Some RDD generated when there is a preferred location, such as the HadoopRDD partition is the preferred location of the HDFS block where the node. Some RDD or partitions are cached, and that calculation should be sent to the node where the cache partition is located. No However, on the back of the lineage RDD has been found to have the preferred location attributes of the parent RDD, and accordingly determine the sub RDD placement.

The concept of wide / narrow dependence is not only used in scheduling, but also useful for fault tolerance. If a node downtime, and operation is narrow reliance on, that as long as the lost father RDD partition recalculation can, with other nodes does not depend on. The wide dependency requires all partitions of the parent RDD to exist, Recalculation is very expensive. So if you use the checkpoint operator to do the inspection point, not only to consider whether the lineage is long enough, but also to consider whether there is a wide dependence on the width of the plus checkpoint is the most Some value.


Because space is limited, this paper can only introduce the basic concept of Spark and design ideas, content from the Spark of several papers (with Resilient "Distributed" NSDI'12 A Fault-Tolerant Datasets: Abstraction for In-Memory Cluster Computing "as the main", but also my colleagues and I study the experience of Spark, as well as many years engaged in parallel / distributed systems research insights. Spark core members of /Shark cast Xin Shi Thank you for your review and revision of this article!

Spark stands on a high starting point and has a noble goal, but its journey is just beginning. Spark is committed to building an open ecosystem. Https://, is willing to work together with everyone!

Author Wu Gansha, Intel China Research Institute chief engineer. His main research interests include networking, big data, facing the massive data processing of distributed embedded systems and emerging applications, using the service model and provide a supporting software environment.

step on

Recent activity


2015 China big data technology conference

In order to better help enterprises in-depth understanding of large data with the latest technology at home and abroad, master more industry data of practical experience, to further promote the big data technology innovation, industry application and talent cultivation, 10-12 December 2015, sponsored by the China Computer Federation (CCF), CO CCF large numbers according to the Expert Committee of the contractor, Chinese Academy of Sciences Institute of computing technology, and DNT branch in Beijing Science and Technology Co., Ltd. and CSDN 2015 China big data technology conference (big data technology conference 2015, BDTC 2015) will Yunnan new crown holiday hotel in Beijing held a grand.

Microblogging attention

Programmer mobile terminal subscription Download