Kicking Off the Big Change (Part 2): Distributed Computing Frameworks and Big Data

Tags: distributed computing, Spark

The demands of large-scale data processing keep forcing us to expand computing capacity, and the need for cluster computing gave birth to distributed computing frameworks: with cheap cluster resources, jobs that used to take weeks or even months can now finish in a short time. As the saying goes, whoever masters massive data leads the demand. Over the past ten years, building on several decades of accumulation, we have seen the birth of MapReduce, of distributed file systems, and of the now dominant Spark. I do not know whether this is the end of the line for distributed computing frameworks; if there is a next generation, it will surely be driven by data at a scale hard to imagine today. It is also worth noting that Spark was born in AMPLab, a university research lab, rather than at an Internet business giant.

Immutable infrastructure

Useful links

  1. How to better use container technology to achieve immutable infrastructure

  2. Why Immutable Infrastructure?

We write a Spark program locally in Scala with Eclipse or IntelliJ IDEA and then need to test it. In the test environment we deploy the Spark software stack, paying close attention to the version of each piece of software. If someone else then wants to run this program in their own environment, and their software stack differs, the program is likely to fail. If we want to deploy and run the Spark program we developed on any machine without that effort, we can use Docker to package the operating system and software stack into an image. The image then becomes an immutable unit: on any machine, we only need to deploy this image together with our Spark application, and it will run successfully.
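As a sketch of that workflow (the base image, Spark version, and jar paths below are illustrative assumptions, not taken from the article), a Dockerfile can pin the entire stack so the application runs identically on every machine:

```dockerfile
# Illustrative sketch: pin the OS and exact JDK/Spark versions in one image
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y openjdk-7-jdk
# unpack a specific Spark distribution into the image
ADD spark-1.6.0-bin-hadoop2.6.tgz /opt/
ENV SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6
# bundle the application jar built from the IDE
COPY target/spark-helloworld.jar /app/spark-helloworld.jar
CMD ["/opt/spark-1.6.0-bin-hadoop2.6/bin/spark-submit", "--master", "local[*]", "/app/spark-helloworld.jar"]
```

Once the image is built, `docker run` behaves the same on any Docker host, which is exactly the immutable unit described above.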

The following figures come from the second link above: the first shows a traditional DevOps environment, while the second expresses immutable infrastructure.

[Figure: a traditional DevOps environment]

[Figure: immutable infrastructure]


Tachyon overview

PASA Big Data Lab, Nanjing University

Spark/Tachyon: a memory-based distributed storage system

Spark on YARN

  1. The whole process of building a Spark on YARN cluster (in 3 parts, for reference)

  2. Spark on YARN

  3. Spark on YARN cluster installation and deployment (recommended)

The following figure shows Spark on YARN in Alibaba's Yunti ("ladder") cluster:

[Figure: Spark on YARN in Alibaba's Yunti cluster]

"When Spark runs on YARN, the client first generates the job information and submits it to the ResourceManager. The ResourceManager assigns the AppMaster to a NodeManager that has reported in, and that NodeManager starts the SparkAppMaster. After the SparkAppMaster starts and initializes, it applies to the ResourceManager for resources. Once the resources are allocated, the SparkAppMaster asks the corresponding NodeManagers via RPC to start SparkExecutors. The SparkExecutors report to the SparkAppMaster and complete their tasks. In addition, the SparkClient obtains the job's running state from the AppMaster."

The information above comes from the article "In-depth analysis of Alibaba's Yunti YARN cluster", which is well worth reading.

1) Problems and fixes when configuring a Hadoop YARN cluster:

On every machine (master and slaves), append an export of the JAVA_HOME environment variable to the end of the two env files (depending on each machine's actual Java home). Then run:

cd ~/hadoop-2.7.1             # enter the Hadoop directory
bin/hadoop namenode -format   # format the namenode
sbin/start-dfs.sh             # start DFS
sbin/start-yarn.sh            # start YARN

After logging in to http://master:8088 I found the slave nodes in the UNHEALTHY state. To work around this, on each machine (master and slaves), modify the yarn-site.xml file and add the following (not recommended!):
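One commonly used workaround of this kind (assuming the UNHEALTHY state is triggered by the NodeManager's disk utilization check; this is a sketch, not necessarily the exact property the author used) raises the disk threshold in yarn-site.xml:

```xml
<!-- Illustrative workaround only: effectively disables the disk
     utilization health check that marks nodes UNHEALTHY. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>99.0</value>
</property>
```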


Then restart the cluster on the master:

sbin/start-dfs.sh             # start DFS
sbin/start-yarn.sh            # start YARN

You will find the nodes have returned to the healthy state.

2) Configuring Spark

Note that the value of SPARK_LOCAL_DIRS must be the same on the master and on every slave; that is, Spark must use the same path on each machine.
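For instance, the following line would go, unchanged, into conf/spark-env.sh on the master and every slave (the directory path is an illustrative assumption, not from the article):

```shell
# conf/spark-env.sh, identical on all machines
# /data/spark/local is an illustrative path; use your own
export SPARK_LOCAL_DIRS=/data/spark/local
```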

3) Currently, a Hadoop build compiled on RHEL 7.1 cannot run on SUSE.

4) Do not add a localhost entry to the various slaves files if you do not want the master machine to also become a worker and participate in cluster computing.
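A quick way to check and fix this is shown below on a throwaway copy of a slaves file (the file name slaves.demo and the host names are illustrative):

```shell
# create a demo slaves file that wrongly contains localhost
printf 'localhost\nslave1\nslave2\n' > slaves.demo
# drop the localhost line so the master does not also become a worker
grep -v '^localhost$' slaves.demo > slaves.demo.tmp && mv slaves.demo.tmp slaves.demo
cat slaves.demo   # now lists only the real workers
```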

Compiling Hadoop

Recommended reading


[1] Install Maven
How do I install Maven with Yum?

[2] Install Java and set the environment variables

As root, edit /etc/profile (or ~/.bashrc) with vi and append the following three lines at the end, adjusted to your own Java path. After adding them, source the file so they take effect.
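The three lines are typically of the following form (the JAVA_HOME path is an illustrative assumption; substitute your own):

```shell
# appended to /etc/profile or ~/.bashrc; run `source` on the file afterwards
export JAVA_HOME=/opt/ibm/java        # illustrative path, use your own
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$JAVA_HOME/bin:$PATH
```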

[3] Install Google protobuf
  A. Building ProtoBuf (recommended)
  B. Installing and using protobuf on CentOS 6.4 (for reference)

Check whether the installation succeeded with protoc --version.

[4] Download the source code and compile

Compile command:

mvn clean package -Pdist,native -DskipTests -Dtar

Hadoop compilation errors

I compiled Hadoop in an IBM Java environment. Below are the errors I encountered during compilation and their solutions, for your reference.


1) Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.6:run (create-testdirs)

This is usually a file-permission problem: make the source tree owned by the building user (e.g. chown -R root ./), then run:

mvn install -DskipTests

2) With the IBM JVM, the build failed on TestSecureLogins with a "package does not exist" error.

Apply the patch written specifically for the IBM Java environment.

3) If, after the above two fixes, the build quickly reports BUILD SUCCESS but there is no tar package named hadoop-2.7.1.tar.gz under hadoop-release-2.7.1/hadoop-dist/target/ (assuming the downloaded source folder is named hadoop-release-2.7.1), the compilation did not actually succeed. Return to the hadoop-release-2.7.1 root directory and continue with:

mvn package -Pdist -DskipTests -Dtar


The compilation takes quite a while; enjoy the suspense :)

Errors running SparkPi on a YARN cluster

In yarn-cluster mode:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-xxx:blk_1073741947_1123
ERROR: Bad response ERROR_CHECKSUM for block BP-xxx:blk_1073741947_1123 from datanode xxxxx:50010

Exception in thread "main": All datanodes xxxxx:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(...)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(...)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(...)

The above error is caused by the endianness (byte order) of the IBM machines; it needs a patch. Alternatively, it can be solved with a heterogeneous cluster (introducing x86 machines); however, if you want the big-endian machines themselves to act as workers, the patch is still required.

A successful run looks like this:

[Figure: output of a successful run]

Spark standalone single-node runtime error

If the memory allocated to the Spark driver, SPARK_DRIVER_MEMORY (set in conf/spark-env.sh), is too small, say only 1g, the following error can occur; after I raised it to 20g the error went away. When driver memory is insufficient, GC does a great deal of cleanup work, which not only consumes a lot of CPU but can also make the run fail.
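Concretely, the fix can be expressed in conf/spark-env.sh, or with the equivalent spark-submit flag (the 20g value is the one from the text; the application names are placeholders):

```shell
# conf/spark-env.sh: give the driver a large enough heap
export SPARK_DRIVER_MEMORY=20g

# or, equivalently, per submission:
# spark-submit --driver-memory 20g --class MyApp my-app.jar
```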

16/01/24 09:59:36 WARN spark.HeartbeatReceiver: Removing executor driver with no recent heartbeats: 436450 ms exceeds timeout 120000 ms
16/01/24 09:59:40 WARN akka.AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3570311d,BlockManagerId(driver, localhost, 54709))] in 1 attempts
16/01/24 09:59:40 ERROR scheduler.TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 436450 ms
Exception in thread "main" 16/01/24 06:49 INFO storage.BlockManagerMaster: Trying to register BlockManager
org.apache.spark.SparkException: Job aborted due to stage failure: most recent failure: Lost task in stage (2009, localhost): ExecutorLostFailure (executor driver lost)


Memory consumption mainly comes down to:

  1. The size of the objects in your dataset
  2. The memory cost of accessing those objects
  3. The overhead of garbage collection (GC)



Designing and Implementing ScalienDB: a Distributed Database Using Paxos


spark-submit --class Spark-HelloWorld --master yarn-cluster --jar Spark-HelloWorld.jar

