
Setting Off the Big Change (Part 2): Frameworks for Distributed Computing and Big Data

Tags: distributed computing, Spark

The need to process massive data keeps pushing us to expand our computing capacity, and this demand for cluster computing gave birth to distributed computing frameworks: with cheap cluster resources, jobs that used to take weeks or even months can now finish in a short time. It has been said that whoever masters the huge data leads the demand. Over the past ten years, building on decades of accumulated work, we saw the birth of MapReduce, of distributed file systems, and of Spark, today's dominant framework. I do not know whether this is the end of the line for distributed computing frameworks; if a next-generation framework appears, it will have to come from data of an even larger scale, of a magnitude I cannot imagine today. Study what exists now and keep going deeper; that is why the birthplace of Spark is AMPLab rather than an Internet business giant.


Immutable infrastructure

Useful links

  1. How to better use container technology to achieve immutable infrastructure

  2. Why immutable infrastructure?

We write Spark programs locally in Scala with Eclipse or IntelliJ IDEA and then need to test them. In the test environment we deploy the Spark software stack and pay special attention to the version of every piece of software. If someone wants to run the program in his own environment, and his software stack is different, the program is likely to fail. If we want to deploy and run the Spark programs we developed on any machine without spending that effort, we can use Docker to package the operating system and the software stack into an image. The image becomes the immutable unit: on any machine we only need to deploy this image together with the Spark application we developed, and it runs successfully.
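A rough sketch of that workflow (the image name, application jar and paths are illustrative, and it assumes the image's Dockerfile installs the JDK, the tested Spark version and puts spark-submit on the PATH):

# build an image that bundles the OS, the Spark stack and our application jar
docker build -t my-spark-app:1.0 .

# any machine with Docker now runs the identical software stack
docker run --rm my-spark-app:1.0 \
  spark-submit --class com.example.HelloWorld /app/spark-helloworld.jar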

The figures below come from the second link above; the second one shows immutable infrastructure, while the first shows a traditional DevOps environment.

[Figure 1: a traditional DevOps environment]

[Figure 2: immutable infrastructure]


Tachyon

A brief introduction to Tachyon

PASA Big Data Lab, Nanjing University

Spark/Tachyon: a memory-based distributed storage system


Spark on YARN

  1. The whole process of building a Spark on YARN cluster (3 machines, for reference)

  2. Spark on YARN

  3. Spark on YARN cluster installation and deployment (recommended)


The following Spark on YARN diagram comes from Alibaba's Yunti ("ladder") cluster:

[Figure: Spark on YARN job flow on the Alibaba Yunti cluster]

"Spark operation based on the yarn first by the client generated job information, submitted to the ResourceManager. ResourceManager in a NodeManager report to AppMaster assigned to NodeManager, starting SparkAppMaster NodeManager, initialization operation after the start SparkAppMaster, and then apply for resources from the ResourceManager, apply to the corresponding resources SparkAppMaster via RPC to NodeManager start corresponding SparkExecutor, SparkExecutor report to the SparkAppMaster and complete the corresponding task. In addition, SparkClient will get the job running state by AppMaster."

The quote above comes from "In-depth analysis of the Alibaba Yunti YARN cluster", which is a good read.
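As a minimal sketch of how a job enters this flow (the example class, jar path and resource sizes are illustrative), a submission in yarn-cluster mode looks like this:

# submitted from the client; the ResourceManager then starts the SparkAppMaster,
# which requests containers and launches the SparkExecutors as described above
spark-submit \
  --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 4 \
  --executor-memory 2g \
  lib/spark-examples-1.6.0-hadoop2.7.1.jar 100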


1) Problems and fixes when configuring a Hadoop YARN cluster:

On every machine (master and slaves), add the export of JAVA_HOME at the end of the two files hadoop-env.sh and yarn-env.sh, using the Java home of that particular machine.
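For example (the JDK path is illustrative; use the JDK actually installed on that machine):

# appended to both hadoop-env.sh and yarn-env.sh
export JAVA_HOME=/usr/java/jdk1.7.0_67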
Afterwards, run:

cd ~/hadoop-2.7.1               # enter the Hadoop directory
bin/hadoop namenode -format     # format the namenode
sbin/start-dfs.sh               # start HDFS
sbin/start-yarn.sh              # start YARN

After logging in to http://master:8088 the slave nodes turned out to be in the unhealthy state. To work around this, on every machine (master and slaves) modify yarn-site.xml and add the following (not recommended!):

<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>false</value>
</property>

Then stop the cluster with stop-all.sh on the master and restart it:

sbin/start-dfs.sh               # start HDFS
sbin/start-yarn.sh              # start YARN

The slave nodes should now be back to normal.


2) Configuring Spark's spark-env.sh

Note that the value of SPARK_LOCAL_DIRS should be the same on the master and on every slave; in other words, Spark should use the same path on every machine.
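For example, a line like the following (the path is illustrative) should appear identically in conf/spark-env.sh on every node:

# local scratch space for shuffle and spill files; the directory must exist on every machine
export SPARK_LOCAL_DIRS=/data/spark/local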


3) At the moment, a Hadoop build compiled on RHEL 7.1 cannot be run on SUSE.


4) Do not add the localhost entry to the various slaves files if you do not want the master machine to also become a worker and take part in the cluster's computation.
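For example, a slaves file would then contain only the worker machines (hostnames are illustrative):

# conf/slaves (and Hadoop's slaves file): worker hostnames only, no localhost,
# so the master does not also join the cluster as a worker
slave1
slave2
slave3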


Compiling Hadoop

Recommended reading

Building Apache Hadoop from source


[1] Install Maven
How do I install Maven with Yum?
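If the distribution's repositories carry a Maven package (an assumption; otherwise install a binary release from maven.apache.org and put it on the PATH), the whole step is:

yum install -y maven
mvn -version     # verify the installation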

[2] Install Java and set the environment variables

As the root user, open /etc/profile (or ~/.bashrc) with vi
and append the following three lines, adjusted to your own JDK path:
#jdk
export JAVA_HOME=/usr/java/jdk1.7.0_67
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$JAVA_HOME/bin:$JAVA_HOME/jre/bin:$PATH:$HOME/bin
After saving, apply the changes with
source /etc/profile
or
source ~/.bashrc

[3] Install Google protobuf
A. Building ProtoBuf (recommended)
B. Installing and using protobuf on CentOS 6.4 (reference)

Check whether the installation succeeded with protoc --version
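A minimal build-from-source sketch, assuming the protobuf 2.5.0 tarball (the version Hadoop 2.x expects) has already been downloaded:

tar xzf protobuf-2.5.0.tar.gz && cd protobuf-2.5.0
./configure
make && make install      # may need root privileges
protoc --version          # should report libprotoc 2.5.0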

[4] Download the source code and compile it

The build command:

mvn clean package -Pdist,native -DskipTests -Dtar

Hadoop build errors

I compiled Hadoop in an IBM Java environment. The errors hit during the build and their fixes are listed below for your reference.

1) Antrun

Failed to execute goal
org.apache.maven.plugins:maven-antrun-plugin:1.6:run (create-testdirs)

http://stackoverflow.com/questions/17126213/building-hadoop-with-maven-failed-to-execute-goal-org-apache-maven-pluginsma

chown -R username parent-directory
(e.g. chown -R root ./)
mvn install -DskipTests

2) Build failure on TestSecureLogins with the IBM Java JVM

package com.sun.security.auth.module does not exist

https://issues.apache.org/jira/browse/HADOOP-11783

This is a patch specifically for the IBM Java environment.


3) If, after the two fixes above, the build reports success suspiciously quickly, but there is no tar package named hadoop-2.7.1.tar.gz under hadoop-release-2.7.1/hadoop-dist/target/ (assuming the downloaded source folder is named hadoop-release-2.7.1), then the build did not actually succeed. Go back to the root directory hadoop-release-2.7.1 and continue with:

mvn package -Pdist -DskipTests -Dtar

http://www.iteblog.com/archives/897

The build takes quite a while; enjoy the thrilling wait :)


Error running SparkPi on the YARN cluster

In yarn-cluster mode:

WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-xxx:blk_1073741947_1123
java.io.IOException: Bad response ERROR_CHECKSUM for block BP-xxx:blk_1073741947_1123 from datanode xxxxx:50010

Exception in thread "main" java.io.IOException: All datanodes xxxxx:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1004)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:548)

The error above is due to a size-related issue on the IBM mainframe and needs a patch.
It can also be worked around with a heterogeneous cluster (introducing x86 machines); however, if you want the mainframe itself to act as a worker, the patch is still required.


A successful run looks like this:

[Figure: console output of a successful run]


Runtime error with Spark standalone on a single node (master)

If the memory assigned to the Spark driver, SPARK_DRIVER_MEMORY (set in spark-env.sh), is too small, for example only 1 GB, the error below may appear; after I raised it to 20 GB the error went away. When the memory is not enough, GC has to do a great deal of cleanup work, which not only burns a lot of CPU but can also make the run fail.

16/01/24 09:59:36 WARN spark.HeartbeatReceiver: Removing executor driver with no recent heartbeats: 436450 ms exceeds timeout 120000 ms
16/01/24 09:59:40 WARN akka.AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@3570311d,BlockManagerId(driver, localhost, 54709))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
16/01/24 09:59:40 ERROR scheduler.TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 436450 ms
Exception in thread "main" 16/01/24 06:49 INFO storage.BlockManagerMaster: Trying to register BlockManager
org.apache.spark.SparkException: Job aborted due to stage failure: Task ... in stage ... failed ..., most recent failure: Lost task ... in stage ... (TID 2009, localhost): ExecutorLostFailure (executor driver lost)
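For reference, the driver memory setting mentioned above lives in conf/spark-env.sh; the 20 GB value is simply what worked for this job here, not a general recommendation:

# conf/spark-env.sh: memory for the driver JVM
export SPARK_DRIVER_MEMORY=20g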

Spark's memory consumption falls mainly into three parts (depending on what your application needs):

  1. the size of the objects in your datasets
  2. the memory cost of accessing those objects
  3. the overhead of garbage collection (GC)

When, because of the network or of GC, the worker or executor receives no heartbeat feedback for the running tasks, executors and tasks get lost. In that case raise the value of spark.network.timeout, for example to 300 s (5 min) or higher depending on the situation.
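The timeout can be raised in conf/spark-defaults.conf or per submission; a small sketch using the 300 s value suggested above:

# conf/spark-defaults.conf
spark.network.timeout    300s

# or equivalently on the command line for a single job
spark-submit --conf spark.network.timeout=300s ...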


Error running in yarn-cluster mode

ScalienDB: Designing and Implementing a Distributed Database using Paxos

The cause was that the program being run needs third-party jars which had not been included.
Solution:
Use --jars and --files to add the third-party jar packages the application depends on, for example:

spark-submit --class Spark-HelloWorld --master yarn-cluster --jars <third-party jars> Spark-HelloWorld.jar

At the same time, check the resource allocation parameters, so that the run does not fail simply because not enough resources were allocated.
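For the resource side, these are the usual spark-submit knobs to check; the values below are purely illustrative and must fit what the YARN queue can actually grant:

spark-submit --master yarn-cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --num-executors 8 \
  --executor-cores 2 \
  ...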
