Dropbox's Shao Zheng: My Take on Hadoop Summit 2015 and Spark Summit 2015

Published: 2015-07-01 10:23 | Source: CSDN | Author: Shao Zheng

Abstract: Shao Zheng, an R&D manager at Dropbox, recently attended both Hadoop Summit 2015 and Spark Summit 2015. He compares the two well-known conferences in depth in terms of scale, trends, and technology, and offers key points to watch along with suggestions for learning.

[Editor's note] Shao Zheng, an R&D manager at Dropbox, was program chair of our 2014 China Big Data Technology Conference. Two important technical conferences were recently held in the United States, Hadoop Summit 2015 and Spark Summit 2015, and groups from China's technical community attended them. We published our own observations in issue 7A of the electronic journal "Programmer". Few people, however, attended both conferences. We were still thinking about this yesterday while discussing a comparison of the Spark and Docker summits. It seems many readers share the same questions, and we happened to see this post on Zhihu:

How should we view Hadoop Summit 2015 and Spark Summit 2015?
First, IBM is betting heavily on Spark, reportedly committing the whole company to it. Because Spark is still at an early stage, getting involved early is bound to bring benefits. How do other big companies such as MS and Google see it?
There are many interesting topics on the schedule. Which ones do you find most interesting?
Because Hadoop Summit was also held in San Jose the week before, I have merged the two into one question.

The views of Dropbox R&D manager Shao Zheng are worth savoring.

I attended both conferences this year. Let me share my personal impressions.

First of all, IBM (as the original poster said) claims to be betting on Spark. But IBM has moved too slowly in the Big Data field (relative to Internet companies), so it wants to board the Spark train and catch up. I am not very optimistic about how much this will ultimately help IBM. Several other large companies already have similar technology in place, so they have no need to stake everything the way IBM has.

My detailed impressions follow.

Conference information

Hadoop Summit 2015: schedule, videos, slides (PPT)
Spark Summit 2015: schedule (including videos and PPT), complete video recordings (Track A, Track B, Track C)

Conference scale

The big data community continues to expand. Attendance at both conferences set a new record this year. Hadoop Summit 2015 had 4,000 attendees, an increase of about 30% (2014: 3,100; 2013: 2,600; 2012: 2,100; 2011: 1,600; 2010: 1,200). Spark Summit 2015 had 2,000 attendees, an increase of 300% (2014: 500). As these figures show, Hadoop Summit attendance is still growing at an accelerating pace, but far more slowly than Spark Summit's. It is worth noting that tickets to both conferences cost thousands of dollars, so attendance this high is a good reflection of how hot big data currently is. In addition, at both conferences many companies from different industries shared their own experience applying Hadoop/Spark, which suggests that big data has taken root in many industries.
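As a quick check of the growth figures quoted above, the following minimal Python sketch recomputes the year-over-year change from the attendance numbers in the text. The function name `yearly_growth` and the dictionary layout are illustrative choices, not something from the original post.

```python
# Attendance figures exactly as quoted in the paragraph above.
hadoop_summit = {2010: 1200, 2011: 1600, 2012: 2100,
                 2013: 2600, 2014: 3100, 2015: 4000}
spark_summit = {2014: 500, 2015: 2000}


def yearly_growth(attendance):
    """Return {year: (absolute increase, percent increase)} versus the prior listed year."""
    years = sorted(attendance)
    growth = {}
    for prev, cur in zip(years, years[1:]):
        delta = attendance[cur] - attendance[prev]
        growth[cur] = (delta, 100.0 * delta / attendance[prev])
    return growth


if __name__ == "__main__":
    for name, data in [("Hadoop Summit", hadoop_summit),
                       ("Spark Summit", spark_summit)]:
        for year, (delta, pct) in yearly_growth(data).items():
            print(f"{name} {year}: +{delta} attendees ({pct:.0f}%)")
```

On these numbers, Hadoop Summit grew by 400 to 500 attendees in each earlier year and by 900 in 2015 (roughly 29%, which the text rounds to 30%), while Spark Summit grew by 1,500 (300%).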

As for why Spark is developing faster than Hadoop, I think there are several reasons:

1. Spark is very easy to use. Spark Notebook and Spark's interoperability with Java/Scala/Python/R are very well done. By contrast, Hadoop's early users and main community contributors came from large companies and served advanced users. Advanced users care more about whether functionality is complete and the system is stable; ease of use is not a major consideration.

2. Spark is designed for interactive use. This shows in its focus on applications that process smaller data sets, which makes using memory for acceleration very important. It also shows in the elimination of a lot of unnecessary overhead, such as JVM startup time and the sleep/wait in interval polling/heartbeats that exists to prevent self-DDoS. By contrast, many of the Hadoop community's decision makers are from big companies, where very large-scale data computation matters most and a few seconds of startup and wait time are irrelevant.

3. Spark committers put great emphasis on developing external code contributors. At first, coaching an external contributor to submit a patch may be slower than a committer writing and submitting the patch themselves, but mentoring external contributors is a good investment with long-term returns. Clearly, this strategy has worked very well for Spark.

That said, Spark is a relatively new technology, so its supporting technology for operational stability, debugging, and the like lags behind Hadoop's. This year Berkeley's AMPLab published a paper at USENIX NSDI 2015, "Making Sense of Performance in Data Analytics Frameworks", which discusses how to debug Spark performance issues.


1. Hadoop technology continues to mature. Hadoop's larger recent advances are in operational stability and performance, such as High Availability (HA) for YARN ResourceManager, Rolling Upgrades, and Erasure Coding Support inside HDFS. By comparison, there are fewer new user-facing features.

2. Spark is gaining popularity very quickly among Machine Learning and Data Science/Statistics users. MLlib, Spark Notebook, and SparkR are some of Spark's killer products. The DataFrame in SparkSQL is also a very effective feature, but SparkSQL's prospects in the Data Warehouse field (ETL, BI, etc.) remain to be seen, because SparkSQL is, after all, a latecomer.

3. The Spark and Hadoop ecosystems are converging. This can be seen in "Spark & Hadoop, Perfect Together". Hadoop and Spark each have many subprojects. For an advanced big data user, the decision is never "should I use Hadoop or Spark" but "which Hadoop components and which Spark components should I use". It is therefore very important to understand the various subprojects of Hadoop and Spark.

Technologies I follow most closely

1. YARN. YARN is Hadoop 2's system for managing and scheduling computing resources; it can be called the most important difference between Hadoop 1 and Hadoop 2. Development of YARN began in 2010 and the first release came in October 2013, so it now has five years of history. The technology is relatively mature and can be used in production. Dropbox's Hadoop cluster is currently being migrated to YARN.

Interested readers are advised to first read the Hadoop YARN Blog, then focus on these new features: Rolling upgrades, Support for Long-running services (HBase, Storm, Kafka), and support for Docker containers. In the future YARN will gain more monitoring and debugging features (such as the Next Generation Timeline server), which are also worth watching.

2. Hive and Stinger. Stinger was the biggest improvement to Hive in 2013-2014, said to make Hive 100 times more efficient. This year the technology has matured further and can be used more stably in production. Stinger comprises three improvements: ORCFile (optimized columnar storage), Vectorized Execution (vectorized computation), and Tez (a DAG execution system that is not Map-Reduce). This is the next upgrade target for Dropbox's Hadoop cluster.

3. SparkR. I have studied some statistics, and I like the R language's strengths in data processing. The combination of Spark and R in SparkR will be a great help to those who study statistics. I suspect that many advanced big data applications (such as risk-control modeling) will use SparkR in the future.

4. Project Tungsten. Tungsten will bring a very large performance upgrade to Spark. Its main technical points are: Off-Heap Storage and the removal of Java Object overhead, Cache-aware Computation, and Code Generation. The project is still under development; interested readers can participate.


1. Beginners and big data application enthusiasts: I recommend starting with the 2014 Databricks Cloud Demo, then registering as a user at Databricks Cloud (click "Sign Up for Databricks" in the upper right corner) and doing some exercises to master the basic data processing workflow.

2. Developers of underlying data technology: I recommend following Project Tungsten and participating in it.

3. Advanced data users: I suggest paying close attention to other companies' experience with these technologies, for example "Letter from the Trenches: An inside look at Hive at Yahoo". If you are not yet using YARN, Hive, and Stinger in production, I suggest starting to consider the upgrade.

There are other good points as well; see the original post. (Editor: Guo Xuemei)
