
[OpenCloud 2015] Highlights of the Three Summits on April 18: Spark Makes Its Grand Debut

Published 2015-04-21 10:12 | Source: CSDN | Author: Guo Xuemei

Abstract: At the Spark Technology Summit, 10 lecturers from Databricks, BAT (Baidu, Alibaba, Tencent), IBM, Intel, Microsoft, AsiaInfo, and Cloudera shared a full program of practical content. The talks covered the Spark ecosystem and its development direction, as well as the difficulties met and the optimization experience gained when applying components such as Spark SQL, GraphX, and MLlib in different organizations.

Sponsored by CSDN and supported by the CSDN expert advisory group, the OpenCloud 2015 conference was successfully held at the National Convention Center in Beijing. The 2015 OpenStack Technology Conference, the 2015 Spark Technology Summit, and the 2015 Container Technology Summit were all packed with practical content and won the recognition of the audience. CSDN Cloud Computing has compiled its live coverage of each track to share the highlights with lecturers and readers. This article covers the Spark summit highlights; the Container summit highlights and the OpenStack summit highlights are published at the same time. (Lecturer PDFs are still being collected and will be released this week; please watch for the announcement on the @CSDN Cloud Computing WeChat account.)

Morning highlights

09:00 The 2015 Spark Technology Summit was hosted by Chen Chao, technical director of Qiniu. Seeing many attendees standing in the hall, Chen Chao said he was pleased with how Spark has developed.

09:00 On the second day of OCC 2015, Tathagata Das (TD), the lead of Spark Streaming, gave the first talk of the 2015 Spark Technology Summit. TD first reviewed the state of Spark in 2014: the number of contributors grew from 150 to 500, and the code base grew from 190 thousand lines to 370 thousand lines. Spark has also been deployed in more than 500 production environments. TD then summarized Spark's focus in 2014: enterprise applications; richer libraries; a more scalable, higher-performance core engine; and a wider range of out-of-the-box scenarios. He also revealed the direction of Spark in 2015: machine learning that more people can use, and richer platform interfaces.

09:30 The second talk of the 2015 Spark Technology Summit came from Zhou Hucheng, a researcher at Microsoft Research Asia. His topic was the Spark ecosystem and its applications inside Microsoft. Drawing on the Spark SQL, GraphX, and MLlib components, he shared Microsoft's experience in building a Spark ecosystem internally.

10:30 Tencent senior engineer Wang Lianhui gave an in-depth talk, "Application and Optimization Practice of Spark at Tencent". As of the beginning of this year, the Spark cluster of Tencent's TDW (Tencent Distributed Data Warehouse) had reached the following scale: Gaia cluster nodes, 8,000+; HDFS storage space, 150 PB+; new data per day, 1 PB+; tasks per day, 1M+; data computed per day, 10 PB+. Wang Lianhui said that Tencent started using Spark in 2013 with version 0.6 and currently runs Spark 1.2. Typical applications fall into three areas: predicting the probability that a user clicks an ad; counting the common friends between two friends; and running ETL with Spark SQL and DAG tasks. Tencent has also gone deep on optimization, for example: improving the application development experience; using dynamic resource scaling for ETL jobs; starting the reduce stage before the map stage has finished; predicting a stage's partition count from its data size; assigning a driver to each Spark SQL session; optimizing count(distinct); and implementing sort-based GroupBy/Join.
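
One of the optimizations listed above, predicting a stage's partition count from its data size, can be illustrated with a minimal sketch. This is not Tencent's implementation, which the talk summary does not describe; the function name, the 128 MiB target partition size, and the upper bound of 10,000 partitions are all assumptions chosen for illustration.

```python
# Hypothetical sketch: pick a partition count so that each partition
# holds roughly `target_bytes_per_partition` bytes of stage input.
MIB = 1024 * 1024


def estimate_partitions(stage_input_bytes,
                        target_bytes_per_partition=128 * MIB,
                        min_partitions=1,
                        max_partitions=10000):
    """Return a partition count derived from the stage's input size."""
    if stage_input_bytes <= 0:
        return min_partitions
    # Integer ceiling division avoids floating-point rounding on large sizes.
    wanted = (stage_input_bytes + target_bytes_per_partition - 1) \
        // target_bytes_per_partition
    # Clamp so tiny stages still get one partition and huge stages
    # do not create an unmanageable number of tasks.
    return max(min_partitions, min(max_partitions, wanted))


if __name__ == "__main__":
    for size in (64 * MIB, 1024 * MIB, 10 * 1024 * 1024 * MIB):
        print(size, "bytes ->", estimate_partitions(size), "partitions")
```

In a real job the estimate would be fed into whatever setting controls shuffle parallelism; the clamp values would need tuning per cluster.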

11:10 Lian Cheng, an engineer at Databricks, a Spark committer, and one of the main developers of Spark SQL, gave a detailed talk on "Structured Data Analysis with Spark SQL". He introduced many of the new features in Spark 1.3, focusing on DataFrame. DataFrame evolved from SchemaRDD and provides a higher-level API abstraction whose form is very similar to the data frames of R and Python. The difference between DataFrame and RDD is somewhat like the difference between dynamic and static languages, and in many scenarios the advantages of DataFrame are clear. Spark 1.3 also further improves the external data source API and makes optimization smarter. Through a lightweight abstraction, DataFrame supports many kinds of data sources, such as Hive, S3, HDFS, Hadoop, Parquet, MySQL, HBase, and dBase, so it is easy to run many kinds of data analysis on top of it. Spark Core has far less code than Hadoop, and Spark SQL's code is leaner still, which makes it much more readable.

11:50 Baidu senior software engineer Ma Xiaolong spoke on Spark engineering practice at Baidu, covering two main parts: Spark inside Baidu and Spark on Baidu's public cloud. When explaining Tachyon, Ma Xiaolong first described the problems Baidu faced, which are the reasons it uses Tachyon: data nodes and compute nodes may not be in the same data center, and access across data centers is slow. He then shared Baidu's solution: use Tachyon as a transparent cache layer; cold queries read data from remote storage nodes; hot queries read directly from Tachyon. With this work, Baidu achieved a 10x+ performance improvement on warm and hot queries.
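
The cold/hot read path described above can be sketched as a simple read-through cache. The sketch below is plain Python and does not use Tachyon's API; the class name `TransparentCache`, the `remote_read` callback, and the LRU eviction policy are assumptions made for illustration, not details given in the talk.

```python
from collections import OrderedDict


class TransparentCache:
    """Hypothetical read-through cache in front of a slow remote store."""

    def __init__(self, remote_read, capacity):
        self._remote_read = remote_read   # callable: key -> value (slow path)
        self._capacity = capacity         # max number of cached entries
        self._cache = OrderedDict()       # insertion order doubles as LRU order
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._cache:
            # Hot query: serve from the cache and mark as recently used.
            self._cache.move_to_end(key)
            self.hits += 1
            return self._cache[key]
        # Cold query: fetch from the remote store, then keep a local copy.
        self.misses += 1
        value = self._remote_read(key)
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            # Evict the least recently used entry (assumed policy).
            self._cache.popitem(last=False)
        return value


if __name__ == "__main__":
    remote = {"block-1": b"aaa", "block-2": b"bbb"}
    cache = TransparentCache(remote.__getitem__, capacity=1)
    cache.read("block-1")   # cold: goes to the remote store
    cache.read("block-1")   # hot: served from the cache
    print("hits:", cache.hits, "misses:", cache.misses)
```

The caller only ever calls `read`, which is what makes the layer transparent: whether a query is cold or hot is decided inside the cache.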

Afternoon highlights

13:20 The first lecturer of the afternoon session of the 2015 Spark Technology Summit was Huang, a senior technical expert from Alibaba's Taobao technology department. His topic was "Combining Graphs and Streams: Dynamic Graph Computation Based on Spark Streaming and GraphX". He first introduced the development of GraphX and of Streaming + MLlib, then noted that in practice at Taobao the team also ran into new problems and challenges. He summarized two advantages of combining graphs with streams: finer-grained models, because strong graph operators can achieve better accuracy and efficiency than ordinary operators; and performance optimization, because graph operators can avoid time-consuming RDD operations.
He also stressed several points to watch when combining graphs with streams. Resource guarantee: streaming jobs run for a long time, so cores, workers, and memory must be allocated sensibly to ensure that, most of the time, no serious delay occurs. Spikes and fluctuations: in a real online environment the amount of data per cycle fluctuates, and switching data sources or backfilling data also produces spikes. The approach is to compute a threshold for the system's processing capacity from the input data volume and processing time of each of the previous N cycles, and then to handle peaks in the next cycle according to that threshold. Hanging jobs: messages that are too large may cause a job to hang, so message size must be limited. Data accumulation: when the input data of a cycle exceeds the system's processing capacity, processing is pushed into the next cycle and data piles up. The remedy is to create a data buffer pool for peak shaving: estimate the processing time from the input data volume of each cycle; if the estimated time is greater than the threshold time, put the excess into the buffer pool; if it is less than the threshold time, release a corresponding proportion of data from the buffer pool.
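
The buffer-pool idea for absorbing peaks can be sketched in a few lines. This is an illustrative model only, not the Taobao implementation: the class name `PeakShaver`, the use of an average processing rate over the last N cycles, and the choice to track record counts rather than actual records are all assumptions.

```python
from collections import deque


class PeakShaver:
    """Hypothetical per-cycle planner that buffers input above capacity."""

    def __init__(self, threshold_seconds, window=5):
        self.threshold_seconds = threshold_seconds
        # (records, seconds) for the last `window` completed cycles.
        self.history = deque(maxlen=window)
        self.buffered = 0   # number of records waiting in the buffer pool

    def record_cycle(self, records, seconds):
        """Remember how much data a finished cycle handled and how long it took."""
        self.history.append((records, seconds))

    def capacity(self):
        """Records the system can process within the threshold time, or None."""
        total_records = sum(r for r, _ in self.history)
        total_seconds = sum(s for _, s in self.history)
        if total_records == 0 or total_seconds <= 0:
            return None   # no usable history yet
        rate = total_records / total_seconds
        return int(rate * self.threshold_seconds)

    def plan(self, incoming):
        """Return how many records to process this cycle."""
        cap = self.capacity()
        if cap is None:
            return incoming
        if incoming > cap:
            # Estimated time exceeds the threshold: buffer the excess.
            self.buffered += incoming - cap
            return cap
        # Spare capacity: release buffered records to fill it.
        released = min(self.buffered, cap - incoming)
        self.buffered -= released
        return incoming + released


if __name__ == "__main__":
    shaver = PeakShaver(threshold_seconds=10, window=3)
    shaver.record_cycle(records=1000, seconds=5)
    for incoming in (3000, 1500, 500):
        print(incoming, "->", shaver.plan(incoming), "buffered:", shaver.buffered)
```

A production version would hold the buffered records themselves and would also bound the pool size, since sustained overload would otherwise grow it without limit.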

14:30 Cloudera senior architect Tian Fengzhan (Phil Tian) spoke on "Spark-Driven Intelligent Data Analysis Applications". He believes that Spark will replace MapReduce as the general-purpose computing framework for Hadoop. The main reasons are that Spark is well integrated with the Hadoop community while enjoying broad support from the community and vendors, and that it excels at data science and machine learning. During the talk, Dr. Tian used cases from several companies to show the value of Spark. Conviva analyzes traffic patterns in real time and controls traffic more precisely to optimize end users' online video experience; for Conviva, the main value of Spark is rapid prototyping, sharing business logic between offline and online computation, and open-source machine learning algorithms. Yahoo uses Spark to speed up its advertising model training pipeline, with feature extraction improved 3x, and uses collaborative filtering for content recommendation; for Yahoo, the main value of Spark is lower data pipeline latency, iterative machine learning, and efficient P2P broadcast.

14:50 Huang Jie, R&D manager at Intel's big data technology center, gave a detailed explanation of three aspects of Spark: memory management, I/O improvement, and computation optimization. An interactive poll found that nearly 80% of the several hundred people on site said they were using or preparing to use Spark. Of that 80%, 10% expect to use Spark for advanced machine learning and graph analysis, 10% expect to use it for complex interactive OLAP/BI, and 10% want to use it for real-time stream computing. Huang Jie said that Spark will play an important role in big data and will also become the main platform for the next generation of IA big data.

15:20 Following his morning talk, "New Directions for Spark in 2015", Spark Streaming project lead Tathagata Das introduced Spark Streaming's feature updates over the past year, real-world use cases, and upcoming features. TD said that over the past year the Python API for Streaming, the streaming algorithms in MLlib, the Kafka API, the libraries, and the system infrastructure had all been updated. In practice, Pearson (the education publishing group), big data solutions provider Guavus, and video site Netflix all use Spark Streaming in their businesses. Pearson switched from its earlier Storm setup to Spark and uses Spark to update student learning models based on student activities and events, while Netflix analyzes trends in TV shows and movies in real time. Looking ahead, TD revealed that Spark Streaming will be improved in its libraries, ease of use, and performance.

16:00 Tian Yi, manager of the big data platform technology R&D department at AsiaInfo, focused on practice from several projects. One example is the Spark-based transformation of a user tag analysis platform. Initially, communication data and Internet data were explored, monitored, and analyzed using a database, TCL scripts, and SQL. This approach had many problems: as the number of tags grew, the database workload became too high and scaling was expensive; the number of columns in the tag table grew with the number of tags, reaching 2,000 at some sites, which could only be solved by splitting tables, so queries needed join operations; and tag and metric computation could not escape the constraints of SQL, so machine learning algorithms could not be integrated quickly. The first transformation replaced the SQL query layer with Spark SQL on HDFS. The benefits were clear: the Spark SQL + Parquet scheme effectively guarantees query efficiency; the original system needs little modification; and the query system scales out in parallel. There were also new problems: an extra step of exporting data from the database and loading it into HDFS, and an extra step of converting text data to Parquet format. The second transformation moved the original database into HDFS and replaced the TCL scripts with Spark SQL. This not only further improved the scalability of the whole system, but also let the two sets of Spark SQL workloads share the system's computing resources according to when each is busy or idle. With the release of Spark 1.3.0, the External Datasource API was further enhanced, and DataFrame provides rich data source support as well as a DSL for manipulating data. These features help the project's tag analysis algorithms shed their dependence on SQL entirely; the front end can also extract data through ExtDatasource, which reduces dependence on the ETL system. The DataFrame-based processing code is only 1/10 the size of the original program, which greatly improves readability.
Other projects analyzed in similar depth included the Spark Streaming-based transformation of a content recognition platform.

16:40 Chen Guancheng, senior researcher at IBM Research - China, gave a talk on building the SuperVessel big data public cloud with OpenStack, Docker, and Spark. According to Chen Guancheng, SuperVessel is a public cloud built on OpenStack and Power7/Power8 that provides Spark as a Service, Docker Service, Cognitive Computing Service, and other services. He also explained why these technologies were chosen to build the SuperVessel public cloud. OpenStack was chosen for two reasons: 1. its community is active, with more contributors than its competitors; 2. it supports Docker. Docker was chosen for three reasons: 1. its resource usage is far lower than KVM's; 2. it starts very quickly; 3. containers can be built incrementally, restored, and reused. Spark was chosen for four reasons: 1. it is fast; 2. it is unified; 3. its ecosystem is developing rapidly; 4. it has been ported to Power. In his summary, he said that Spark + OpenStack + Docker runs well on OpenPower servers and that Docker services can make DevOps simpler, and he stressed the importance of monitoring everything.

With the lecturers' consent, the conference slides will be released publicly soon; please continue to follow the CSDN Cloud Computing WeChat account.
