Spark in practice: slides from Databricks, Cloudera, Alibaba, and Tencent


CSDN· 2016-04-18 20:53:24

Since 2014, the China Spark Technology Summit, sponsored by CSDN, has been held successfully twice. In 2016 the summit also has the support of Databricks, the company behind Spark, and all topics were selected jointly by Databricks and summit chairman Chen Chao. At the 2016 China Spark Technology Summit, attendees can hear first-hand experience from Databricks, Hortonworks, Intel, Elastic, Tencent, Sina, AdMaster, and other well-known companies in China and abroad. They can also talk face to face with Ram Sriharsha of Databricks, a product manager for the Spark open-source stack and a senior PMC member.

The China Cloud Computing Technology Conference, which includes the OpenStack Summit, the Container Summit, and the Big Data Core Technology and Practical Application Summit, will be held in the same period. Since 2010, CSDN has taken part in and hosted the China Cloud Computing Conference for six consecutive years. More than 700 well-known speakers from China and abroad have taken the podium, and more than 20,000 attendees from the Internet, education, finance, telecommunications, intelligent transportation, electric power, manufacturing, medical, and other industries have witnessed the development and adoption of cloud computing technology in China.


At the 2015 China Spark Technology Summit, a dozen experts shared their latest Spark practice. What follows is a brief review based on their slides.


One, Databricks engineer Lian Cheng: structured data analysis with Spark SQL



Lian Cheng gave a detailed walkthrough of structured data analysis with Spark SQL. He introduced many of the new features in Spark 1.3, with a focus on DataFrame. DataFrame evolved from SchemaRDD and offers a higher-level API whose form is very similar to data frames in R and Python. The difference between DataFrame and RDD is somewhat like the difference between dynamic and static languages, and in many scenarios DataFrame has the clearer advantage. Spark 1.3 also further improved the external data source API and added smarter optimization. Through this lightweight abstraction, DataFrame supports many kinds of data sources, such as Hive, S3, HDFS, Hadoop, Parquet, MySQL, HBase, and dBase, so it is easy to run many kinds of data analysis on top of it. Spark Core has far less code than Hadoop, and Spark SQL code is more concise still, so it is much more readable.


Two, Huang Jie, manager at the Intel Big Data Technology R&D Center: Spark optimization and practical experience



Huang Jie explained three areas in detail: Spark memory management, I/O improvement, and optimization. A live poll found that nearly 80% of the several hundred people in the room were already using Spark or preparing to use it. Of that 80%, 10% expected to use Spark for advanced machine learning and graph analysis, 10% for complex interactive OLAP/BI, and 10% for real-time stream computing. Huang Jie said that Spark will play an important role in big data and will become the main platform for the next generation of big data on IA.


Three, Cloudera senior architect Tian Fengzhan: Spark-driven intelligent data analysis applications


Tian Fengzhan's talk was on Spark-driven intelligent data analysis applications. He believes Spark will replace MapReduce as the general-purpose computing framework for Hadoop, mainly because it integrates well with the Hadoop community, it now has broad community and vendor support, and it is strong at data science and machine learning. During the talk, Dr. Tian used concrete cases from several companies to show the value of Spark. Conviva analyzes traffic patterns in real time and controls traffic more precisely to optimize the online video experience of end users. For Conviva, the main value of Spark is rapid prototyping, sharing business logic between offline and online computation, and open-source machine learning algorithms. Yahoo used Spark to accelerate its advertising model training pipeline, sped up feature extraction by 3x, and uses collaborative filtering for content recommendation. For Yahoo, the main value of Spark is lower data pipeline latency, iterative machine learning, and efficient P2P broadcast.


Four, Chen Guancheng, senior researcher at the IBM China Research Institute: building the SuperVessel big data public cloud on OpenStack, Docker, and Spark




Chen Guancheng explained that SuperVessel is a public cloud built on OpenStack and Power7/Power8 that provides Spark as a Service, Docker Service, Cognitive Computing Service, and other services. He also explained why Docker and Spark were chosen to build the SuperVessel public cloud. There were two reasons for choosing OpenStack: (1) its community is active and has more contributors than competing projects; (2) it supports Docker. There were three reasons for choosing Docker: (1) its resource usage is far lower than KVM; (2) it starts very quickly; (3) containers can be built incrementally, restored, and reused. Spark was chosen for four reasons: (1) it is fast; (2) it is unified; (3) its ecosystem is developing rapidly; (4) it has been ported to Power. In his closing summary, he said that Spark + OpenStack + Docker runs well on OpenPower servers and that Docker services make DevOps simpler. He also stressed the importance of monitoring everything.


Five, Tencent senior engineer Wang Lianhui: Tencent's in-depth application and optimization practice on Spark

Wang Lianhui presented "Tencent's in-depth application and optimization practice based on Spark". By early 2015, the Spark cluster of TDW (Tencent distributed Data Warehouse) had reached the following scale: Gaia cluster nodes, 8,000+; HDFS storage space, 150 PB+; new data per day, 1 PB+; tasks per day, 1M+; data computed per day, 10 PB+. Wang Lianhui said that Tencent began using Spark with version 0.6 in 2013 and was running Spark 1.2 at the time of the talk. There are three typical application areas: predicting the probability that a user clicks an ad; computing the number of common friends between two friends; and ETL with SparkSQL and DAG tasks. Tencent has gone further on optimization. Examples include: the application development experience; dynamic scaling of resources up and down for ETL jobs; starting the reduce stage before the map stage has finished; predicting the number of partitions for a stage from the size of its data; assigning a driver to each SparkSQL session; optimizing count(distinct); and sort-based GroupBy/Join.
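To make the common-friends workload mentioned above concrete, here is a minimal single-machine sketch in plain Python. It is not Tencent's Spark job; the function name `common_friend_counts` and the sample friendship list are hypothetical and serve only to illustrate the computation.

```python
from collections import defaultdict

def common_friend_counts(edges):
    """For every friendship (a, b), count the friends that a and b share."""
    # Build an adjacency set per user from the undirected friendship list.
    friends = defaultdict(set)
    for a, b in edges:
        friends[a].add(b)
        friends[b].add(a)

    # The common friends of a pair are the intersection of their friend sets.
    counts = {}
    for a, b in edges:
        pair = (a, b) if a < b else (b, a)
        counts[pair] = len(friends[a] & friends[b])
    return counts

# Hypothetical sample data.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "cat"), ("cat", "dan")]
result = common_friend_counts(edges)
```

On a cluster, the same idea would typically be expressed as a join of the friendship list with itself, which is where the sort-based GroupBy/Join optimizations listed above would matter.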


Six, Huang Ming, senior technical expert in the Alibaba Taobao technical department: combining streams and graphs, dynamic graph computation based on Spark Streaming and GraphX



Huang Ming's talk was titled "Combining streams and graphs: dynamic graph computation based on Spark Streaming and GraphX". He first introduced the development of GraphX and of Streaming + MLlib, and then the new problems and challenges encountered in practice at Taobao. He summed up the advantages of combining streams and graphs in two points. The first is finer models: compared with ordinary operators, powerful graph operators can deliver better accuracy and efficiency. The second is performance optimization: graph operators can avoid time-consuming RDD operations.

He emphasized the following points to watch when combining streams and graphs.

Resource guarantees: streaming jobs run for a long time, so cores, workers, and memory must be allocated sensibly to ensure that no serious delay occurs most of the time.

Spikes and fluctuations: in a real online environment, the amount of data per cycle fluctuates, and switching data sources or backfilling data also produces spikes. First, use the input data volume and processing time of each of the previous N cycles to compute a threshold for the system's processing capacity. Then shave peaks in the next cycle according to that threshold.

Hangs: delivering too many messages may make the job appear dead, so message size must be limited.

Data accumulation: when the input data of one cycle exceeds the processing capacity of the system, its processing is pushed into the next cycle and data piles up. The remedy is a data buffer pool that shaves peaks. Estimate the processing time from the input data volume of each cycle. If the estimated time is greater than the threshold time, put the excess into the buffer pool. If the estimated time is less than the threshold time, release a corresponding proportion of data from the buffer pool.
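The threshold and buffer-pool idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not the Taobao implementation: it assumes processing time is proportional to record count, so the time threshold is expressed as a record count per cycle, and the names `capacity_threshold` and `PeakShaver` are hypothetical.

```python
from collections import deque

def capacity_threshold(history, cycle_seconds):
    """Estimate how many records fit in one cycle.

    history is a list of (records_in, seconds_spent) for the previous N cycles.
    Assumption: processing time grows linearly with the number of records.
    """
    total_records = sum(records for records, _ in history)
    total_seconds = sum(seconds for _, seconds in history)
    rate = total_records / total_seconds  # records per second
    return int(rate * cycle_seconds)

class PeakShaver:
    """FIFO buffer pool that caps the work handed to each cycle."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pool = deque()

    def next_batch(self, incoming):
        # New data queues behind any backlog so that order is preserved.
        self.pool.extend(incoming)
        # Release at most `threshold` records; the excess stays in the pool.
        n = min(len(self.pool), self.threshold)
        return [self.pool.popleft() for _ in range(n)]

# Hypothetical numbers: 40 records took 8 seconds, and a cycle lasts 2 seconds.
threshold = capacity_threshold([(10, 2.0), (30, 6.0)], cycle_seconds=2.0)
shaver = PeakShaver(threshold)
batch1 = shaver.next_batch(range(25))      # spike: the excess is buffered
batch2 = shaver.next_batch(range(25, 28))  # quiet cycle: backlog is released
batch3 = shaver.next_batch([])             # empty cycle: pool drains
```

A spike is thus spread over the following quieter cycles instead of delaying a single cycle. A production version would also need a bound on the pool size, which this sketch omits.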


Seven, Tian Yi, manager of the big data platform R&D department at AsiaInfo: applications of the Spark platform at telecom operators



Tian Yi focused on sharing practice from several projects. One example was rebuilding a user tag analysis platform on Spark. Initially, communication data and Internet data were explored, monitored, and analyzed through a database, TCL scripts, and SQL. This had many problems. The number of tags kept growing, the database load was too high, and scaling was expensive. The number of columns in the tag table grew with the number of tags, reaching 2,000 at some sites, which could only be handled by splitting the table, so queries needed join operations. Tag and metric computation could not escape the constraints of SQL, so machine learning algorithms could not be integrated quickly.

The first rework moved queries from the database to Spark SQL on HDFS. The benefits were clear: the SparkSQL + Parquet scheme effectively guaranteed query efficiency; the original system needed almost no changes; and the query system could scale out in parallel. There were also new problems, such as the extra step of exporting data from the database and loading it into HDFS, and the extra step of converting text data to Parquet format.

The second rework replaced the original database with HDFS and the TCL scripts with SparkSQL. The scalability of the whole system improved further, and the two SparkSQL deployments, which are busy at different times, can share the computing resources of the whole system.

With the release of Spark 1.3.0, the External Datasource API was further enhanced, DataFrame provided support for a rich variety of data sources, and DataFrame provided a DSL for manipulating data. These helped the project free its tag analysis algorithms from SQL entirely. The front end can also extract data through ExtDatasource, which reduces dependence on the ETL system. The DataFrame-based processing code is only one tenth the size of the original program, which greatly improves readability.

Another project analyzed in the same depth was the rework of a content identification platform based on Spark Streaming.


The 2016 China Spark Technology Summit will be held in Beijing on May 15 this year, and session tickets are available at a limited-time discount.
