9:00am Big Data Career Fast Track (Python/Spark/Hadoop)

(More information and how to enroll)

Part 1: Python for Big Data

Course Objectives

· Master the fundamentals of writing Python scripts

· Learn Python programming elements such as variables and flow control structures

· Discover how to work with Python data structures such as lists and dictionaries

· Write Python functions to facilitate code reuse

· Use Python to handle data in files, XML, JSON, and databases

· Make your code robust by handling errors and exceptions properly

· Work with Python libraries

· Explore Python's object-oriented features

· Search text using regular expressions

· Visualize data

· Hands-on practice: web scraping to collect data

· Hands-on practice: web development to present information

· Learn the data engineer roadmap

Outline

Python programming

· Introduction to Python

· Set up a development IDE

· Debugging

· Strings

· Numbers

· Control Structures

Data Structures

· List

· Dictionary

Data Operations

· Flat file

· JSON

· XML

· RDBMS

Regular Expressions

· Matching

· Searching

· Searching and modifying

Commonly Used Libraries

· NumPy

· pandas

· SciPy

Application Design

· Requirements Collection

· Configuration Design

· Logging Mechanism

· Data Migration and Validation Practice

Web Scraping

· Web Scraping basics

· Web Scraping advanced

· The Yahoo! Finance Stock Quote Server

Data Visualization

· Introduction

o Data

o Information

o Knowledge

o Data analysis and insight

· Data Analysis and Visualization

o Planning visualization

o Visualization tools

· Visualization Practice

o Health Care

o Sports

o Trends over time

o Financial and Statistical

Web Apps Development

· Web Frameworks

· Building a Social Website

· Sharing Content in Website

· Tracking User Actions

Data Engineer Roadmap

Part 2: Big Data Solution - Spark

Course overview

Data scientists, engineers, and analysts build information platforms that provide deep insight and answer previously unimaginable questions. Spark and Hadoop are transforming how they work by enabling interactive and iterative data analysis at scale.

You will learn how Spark and Hadoop enable data scientists, engineers, and analysts to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities.

You will learn what data scientists, engineers, and analysts do, the problems they solve, and the tools and techniques they use. Through in-class simulations, participants apply data analysis methods to real-world challenges in different industries and, ultimately, prepare for big data application development and big data analyst roles in the field.

Outline

Part I Fundamental

Module 1 - Spark Introduction and Basic Programming

Introduction to Spark

What is Spark?

A Brief History of Spark

Programming with RDDs

Module 2 - Advanced Spark Programming

Spark Storage - Loading and saving data

Advanced Spark Programming

Standalone applications

Module 3 - Spark SQL

Linking with Spark SQL

Using Spark SQL in Applications

JDBC/ODBC server

User-Defined Functions

Spark SQL Performance

Module 4 - Spark Streaming

Architecture and abstraction

Input/output operations

Streaming UI

Performance Considerations

Module 5 - Tuning and Debugging Spark

Configuring Spark

Key Performance Considerations

Module 6 - Running on a Cluster

Runtime Architecture

Cluster Manager

Part II Applications

Module 7 - Machine Learning

Designing a Machine Learning System

Building a Recommendation Engine with Spark

MLlib Decision Trees

Module 8 – Prediction with Decision Trees

Decision tree

Training Examples

Preparing the data

A First Decision tree

Tuning Decision Trees

Making Predictions

Conclusions

Module 9 – Anomaly Detection with K-means Clustering

Anomaly Detection

K-means clustering

A First Take on Clustering

Choosing k

Visualization

Feature Normalisation

Clustering in action

Module 10 – Exploring Property Location data

Loading data

Variables to explore

Exploring property value

Exploring lot size

Exploring costs

Exploring the year a property has been built

Exploring rent and income

Module 11 - Estimating Financial Risk through Monte Carlo Simulation

Build model

Getting the data

Preprocessing

Determining the Factor Weights

Visualizing the results

Evaluating results

Module 12 - Interactive Data Analysis with Zeppelin

Appendix: Scala Programming Essentials

Part 3: Big Data Solution - Hadoop

Introduction to Big Data

All about Data!

Data Storage and Analysis

Comparison with Other Systems

Relational Database Management System

Grid Computing

Volunteer Computing

A Brief History of Hadoop

Compatibility

Installing Single-Node Hadoop

Prerequisites

Installation

Configuration

Standalone Mode

Pseudo-Distributed Mode

Configuring SSH

Formatting the HDFS Filesystem

Starting and stopping MapReduce

Fully Distributed Mode

Creating an Eclipse Plugin for Hadoop-2.x.0

Contents

Download and install Eclipse

Install git

Download source code for Hadoop Plugin for Eclipse from git

Compile and create jar

Install the plugin into Eclipse

Developing a MapReduce Application

The Configuration

Combining Resources

Variable Expansion

Setting Up the Development Environment

Managing Configuration

GenericOptionsParser, Tool, and ToolRunner

Writing a Unit Test with MRUnit

Mapper

Reducer

Running Locally on Test Data

Running a Job in a Local Job Runner

Testing the Driver

Running on a Cluster

Packaging a Job

Launching a Job

The MapReduce Web UI

Retrieving the Results

Debugging a Job

Hadoop Logs

Remote Debugging

Tuning a Job

Profiling Tasks

MapReduce Workflows

Decomposing a Problem into MapReduce Jobs

JobControl

Apache Oozie

MapReduce Features

Counters

Built-in Counters

User-Defined Java Counters

User-Defined Streaming Counters

Sorting

Preparation

Partial Sort

Total Sort

Secondary Sort

Joins

Map-Side Joins

Reduce-Side Joins

Side Data Distribution

Using the Job Configuration

Distributed Cache

MapReduce Library Classes

Setting Up a Hadoop Cluster

Cluster Specification

Network Topology

Cluster Setup and Installation

Installing Java

Creating a Hadoop User

Installing Hadoop

Testing the Installation

SSH Configuration

Hadoop Configuration

Configuration Management

Environment Settings

Important Hadoop Daemon Properties

Hadoop Daemon Addresses and Ports

Other Hadoop Properties

User Account Creation

YARN Configuration

Important YARN Daemon Properties

YARN Daemon Addresses and Ports

Security

Kerberos and Hadoop

Delegation Tokens

Other Security Enhancements

Benchmarking a Hadoop Cluster

Hadoop Benchmarks

User Jobs

Hadoop in the Cloud

Apache Whirr

Administering Hadoop

HDFS

Persistent Data Structures

Safe Mode

Audit Logging

Tools

Monitoring

Logging

Metrics

Java Management Extensions

Maintenance

Routine Administration Procedures

Commissioning and Decommissioning Nodes

Upgrades

Pig

Installing and Running Pig

Execution Types

Running Pig Programs

Grunt

Pig Latin Editors

An Example

Generating Examples

Comparison with Databases

Pig Latin

Structure

Statements

Expressions

Types

Schemas

Functions

Macros

User-Defined Functions

A Filter UDF

An Eval UDF

A Load UDF

Data Processing Operators

Loading and Storing Data

Filtering Data

Grouping and Joining Data

Sorting Data

Combining and Splitting Data

Pig in Practice

Parallelism

Parameter Substitution

Hive

Installing Hive

The Hive Shell

An Example

Running Hive

Configuring Hive

Hive Services

The Metastore

Comparison with Traditional Databases

Schema on Read Versus Schema on Write

Updates, Transactions, and Indexes

HiveQL

Data Types

Operators and Functions

Tables

Managed Tables and External Tables

Partitions and Buckets

Storage Formats

Importing Data

Altering Tables

Dropping Tables

Querying Data

Sorting and Aggregating

MapReduce Scripts

Joins

Subqueries

Views

User-Defined Functions

Writing a UDF

Writing a UDAF

HBase

HBasics

Backdrop

Concepts

Whirlwind Tour of the Data Model

Implementation

Installation

Test Drive

Clients

Java

Avro, REST, and Thrift

Example

Schemas

Loading Data

Web Queries

HBase Versus RDBMS

Successful Service

HBase

Use Case: HBase at Streamy.com

Praxis

Versions

HDFS

UI

Metrics

Schema Design

Counters

Bulk Load

Case Studies

Hadoop Usage at Last.fm

Last.fm: The Social Music Revolution

Hadoop at Last.fm

Generating Charts with Hadoop

The Track Statistics Program

Summary

Saturday Course

Victoria Education Center - Hotline: 416-665-1888
Toronto: 250 Consumers Road, Suite 901, Toronto, Ontario, Canada M2J 4V6
Mississauga: Unit 129, 1140 Burnhamthorpe Road West, Mississauga, Ontario L5C 4E6
Copyright © 2009-2017 Victoria Toronto Training Center. All rights reserved.
