[E-book] Big Data Analytics PDF Download

Contents

1 Chapters
2 Detailed table of contents
3 Download link

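This book was written by Venkat Ankam and published by Packt Publishing in September 2016; it runs to 326 pages. It is based on Spark 2.0 and Hadoop 2.7 and serves as a reference for data analysts and data scientists, and it also suits newcomers who want to get started.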

Chapters

Chapter 1: Big Data Analytics at a 10,000-Foot View
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
Chapter 3: Deep Dive into Apache Spark
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
Chapter 7: Machine Learning with Spark and Hadoop
Chapter 8: Building Recommendation Systems with Spark and Mahout
Chapter 9: Graph Analytics with GraphX
Chapter 10: Interactive Analytics with SparkR

Detailed table of contents

Preface
Chapter 1: Big Data Analytics at a 10,000-Foot View
　　Big Data analytics and the role of Hadoop and Spark
　　　　A typical Big Data analytics project life cycle
　　　　　　Identifying the problem and outcomes
　　　　　　Identifying the necessary data
　　　　　　Data collection
　　　　　　Preprocessing data and ETL
　　　　　　Performing analytics
　　　　　　Visualizing data
　　　　The role of Hadoop and Spark
　　Big Data science and the role of Hadoop and Spark
　　　　A fundamental shift from data analytics to data science
　　　　　　Data scientists versus software engineers
　　　　　　Data scientists versus data analysts
　　　　　　Data scientists versus business analysts
　　　　A typical data science project life cycle
　　　　　　Hypothesis and modeling
　　　　　　Measuring the effectiveness
　　　　　　Making improvements
　　　　　　Communicating the results
　　　　The role of Hadoop and Spark
　　Tools and techniques
　　Real-life use cases
　　Summary
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
　　Introducing Apache Hadoop
　　　　Hadoop Distributed File System
　　　　Features of HDFS
　　　　MapReduce
　　　　MapReduce features
　　　　MapReduce v1 versus MapReduce v2
　　　　　　MapReduce v1 challenges
　　　　YARN
　　　　Storage options on Hadoop
　　　　　　File formats
　　　　　　Compression formats
　　Introducing Apache Spark
　　　　Spark history
　　　　What is Apache Spark?
　　　　What Apache Spark is not
　　　　MapReduce issues
　　　　Spark's stack
　　Why Hadoop plus Spark?
　　　　Hadoop features
　　　　Spark features
　　　　　　Frequently asked questions about Spark
　　Installing Hadoop plus Spark clusters
　　Summary
Chapter 3: Deep Dive into Apache Spark
　　Starting Spark daemons
　　　　Working with CDH
　　　　Working with HDP, MapR, and Spark pre-built packages
　　Learning Spark core concepts
　　　　Ways to work with Spark
　　　　　　Spark Shell
　　　　　　Spark applications
　　　　Resilient Distributed Dataset
　　　　　　Method 1 – parallelizing a collection
　　　　　　Method 2 – reading from a file
　　　　Spark context
　　　　Transformations and actions
　　　　Parallelism in RDDs
　　　　Lazy evaluation
　　　　Lineage Graph
　　　　Serialization
　　　　Leveraging Hadoop file formats in Spark
　　　　Data locality
　　　　Shared variables
　　　　Pair RDDs
　　Lifecycle of Spark program
　　　　Pipelining
　　　　Spark execution summary
　　Spark applications
　　　　Spark Shell versus Spark applications
　　　　Creating a Spark context
　　　　SparkConf
　　　　SparkSubmit
　　　　Spark Conf precedence order
　　　　Important application configurations
　　Persistence and caching
　　　　Storage levels
　　　　What level to choose?
　　Spark resource managers – Standalone, YARN, and Mesos
　　　　Local versus cluster mode
　　　　Cluster resource managers
　　　　　　Standalone
　　　　　　YARN
　　　　　　Mesos
　　　　　　Which resource manager to use?
　　Summary
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
　　History of Spark SQL
　　Architecture of Spark SQL
　　Introducing SQL, Datasources, DataFrame, and Dataset APIs
　　Evolution of DataFrames and Datasets
　　　　What's wrong with RDDs?
　　　　RDD Transformations versus Dataset and DataFrames Transformations
　　Why Datasets and DataFrames?
　　　　Optimization
　　　　Speed
　　　　Automatic Schema Discovery
　　　　Multiple sources, multiple languages
　　　　Interoperability between RDDs and others
　　　　Select and read necessary data only
　　When to use RDDs, Datasets, and DataFrames?
　　Analytics with DataFrames
　　　　Creating SparkSession
　　　　Creating DataFrames
　　　　　　Creating DataFrames from structured data files
　　　　　　Creating DataFrames from RDDs
　　　　　　Creating DataFrames from tables in Hive
　　　　　　Creating DataFrames from external databases
　　　　Converting DataFrames to RDDs
　　　　Common Dataset/DataFrame operations
　　　　　　Input and Output Operations
　　　　　　Basic Dataset/DataFrame functions
　　　　　　DSL functions
　　　　　　Built-in functions, aggregate functions, and window functions
　　　　　　Actions
　　　　　　RDD operations
　　　　Caching data
　　　　Performance optimizations
　　Analytics with the Dataset API
　　　　Creating Datasets
　　　　Converting a DataFrame to a Dataset
　　　　　　Converting a Dataset to a DataFrame
　　　　Accessing metadata using Catalog
　　Data Sources API
　　　　Read and write functions
　　　　Built-in sources
　　　　　　Working with text files
　　　　　　Working with JSON
　　　　　　Working with Parquet
　　　　　　Working with ORC
　　　　　　Working with JDBC
　　　　　　Working with CSV
　　　　External sources
　　　　　　Working with AVRO
　　　　　　Working with XML
　　　　　　Working with Pandas
　　　　　　DataFrame based Spark-on-HBase connector
　　Spark SQL as a distributed SQL engine
　　　　Spark SQL's Thrift server for JDBC/ODBC access
　　　　Querying data using beeline client
　　　　Querying data from Hive using spark-sql CLI
　　　　Integration with BI tools
　　Hive on Spark
　　Summary
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
　　Introducing real-time processing
　　　　Pros and cons of Spark Streaming
　　　　History of Spark Streaming
　　Architecture of Spark Streaming
　　　　Spark Streaming application flow
　　　　Stateless and stateful stream processing
　　Spark Streaming transformations and actions
　　　　Union
　　　　Join
　　　　Transform operation
　　　　updateStateByKey
　　　　mapWithState
　　　　Window operations
　　　　Output operations
　　Input sources and output stores
　　　　Basic sources
　　　　Advanced sources
　　　　Custom sources
　　　　Receiver reliability
　　　　Output stores
　　Spark Streaming with Kafka and HBase
　　　　Receiver-based approach
　　　　　　Role of Zookeeper
　　　　Direct approach (no receivers)
　　　　Integration with HBase
　　Advanced concepts of Spark Streaming
　　　　Using DataFrames
　　　　MLlib operations
　　　　Caching/persistence
　　　　Fault-tolerance in Spark Streaming
　　　　　　Failure of executor
　　　　　　Failure of driver
　　　　Performance tuning of Spark Streaming applications
　　Monitoring applications
　　Introducing Structured Streaming
　　　　Structured Streaming application flow
　　　　　　When to use Structured Streaming?
　　　　Streaming Datasets and Streaming DataFrames
　　　　　　Input sources and output sinks
　　　　Operations on Streaming Datasets and Streaming DataFrames
　　Summary
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
　　Introducing web-based notebooks
　　Introducing Jupyter
　　　　Installing Jupyter
　　　　Analytics with Jupyter
　　Introducing Apache Zeppelin
　　　　Jupyter versus Zeppelin
　　　　Installing Apache Zeppelin
　　　　　　Ambari service
　　　　　　The manual method
　　　　Analytics with Zeppelin
　　The Livy REST job server and Hue Notebooks
　　　　Installing and configuring the Livy server and Hue
　　　　Using the Livy server
　　　　　　An interactive session
　　　　　　A batch session
　　　　　　Sharing SparkContexts and RDDs
　　　　Using Livy with Hue Notebook
　　　　Using Livy with Zeppelin
　　Introducing Apache NiFi for dataflows
　　　　Installing Apache NiFi
　　　　Dataflows and analytics with NiFi
　　Summary
Chapter 7: Machine Learning with Spark and Hadoop
　　Introducing machine learning
　　Machine learning on Spark and Hadoop
　　Machine learning algorithms
　　　　Supervised learning
　　　　Unsupervised learning
　　　　Recommender systems
　　　　Feature extraction and transformation
　　　　Optimization
　　　　Spark MLlib data types
　　An example of machine learning algorithms
　　　　Logistic regression for spam detection
　　Building machine learning pipelines
　　　　An example of a pipeline workflow
　　　　Building an ML pipeline
　　　　Saving and loading models
　　Machine learning with H2O and Spark
　　　　Why Sparkling Water?
　　　　An application flow on YARN
　　　　Getting started with Sparkling Water
　　Introducing Hivemall
　　Introducing Hivemall for Spark
　　Summary
Chapter 8: Building Recommendation Systems with Spark and Mahout
　　Building recommendation systems
　　　　Content-based filtering
　　　　Collaborative filtering
　　　　　　User-based collaborative filtering
　　　　　　Item-based collaborative filtering
　　Limitations of a recommendation system
　　A recommendation system with MLlib
　　　　Preparing the environment
　　　　Creating RDDs
　　　　Exploring the data with DataFrames
　　　　Creating training and testing datasets
　　　　Creating a model
　　　　Making predictions
　　　　Evaluating the model with testing data
　　　　Checking the accuracy of the model
　　　　Explicit versus implicit feedback
　　The Mahout and Spark integration
　　　　Installing Mahout
　　　　Exploring the Mahout shell
　　　　Building a universal recommendation system with Mahout and search tool
　　Summary
Chapter 9: Graph Analytics with GraphX
　　Introducing graph processing
　　　　What is a graph?
　　　　Graph databases versus graph processing systems
　　　　Introducing GraphX
　　　　Graph algorithms
　　Getting started with GraphX
　　　　Basic operations of GraphX
　　　　　　Creating a graph
　　　　　　Counting
　　　　　　Filtering
　　　　　　inDegrees, outDegrees, and degrees
　　　　　　Triplets
　　　　Transforming graphs
　　　　　　Transforming attributes
　　　　　　Modifying graphs
　　　　　　Joining graphs
　　　　　　VertexRDD and EdgeRDD operations
　　　　GraphX algorithms
　　　　　　Triangle counting
　　　　　　Connected components
　　Analyzing flight data using GraphX
　　　　Pregel API
　　Introducing GraphFrames
　　　　Motif finding
　　　　Loading and saving GraphFrames
　　Summary
Chapter 10: Interactive Analytics with SparkR
　　Introducing R and SparkR
　　　　What is R?
　　　　Introducing SparkR
　　　　Architecture of SparkR
　　Getting started with SparkR
　　　　Installing and configuring R
　　　　Using SparkR shell
　　　　　　Local mode
　　　　　　Standalone mode
　　　　　　Yarn mode
　　　　　　Creating a local DataFrame
　　　　　　Creating a DataFrame from a DataSources API
　　　　　　Creating a DataFrame from Hive
　　　　Using SparkR scripts
　　Using DataFrames with SparkR
　　Using SparkR with RStudio
　　Machine learning with SparkR
　　　　Using the Naive Bayes model
　　　　Using the k-means model
　　Using SparkR with Zeppelin
　　Summary
Index