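This book was written by Venkat Ankam and published by Packt Publishing in September 2016; it runs to 326 pages. It is based on Spark 2.0 and Hadoop 2.7 and serves as a reference for data analysts and data scientists, though it also suits readers who are just getting started.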
Chapters
Chapter 1: Big Data Analytics at a 10,000-Foot View
Chapter 2: Getting Started with Apache Hadoop and Apache Spark
Chapter 3: Deep Dive into Apache Spark
Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
Chapter 6: Notebooks and Dataflows with Spark and Hadoop
Chapter 7: Machine Learning with Spark and Hadoop
Chapter 8: Building Recommendation Systems with Spark and Mahout
Chapter 9: Graph Analytics with GraphX
Chapter 10: Interactive Analytics with SparkR
Detailed table of contents
Preface

Chapter 1: Big Data Analytics at a 10,000-Foot View
Big Data analytics and the role of Hadoop and Spark; A typical Big Data analytics project life cycle; Identifying the problem and outcomes; Identifying the necessary data; Data collection; Preprocessing data and ETL; Performing analytics; Visualizing data; The role of Hadoop and Spark; Big Data science and the role of Hadoop and Spark; A fundamental shift from data analytics to data science; Data scientists versus software engineers; Data scientists versus data analysts; Data scientists versus business analysts; A typical data science project life cycle; Hypothesis and modeling; Measuring the effectiveness; Making improvements; Communicating the results; The role of Hadoop and Spark; Tools and techniques; Real-life use cases; Summary

Chapter 2: Getting Started with Apache Hadoop and Apache Spark
Introducing Apache Hadoop; Hadoop Distributed File System; Features of HDFS; MapReduce; MapReduce features; MapReduce v1 versus MapReduce v2; MapReduce v1 challenges; YARN; Storage options on Hadoop; File formats; Compression formats; Introducing Apache Spark; Spark history; What is Apache Spark?; What Apache Spark is not; MapReduce issues; Spark's stack; Why Hadoop plus Spark?; Hadoop features; Spark features; Frequently asked questions about Spark; Installing Hadoop plus Spark clusters; Summary

Chapter 3: Deep Dive into Apache Spark
Starting Spark daemons; Working with CDH; Working with HDP, MapR, and Spark pre-built packages; Learning Spark core concepts; Ways to work with Spark; Spark Shell; Spark applications; Resilient Distributed Dataset; Method 1 – parallelizing a collection; Method 2 – reading from a file; Spark context; Transformations and actions; Parallelism in RDDs; Lazy evaluation; Lineage Graph; Serialization; Leveraging Hadoop file formats in Spark; Data locality; Shared variables; Pair RDDs; Lifecycle of Spark program; Pipelining; Spark execution summary; Spark applications; Spark Shell versus Spark applications; Creating a Spark context; SparkConf; SparkSubmit; Spark Conf precedence order; Important application configurations; Persistence and caching; Storage levels; What level to choose?; Spark resource managers – Standalone, YARN, and Mesos; Local versus cluster mode; Cluster resource managers; Standalone; YARN; Mesos; Which resource manager to use?; Summary

Chapter 4: Big Data Analytics with Spark SQL, DataFrames, and Datasets
History of Spark SQL; Architecture of Spark SQL; Introducing SQL, Datasources, DataFrame, and Dataset APIs; Evolution of DataFrames and Datasets; What's wrong with RDDs?; RDD Transformations versus Dataset and DataFrames Transformations; Why Datasets and DataFrames?; Optimization; Speed; Automatic Schema Discovery; Multiple sources, multiple languages; Interoperability between RDDs and others; Select and read necessary data only; When to use RDDs, Datasets, and DataFrames?; Analytics with DataFrames; Creating SparkSession; Creating DataFrames; Creating DataFrames from structured data files; Creating DataFrames from RDDs; Creating DataFrames from tables in Hive; Creating DataFrames from external databases; Converting DataFrames to RDDs; Common Dataset/DataFrame operations; Input and Output Operations; Basic Dataset/DataFrame functions; DSL functions; Built-in functions, aggregate functions, and window functions; Actions; RDD operations; Caching data; Performance optimizations; Analytics with the Dataset API; Creating Datasets; Converting a DataFrame to a Dataset; Converting a Dataset to a DataFrame; Accessing metadata using Catalog; Data Sources API; Read and write functions; Built-in sources; Working with text files; Working with JSON; Working with Parquet; Working with ORC; Working with JDBC; Working with CSV; External sources; Working with AVRO; Working with XML; Working with Pandas; DataFrame based Spark-on-HBase connector; Spark SQL as a distributed SQL engine; Spark SQL's Thrift server for JDBC/ODBC access; Querying data using beeline client; Querying data from Hive using spark-sql CLI; Integration with BI tools; Hive on Spark; Summary

Chapter 5: Real-Time Analytics with Spark Streaming and Structured Streaming
Introducing real-time processing; Pros and cons of Spark Streaming; History of Spark Streaming; Architecture of Spark Streaming; Spark Streaming application flow; Stateless and stateful stream processing; Spark Streaming transformations and actions; Union; Join; Transform operation; updateStateByKey; mapWithState; Window operations; Output operations; Input sources and output stores; Basic sources; Advanced sources; Custom sources; Receiver reliability; Output stores; Spark Streaming with Kafka and HBase; Receiver-based approach; Role of Zookeeper; Direct approach (no receivers); Integration with HBase; Advanced concepts of Spark Streaming; Using DataFrames; MLlib operations; Caching/persistence; Fault-tolerance in Spark Streaming; Failure of executor; Failure of driver; Performance tuning of Spark Streaming applications; Monitoring applications; Introducing Structured Streaming; Structured Streaming application flow; When to use Structured Streaming?; Streaming Datasets and Streaming DataFrames; Input sources and output sinks; Operations on Streaming Datasets and Streaming DataFrames; Summary

Chapter 6: Notebooks and Dataflows with Spark and Hadoop
Introducing web-based notebooks; Introducing Jupyter; Installing Jupyter; Analytics with Jupyter; Introducing Apache Zeppelin; Jupyter versus Zeppelin; Installing Apache Zeppelin; Ambari service; The manual method; Analytics with Zeppelin; The Livy REST job server and Hue Notebooks; Installing and configuring the Livy server and Hue; Using the Livy server; An interactive session; A batch session; Sharing SparkContexts and RDDs; Using Livy with Hue Notebook; Using Livy with Zeppelin; Introducing Apache NiFi for dataflows; Installing Apache NiFi; Dataflows and analytics with NiFi; Summary

Chapter 7: Machine Learning with Spark and Hadoop
Introducing machine learning; Machine learning on Spark and Hadoop; Machine learning algorithms; Supervised learning; Unsupervised learning; Recommender systems; Feature extraction and transformation; Optimization; Spark MLlib data types; An example of machine learning algorithms; Logistic regression for spam detection; Building machine learning pipelines; An example of a pipeline workflow; Building an ML pipeline; Saving and loading models; Machine learning with H2O and Spark; Why Sparkling Water?; An application flow on YARN; Getting started with Sparkling Water; Introducing Hivemall; Introducing Hivemall for Spark; Summary

Chapter 8: Building Recommendation Systems with Spark and Mahout
Building recommendation systems; Content-based filtering; Collaborative filtering; User-based collaborative filtering; Item-based collaborative filtering; Limitations of a recommendation system; A recommendation system with MLlib; Preparing the environment; Creating RDDs; Exploring the data with DataFrames; Creating training and testing datasets; Creating a model; Making predictions; Evaluating the model with testing data; Checking the accuracy of the model; Explicit versus implicit feedback; The Mahout and Spark integration; Installing Mahout; Exploring the Mahout shell; Building a universal recommendation system with Mahout and search tool; Summary

Chapter 9: Graph Analytics with GraphX
Introducing graph processing; What is a graph?; Graph databases versus graph processing systems; Introducing GraphX; Graph algorithms; Getting started with GraphX; Basic operations of GraphX; Creating a graph; Counting; Filtering; inDegrees, outDegrees, and degrees; Triplets; Transforming graphs; Transforming attributes; Modifying graphs; Joining graphs; VertexRDD and EdgeRDD operations; GraphX algorithms; Triangle counting; Connected components; Analyzing flight data using GraphX; Pregel API; Introducing GraphFrames; Motif finding; Loading and saving GraphFrames; Summary

Chapter 10: Interactive Analytics with SparkR
Introducing R and SparkR; What is R?; Introducing SparkR; Architecture of SparkR; Getting started with SparkR; Installing and configuring R; Using SparkR shell; Local mode; Standalone mode; Yarn mode; Creating a local DataFrame; Creating a DataFrame from a DataSources API; Creating a DataFrame from Hive; Using SparkR scripts; Using DataFrames with SparkR; Using SparkR with RStudio; Machine learning with SparkR; Using the Naive Bayes model; Using the k-means model; Using SparkR with Zeppelin; Summary

Index
Download
Follow the WeChat official account iteblog_hadoop and reply with 大数据分析 ("Big Data Analytics") to get the download link for this book.
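Unless otherwise stated, all articles on this blog are original.
Copyright of original articles belongs to 过往记忆大数据 (过往记忆); reproduction without permission is prohibited.
Link to this article: [E-book] Big Data Analytics PDF download (https://www.iteblog.com/archives/1880.html)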