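Yesterday I shared "[E-book] Apache Spark 2 for Beginners PDF download". That book is well suited to getting started with Spark, but although its title says Apache Spark 2, its content has almost nothing to do with Spark 2. The book I am sharing today is another introductory Spark e-book, also published by Packt. It was released in September 2016, runs to 339 pages, and is aimed at data scientists. It covers the Spark programming model, an introduction to DataFrames, unified data access, machine learning, structured data analysis, big data visualization, and related topics.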
Chapters
Chapter 1: Big Data and Data Science – An Introduction
Chapter 2: The Spark Programming Model
Chapter 3: Introduction to DataFrames
Chapter 4: Unified Data Access
Chapter 5: Data Analysis on Spark
Chapter 6: Machine Learning
Chapter 7: Extending Spark with SparkR
Chapter 8: Analyzing Unstructured Data
Chapter 9: Visualizing Big Data
Chapter 10: Putting It All Together
Chapter 11: Building Data Science Applications
Detailed table of contents
Preface

Chapter 1: Big Data and Data Science – An Introduction
Big data overview; Challenges with big data analytics; Computational challenges; Analytical challenges; Evolution of big data analytics; Spark for data analytics; The Spark stack; Spark core; Spark SQL; Spark streaming; MLlib; GraphX; SparkR; Summary; References

Chapter 2: The Spark Programming Model
The programming paradigm; Supported programming languages; Scala; Java; Python; R; Choosing the right language; The Spark engine; Driver program; The Spark shell; SparkContext; Worker nodes; Executors; Shared variables; Flow of execution; The RDD API; RDD basics; Persistence; RDD operations; Creating RDDs; Transformations on normal RDDs; The filter operation; The distinct operation; The intersection operation; The union operation; The map operation; The flatMap operation; The keys operation; The cartesian operation; Transformations on pair RDDs; The groupByKey operation; The join operation; The reduceByKey operation; The aggregate operation; Actions; The collect() function; The count() function; The take(n) function; The first() function; The takeSample() function; The countByKey() function; Summary; References

Chapter 3: Introduction to DataFrames
Why DataFrames?; Spark SQL; The Catalyst optimizer; The DataFrame API; DataFrame basics; RDDs versus DataFrames; Similarities; Differences; Creating DataFrames; Creating DataFrames from RDDs; Creating DataFrames from JSON; Creating DataFrames from databases using JDBC; Creating DataFrames from Apache Parquet; Creating DataFrames from other data sources; DataFrame operations; Under the hood; Summary; References

Chapter 4: Unified Data Access
Data abstractions in Apache Spark; Datasets; Working with Datasets; Creating Datasets from JSON; Datasets API's limitations; Spark SQL; SQL operations; Under the hood; Structured Streaming; The Spark streaming programming model; Under the hood; Comparison with other streaming engines; Continuous applications; Summary; References

Chapter 5: Data Analysis on Spark
Data analytics life cycle; Data acquisition; Data preparation; Data consolidation; Data cleansing; Missing value treatment; Outlier treatment; Duplicate values treatment; Data transformation; Basics of statistics; Sampling; Simple random sample; Systematic sampling; Stratified sampling; Data distributions; Frequency distributions; Probability distributions; Descriptive statistics; Measures of location; Mean; Median; Mode; Measures of spread; Range; Variance; Standard deviation; Summary statistics; Graphical techniques; Inferential statistics; Discrete probability distributions; Bernoulli distribution; Binomial distribution; Sample problem; Poisson distribution; Sample problem; Continuous probability distributions; Normal distribution; Standard normal distribution; Chi-square distribution; Sample problem; Student's t-distribution; F-distribution; Standard error; Confidence level; Margin of error and confidence interval; Variability in the population; Estimating sample size; Hypothesis testing; Null and alternate hypotheses; Chi-square test; F-test; Problem: Correlations; Summary; References

Chapter 6: Machine Learning
Introduction; The evolution; Supervised learning; Unsupervised learning; MLlib and the Pipeline API; MLlib; ML pipeline; Transformer; Estimator; Introduction to machine learning; Parametric methods; Non-parametric methods; Regression methods; Linear regression; Loss function; Optimization; Regularizations on regression; Ridge regression; Lasso regression; Elastic net regression; Classification methods; Logistic regression; Linear Support Vector Machines (SVM); Linear kernel; Polynomial kernel; Radial Basis Function kernel; Sigmoid kernel; Training an SVM; Decision trees; Impurity measures; Gini Index; Entropy; Variance; Stopping rule; Split candidates; Categorical features; Continuous features; Advantages of decision trees; Disadvantages of decision trees; Example; Ensembles; Random forests; Advantages of random forests; Gradient-Boosted Trees; Multilayer perceptron classifier; Clustering techniques; K-means clustering; Disadvantages of k-means; Example; Summary; References

Chapter 7: Extending Spark with SparkR
SparkR basics; Accessing SparkR from the R environment; RDDs and DataFrames; Getting started; Advantages and limitations; Programming with SparkR; Function name masking; Subsetting data; Column functions; Grouped data; SparkR DataFrames; SQL operations; Set operations; Merging DataFrames; Machine learning; The Naive Bayes model; The Gaussian GLM model; Summary; References

Chapter 8: Analyzing Unstructured Data
Sources of unstructured data; Processing unstructured data; Count vectorizer; TF-IDF; Stop-word removal; Normalization/scaling; Word2Vec; n-gram modelling; Text classification; Naive Bayes classifier; Text clustering; K-means; Dimensionality reduction; Singular Value Decomposition; Principal Component Analysis; Summary; References

Chapter 9: Visualizing Big Data
Why visualize data?; A data engineer's perspective; A data scientist's perspective; A business user's perspective; Data visualization tools; IPython notebook; Apache Zeppelin; Third-party tools; Data visualization techniques; Summarizing and visualizing; Subsetting and visualizing; Sampling and visualizing; Modeling and visualizing; Summary; References; Data source citations

Chapter 10: Putting It All Together
A quick recap; Introducing a case study; The business problem; Data acquisition and data cleansing; Developing the hypothesis; Data exploration; Data preparation; Too many levels in a categorical variable; Numerical variables with too much variation; Missing data; Continuous data; Categorical data; Preparing the data; Model building; Data visualization; Communicating the results to business users; Summary; References

Chapter 11: Building Data Science Applications
Scope of development; Expectations; Presentation options; Interactive notebooks; References; Web API; References; PMML and PFA; References; Development and testing; References; Data quality management; The Scala advantage; Spark development status; Spark 2.0's features and enhancements; Unifying Datasets and DataFrames; Structured Streaming; Project Tungsten phase 2; What's in store?; The big data trends; Summary; References

Index
Download link
Follow the WeChat official account iteblog_hadoop and reply with Spark2_data to get the download link for this book.
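Unless otherwise stated, all articles on this blog are original. Copyright in original articles belongs to 过往记忆大数据 (过往记忆); reproduction without permission is prohibited.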
Link to this article: 【[E-book] Spark for Data Science PDF download】(https://www.iteblog.com/archives/1854.html)