Data Engineering with PySpark
Data Engineering has become an essential role in the Data Science space. For Data Analysts to do productive work, they need consistent datasets to analyze. A Data Engineer provides this consistency by ingesting data in a variety of formats, using a variety of tools. This class introduces programmers to tools for ETL applications as well as big data applications built on Apache Spark. Participants will gain hands-on experience with PySpark, the Spark SQL module, and DataFrames.
Audience
This course is suitable for Software Developers, Data Scientists, and anyone who needs to manipulate large datasets.
Prerequisites
Students should have a general background in programming and/or data processing, and the ability to learn a new language (Python) through stepwise exercises.
Chapter 1. Defining Data Engineering
• What is Data Engineering?
• How is it different from Data Science?
Chapter 2. The Data Engineer Role
• The scope of the DE role
• Data Scientists, Machine Learning Specialists, and Data Engineers
Chapter 3. Data Processing Phases
• Data Ingestion
• Data Cleansing
Chapter 4. Distributed Computing Concepts
• Data Physics
• CAP Theorem
• Hadoop
Chapter 5. Apache Spark
• Supported Languages
• Distributed Data Processing with PySpark
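As a first taste of what distributed data processing with PySpark looks like, here is a minimal sketch; the application name and the numbers are arbitrary:

    from pyspark.sql import SparkSession

    # Build (or reuse) a SparkSession, the entry point to PySpark.
    spark = SparkSession.builder.appName("IntroExample").getOrCreate()

    # Distribute a small collection across the cluster and process it in parallel.
    rdd = spark.sparkContext.parallelize(range(1, 1001))
    print(rdd.map(lambda x: x * 2).sum())

    spark.stop()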
Chapter 6. Apache Spark Dev Environments
• Spark Shells
• Jupyter Notebooks
Chapter 7. Introduction to Functional Programming
• Why Do I Need Functional Programming?
• Functional Programming with Python
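A small illustration of the functional style the course builds on, in plain Python (no Spark required; the prices and the 6% tax rate are made up):

    from functools import reduce

    prices = [19.99, 5.50, 42.00, 7.25]

    # map: apply a function to every element (here, an assumed 6% tax).
    with_tax = list(map(lambda p: p * 1.06, prices))

    # filter: keep only elements that satisfy a predicate.
    expensive = list(filter(lambda p: p > 10, with_tax))

    # reduce: fold a collection down to a single value.
    total = reduce(lambda a, b: a + b, expensive, 0.0)
    print(total)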
Chapter 8. Functional Programming using Spark RDD API
• RDD Transformations and Actions
• Data Partitioning
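A minimal sketch of RDD transformations, actions, and partitioning:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Transformations (filter, map) are lazy; nothing executes yet.
    rdd = sc.parallelize(range(100), numSlices=4)  # request 4 partitions
    squares = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

    # Actions trigger execution and return results to the driver.
    print(squares.count())
    print(squares.take(5))
    print(rdd.getNumPartitions())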
Chapter 9. ETL Jobs with RDD
• Using map-reduce FP for Data Processing
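A sketch of a small map-reduce style ETL job over the RDD API; the input path, the three-column record layout, and the output path are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Extract: read raw text lines.
    lines = sc.textFile("sales.csv")

    # Transform: parse, drop malformed rows, and aggregate per key.
    pairs = (lines.map(lambda line: line.split(","))
                  .filter(lambda cols: len(cols) == 3)
                  .map(lambda cols: (cols[0], float(cols[2]))))
    totals = pairs.reduceByKey(lambda a, b: a + b)

    # Load: write the results back out.
    totals.saveAsTextFile("sales_totals")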
Chapter 10. Spark SQL DataFrames
• What are DataFrames?
• Relationship with RDDs
• Ways to Create DataFrames
• Schema of Datasets
• Inferring the Schema
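A sketch of creating DataFrames and working with their schemas; the column names and the file path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Create a DataFrame from a local collection with explicit column names...
    people = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

    # ...or from a file, letting Spark infer the schema from the data.
    df = spark.read.csv("people.csv", header=True, inferSchema=True)

    # Every DataFrame carries a schema, whether supplied or inferred,
    # and is backed by an RDD of Row objects (people.rdd).
    df.printSchema()
    people.show()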
Chapter 11. SQL-centric Programming using DataFrames API
• Using the sql Method and the Native DataFrame API
• Data Aggregation
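The same aggregation expressed both ways: with the sql method over a temporary view, and with the native DataFrame API (the sample data is made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("toys", 10.0), ("toys", 4.5), ("books", 7.0)], ["category", "price"])

    # SQL-centric: register a temporary view and query it with the sql method.
    df.createOrReplaceTempView("sales")
    spark.sql("SELECT category, SUM(price) AS total FROM sales GROUP BY category").show()

    # The equivalent aggregation with the native DataFrame API.
    df.groupBy("category").agg(F.sum("price").alias("total")).show()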
Chapter 12. ETL Jobs with DataFrames
• Using Spark SQL DataFrame API
• Contrasting with Spark RDD API
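A sketch of the same kind of ETL job written against the DataFrame API; the paths and column names are hypothetical. In contrast to the RDD version, the logic is declarative, so Spark's optimizer can plan the job:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Extract.
    raw = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Transform: filter out missing amounts, then aggregate per region.
    totals = (raw.filter(F.col("amount").isNotNull())
                 .groupBy("region")
                 .agg(F.sum("amount").alias("total")))

    # Load.
    totals.write.mode("overwrite").parquet("sales_totals.parquet")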
Chapter 13. Repairing and Normalizing Data
• What May Be Wrong With My Data?
• Detecting and Removing Bad Data
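A sketch of a few common repairs; the DataFrame and its columns are hypothetical:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", None), (None, -1)], ["name", "age"])

    # Drop rows that are entirely null, then rows missing a required field.
    cleaned = df.na.drop(how="all").na.drop(subset=["name"])

    # Fill remaining nulls with a default and remove impossible values.
    cleaned = cleaned.na.fill({"age": 0}).filter(F.col("age") >= 0)
    cleaned.show()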
Chapter 14. Data Visualization with seaborn
• Exploratory Data Analysis
• Available Options for Producing Graphs
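One common pattern: aggregate or sample in Spark, convert the small result to pandas, and plot it with seaborn (the data here is made up):

    import seaborn as sns
    import matplotlib.pyplot as plt
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("Mon", 3), ("Tue", 7), ("Wed", 5)], ["day", "orders"])

    # Bring only the (small) result set back to the driver as a pandas DataFrame.
    pdf = df.toPandas()

    # Plot it with seaborn.
    sns.barplot(data=pdf, x="day", y="orders")
    plt.show()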
Chapter 15. Working with Various File Formats: CSV, Parquet, ORC, and JSON
• What Are Columnar Data Storage Formats?
• Comparing Various Formats
• Ways to Read and Store Data in Various Formats
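A sketch of reading and writing the formats compared in this chapter; all file paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Row-oriented text formats.
    csv_df = spark.read.csv("input.csv", header=True, inferSchema=True)
    json_df = spark.read.json("input.json")

    # Columnar formats (Parquet, ORC): typically smaller and faster to scan.
    csv_df.write.mode("overwrite").parquet("output.parquet")
    csv_df.write.mode("overwrite").orc("output.orc")
    parquet_df = spark.read.parquet("output.parquet")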
Is there a discount available for current students?
UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Asynchronous courses are excluded from this offer.
What is the cancellation and refund policy?
Students will receive a refund of paid registration fees only if UMBC Training Centers receives a notice of cancellation at least 10 business days prior to the class start date for classes, or the exam date for exams.