Welcome to the Data Engineering Nanodegree Program
Pre-requisites
How to Succeed (3:32)
Access the Career Portal
How Do I Find Time for My Nanodegree?
Introduction to Data Engineering
What do Data Engineers do? (1:44)
What do Data Engineers do? 2 (6:02)
A Brief History
Data Engineering Tools
Introduction to Data Modeling
02. What is Data Modeling? (4:20)
Test
03. Why is Data Modeling Important? (1:44)
Test
04. Who does this type of work? (0:28)
05. Intro to Relational Databases (3:43)
Test
06. Relational Databases
07. When to use a relational database? (3:17)
08. ACID Transactions (3:46)
08. Test
09. When Not to Use a Relational Database (2:31)
09. Test
10. What is PostgreSQL? (0:53)
11. Demos: Creating a Postgres Table (10:57)
12. Exercise 1: Creating a Table with Postgres
13. Solution for Exercise 1: Create a Table with Postgres (4:52)
14. NoSQL Databases (4:30)
14. Test
15. What is Apache Cassandra? (0:31)
15. Test
16. When to Use a NoSQL Database (3:03)
17. When Not to Use a NoSQL Database (2:33)
18. Demo 2: Creating table with Cassandra (7:37)
19. Exercise 2: Create table with Cassandra
20. Solution for Exercise 2: Create table with Cassandra (4:55)
21. Conclusion (0:31)
Relational Data Models
02. Databases (0:41)
03. Importance of Relational Databases (2:15)
04. OLAP vs OLTP (1:13)
05. Quiz 1
06. Structuring the Database: Normalization (2:58)
07. Objectives of Normal Form (1:54)
08. Quiz
08. Normal Forms (6:48)
09. Demo 1: Creating Normalized Tables (11:52)
10. Exercise 1: Creating Normalized Tables
12. Denormalization (3:56)
12. Test
13. Demo 2: Creating Denormalized Tables
14. Denormalization Vs. Normalization
15. Exercise 2: Creating Denormalized Tables
17. Fact and Dimension Tables (2:46)
18. Star Schema (0:37)
19. Benefits of Star Schemas
20. Snowflake Schemas (0:56)
21. Demo 3: Creating Fact and Dimension Tables
22. Exercise 3: Creating Fact and Dimension Tables
24. Data Definition and Constraints
25. Upsert
26. Conclusion (1:06)
Project Data Modeling with Postgres
Project Datasets
Project Instructions
Project Description – Data Modeling with Postgres
Project Rubric – Data Modeling with Postgres
NoSQL Data Models
02. Non-Relational Databases (2:50)
03. Distributed Databases (2:33)
04. CAP Theorem (1:27)
05. Quiz 1
06. Denormalization in Apache Cassandra (5:14)
06. Quiz
07. CQL (1:01)
08. Demo 1 (5:40)
09. Exercise 1
10. Exercise 1 Solution
11. Primary Key (3:33)
12. Primary Key
13. Demo 2
14. Exercise 2
15. Exercise 2: Solution
16. Clustering Columns
16. Quiz
17. Demo 3
18. Exercise 3
19. Exercise 3: Solution
20. WHERE Clause
20. Quiz
21. Demo 4 (5:08)
22. Exercise 4
22.1 Solution
23. Lesson Wrap Up
Course Wrap Up
Project Data Modeling with Apache Cassandra
Introduction
Project Details
Project Workspace
Project Description – Data Modeling with Apache Cassandra
Project Rubric – Data Modeling with Apache Cassandra
Introduction to Data Warehouses
01. Course Introduction (0:40)
02. Lesson Introduction (1:06)
03. Data Warehouse: Business Perspective (3:30)
04. Operational vs. Analytical Processes (3:10)
04. Quiz
05. Data Warehouse: Technical Perspective (4:20)
06. Dimensional Modeling (3:28)
06. Quiz
07. ETL Demo: Step 1 & 2 (3:45)
08. Exercise 1: Step 1 & 2
09. ETL Demo: Step 3 (4:16)
10. Exercise 1: Step 3
11. ETL Demo: Step 4 (1:49)
12. Exercise 1: Step 4
13. ETL Demo: Step 5 (6:05)
14. Exercise 1: Step 5
15. ETL Demo: Step 6
16. Exercise 1: Step 6
17. DWH Architecture: Kimball’s Bus Architecture
17. Quiz
18. DWH Architecture: Independent Data Marts
18. Quiz
19. DWH Architecture: CIF
19. Quiz
20. DWH Architecture: Hybrid Bus & CIF
20. Quiz
21. OLAP Cubes
22. OLAP Cubes: Roll-Up and Drill Down
23. OLAP Cubes: Slice and Dice
23. Quiz
24. OLAP Cubes: Query Optimization
25. OLAP Cubes Demo: Slicing & Dicing
26. Exercise 2: Slicing & Dicing
27. OLAP Cubes Demo: Roll-Up
28. Exercise 2: Roll-Up & Drill Down
29. OLAP Cubes Demo: Grouping Sets
30. Exercise 2: Grouping Sets
31. OLAP Cubes Demo: CUBE
32. Exercise 2: CUBE
33. Data Warehouse Technologies
34. Demo: Column format in ROLAP
35. Exercise 3: Column format in ROLAP
Introduction to Cloud Computing and AWS
01. Lesson Introduction
02. Cloud Computing
03. Amazon Web Services (1:02)
04. AWS Setup Instructions
05. Create an IAM Role
06. Create Security Group
07. Launch a Redshift Cluster
08. Create an IAM User
09. Delete a Redshift Cluster
10. Create an S3 Bucket
11. Upload to S3 Bucket
12. Create PostgreSQL RDS
13. Avoid Paying Unexpected Costs for AWS
Implementing Data Warehouses on AWS
01. Lesson Introduction (1:32)
Data Warehouse: A Closer Look
03. Choices for Implementing a Data Warehouse
03. Quiz
04. DWH Dimensional Model Storage on AWS
05. Amazon Redshift Technology
05. Quiz
06. Amazon Redshift Architecture
06. Quiz
07. Redshift Architecture Example
08. SQL to SQL ETL
08. Quiz
09. SQL to SQL ETL – AWS Case
09. Quiz
10. Redshift & ETL in Context
10. Quiz
11. Ingesting at Scale
11. Quiz
12. Redshift ETL Examples
12. Quiz
13. Redshift ETL Continued
13. Quiz
14. Redshift Cluster Quick Launcher (1:37)
Exercise 1: Launch Redshift Cluster
16. Problems with the Quick Launcher (2:20)
17. Infrastructure as Code on AWS (2:59)
17. Quiz
18. Enabling Programmatic Access for IaC
19. Demo: Infrastructure as Code
20. Exercise 2: Infrastructure as Code
21. Exercise Solution 2: Infrastructure as Code
22. Demo: Parallel ETL
23. Exercise 3: Parallel ETL
24. Exercise Solution 3: Parallel ETL
25. Optimizing Table Design
26. Distribution Style: Even
26. Quiz
27. Distribution Style: All
27. Quiz
28. Distribution Style: Auto
29. Distribution Style: Key
29. Quiz
30. Sorting Key
Sorting Key Example
32. Demo: Table Design
33. Exercise 4: Table Design
34. Exercise Solution 4: Table Design
35. Conclusion
Project: Data Warehouse
Introduction
Project Details
Project Instructions
Environment
Project Description – Data Warehouse
Project Rubric – Data Warehouse
The Power of Spark
01. Introduction (1:25)
02. What is Big Data? (1:24)
02. Quiz
03. Numbers Everyone Should Know
03. Quiz
04. Hardware: CPU
04. Quiz
05. Hardware: Memory
Hardware Memory 2
05. Quiz
06. Hardware: Storage
07. Hardware: Network
07. Quiz
08. Hardware: Key Ratios
09. Small Data Numbers
10. Big Data Numbers
10. Quiz
Big Data Numbers Part 2
11. Medium Data Numbers
12. History of Distributed Computing
12. Quiz
13. The Hadoop Ecosystem
14. MapReduce (3:00)
14. Quiz
15. Hadoop MapReduce [Demo]
16. The Spark Cluster (2:22)
16. Quiz
17. Spark Use Cases (1:30)
18. Summary
Data Wrangling with Spark
01. Introduction (1:12)
02. Functional Programming (1:40)
03. Why Use Functional Programming (1:19)
03. Quiz
04. Procedural Example (1:24)
05. Procedural [Example Code]
06. Pure Functions in the Bread Factory (2:31)
07. The Spark DAGs: Recipe for Data (2:14)
08. Maps and Lambda Functions (3:37)
09. Maps and Lambda Functions [Example Code]
10. Data Formats (2:22)
11. Distributed Data Stores (1:15)
12. SparkSession (1:17)
13. Reading and Writing Data into Spark Data Frames (3:57)
15. Imperative vs Declarative programming
16. Data Wrangling with DataFrames
17. Data Wrangling with DataFrames Extra Tips
18. Data Wrangling with Spark [Example Code]
19. Quiz – Data Wrangling with DataFrames
19. Quiz
20. Quiz – Data Wrangling with DataFrames Jupyter Notebook
21. Quiz [Solution Code]
22. Spark SQL (0:56)
23. Example Spark SQL (2:12)
24. Example Spark SQL [Example Code]
25. Quiz – Data Wrangling with SparkSQL
26. Quiz [Spark SQL Solution Code]
27. RDDs
28. Summary
Debugging and Optimization
01. Introduction
02. Setup Instructions AWS
03. From Local to Standalone Mode
04. Spark Scripts
05. Submitting Spark Scripts
06. Storing and Retrieving Data on the Cloud
07. Reading and Writing to Amazon S3 Part 1
07. Reading and Writing to Amazon S3 Part 2
07. Reading and Writing to Amazon S3 Part 3
08. Introduction to HDFS
09. Reading and Writing Data to HDFS
10. Recap Local Mode to Cluster Mode
11. Debugging is Hard
12. Syntax Errors
13. Code Errors
14. Data Errors
15. Data Errors
16. Debugging your Code
17. How to Use Accumulators
18. Spark Web UI
19. Connecting to the Spark Web UI
20. Getting Familiar with the Spark UI
21. Review of the Log Data
22. Diagnosing Errors Part 1
23. Diagnosing Errors Part 2
24. Diagnosing Errors Part 3
25. Optimization Introduction
26. Understanding Data Skew
27. Understanding Big O Complexity
28. Other Issues and How to Address Them
29. Lesson Summary
Introduction to Data Lakes
01. Introduction
02. Lesson Overview
03. Why Data Lakes: Evolution of the Data Warehouse
04. Why Data Lakes: Unstructured & Big Data
05. Why Data Lakes: New Roles & Advanced Analytics
06. Big Data Effects: Low Costs, ETL Offloading
07. Big Data Effects: Schema-on-Read
08. Big Data Effects: (Un-/Semi-)Structured support (2:41)
09. Demo: Schema On Read Pt 1 (2:43)
10. Demo: Schema On Read Pt 2 (2:54)
11. Demo: Schema On Read Pt 3
12. Demo: Schema On Read Pt 4
13. Exercise 1: Schema On Read
14. Demo: Advanced Analytics NLP Pt 1
15. Demo: Advanced Analytics NLP Pt 2 (2:25)
16. Demo: Advanced Analytics NLP Pt 3
17. Exercise 2: Advanced Analytics NLP
18. Data Lake Implementation Introduction
19. Data Lake Concepts
20. Data Lake vs Data Warehouse
21. AWS Setup
22. Data Lake Options on AWS
23. AWS Options: EMR (HDFS + Spark)
24. AWS Options: EMR: S3 + Spark
25. AWS Options: Athena
26. Demo: Data Lake on S3 Pt 1 (3:23)
27. Demo: Data Lake on S3 Pt 2
28. Exercise 3: Data Lake on S3
29. Demo: Data Lake on EMR Pt 1
30. Demo: Data Lake on EMR Pt 2
31. Demo: Data Lake on Athena Pt 1
32. Demo: Data Lake on Athena Pt 2
33. Data Lake Issues
34. [AWS] Launch EMR Cluster and Notebook
35. [AWS] Avoid Paying Unexpected Costs
Project: Data Lake
Project Introduction
Project Datasets
Project Instructions
Project Description – Data Lake
Project Rubric – Data Lake
Data Pipeline
01. Welcome (1:15)
03. What is a Data Pipeline? (2:14)
03. Quiz
04. Data Validation (2:00)
04. Quiz
05. DAGs and Data Pipelines (3:25)
05. Quiz
06. Bikeshare DAG (1:23)
06. Quiz
07. Introduction to Apache Airflow (2:11)
08. Demo 1: Airflow DAGs (8:23)
09.1 Install Apache Airflow on Windows using Windows Subsystem for Linux (WSL) (15:20)
09.2 Install Apache Airflow on macOS (7:32)
09. Workspace Instructions
10. Exercise 1: Airflow DAGs
11. Solution 1: Airflow DAGs (1:26)
12. How Airflow Works
13. Airflow Runtime Architecture
13. Quiz
14. Building a Data Pipeline
15. Demo 2: Run the Schedules
16. Exercise 2: Run the Schedules
17. Solution 2: Run the Schedules
18. Operators and Tasks (2:48)
19. Demo 3: Task Dependencies
20. Exercise 3: Task Dependencies
21. Solution 3: Task Dependencies
22. Airflow Hooks
23. Demo 4: Connections and Hooks
24. Exercise 4: Connections and Hooks
25. Solution 4: Connections and Hooks
26. Demo 5: Context and Templating
27. Exercise 5: Context and Templating
28. Solution 5: Context and Templating
29. Quiz: Review of Pipeline Components
29. Quiz
30. Demo: Exercise 6: Building the S3 to Redshift DAG (7:07)
31. Exercise 6: Build the S3 to Redshift DAG
32. Solution 6: Build the S3 to Redshift DAG
33. Conclusion
Data Quality
01. What are we going to learn? (0:36)
02. What is Data Lineage?
03. Visualizing Data Lineage (2:15)
03. Quiz
04. Demo 1: Data Lineage in Airflow (5:11)
05. Exercise 1: Data Lineage in Airflow
06. Solution 1: Data Lineage in Airflow
07. Data Pipeline Schedules
08. Scheduling in Airflow
08. Quiz
09. Updating DAGs
09. Updating DAGs 2
10. Demo 2: Schedules and Backfills in Airflow
11. Exercise 2: Schedules and Backfills in Airflow
12. Solution 2: Schedules and Backfills in Airflow
13. Data Partitioning
14. Goals of Data Partitioning
14. Quiz
15. Demo 3: Data Partitioning
16. Exercise 3: Data Partitioning
17. Solution 3: Data Partitioning
18. Data Quality
18. Quiz
19. Demo 4: Data Quality
20. Exercise 4: Data Quality
21. Solution 4: Data Quality
22. Conclusion
Production Data Pipelines
01. Lesson Introduction
02. Extending Airflow with Plugins
03. Extending Airflow Hooks & Contrib
04. Demo 1: Operator Plugins
05. Exercise 1: Operator Plugins
06. Solution 1: Operator Plugins
07. Best Practices for Data Pipeline Steps – Task Boundaries
08. Demo 2: Task Boundaries (7:25)
09. Exercise 2: Refactor a DAG
10. Solution 2: Refactor a DAG
11. SubDAGs: Introduction and When to Use Them
12. SubDAGs: Drawbacks of SubDAGs
13. Quiz: SubDAGs
13. Quiz
14. Demo 3: SubDAGs
15. Exercise 3: SubDAGs
16. Solution 3: SubDAGs
17. Monitoring
18. Monitoring
18. Quiz
19. Exercise 4: Building a Full DAG
20. Solution 4: Building a Full Pipeline
21. Conclusion
22. Additional Resources: Data Pipeline Orchestrators
Project Data Pipelines
Project Introduction
Project Overview
Add Airflow Connections to AWS
Project Instructions
Workspace Instructions
Project Workspace
Project Description – Data Pipelines
Project Rubric – Data Pipelines
Take 30 Min to Improve your LinkedIn
Get Opportunities with LinkedIn (2:01)
Use Your Story to Stand Out (3:00)
Why Use an Elevator Pitch
Create Your Elevator Pitch
Pitching to a Recruiter
Use Your Elevator Pitch on LinkedIn
06. Create Your Profile With SEO In Mind
07. Profile Essentials
08. Work Experiences & Accomplishments
09. Build and Strengthen Your Network
10. Reaching Out on LinkedIn
11. Boost Your Visibility
12. Up Next
Project Description – Improve Your LinkedIn Profile
Project Rubric – Improve Your LinkedIn Profile
Capstone Project
Project Instructions
Project Resources
Project Description – Data Engineering Capstone Project
Project Rubric – Data Engineering Capstone Project
Job Search
Intro
Job Search Mindset
Target Your Application to An Employer
Open Yourself Up to Opportunity (0:24)
Refine Your Entry-Level Resume
Convey Your Skills Concisely (1:23)
Effective Resume Components (1:36)
Resume Structure (2:12)
Describe Your Work Experiences (1:09)
Resume Reflection
Craft Your Cover Letter
Get an Interview with a Cover Letter! (1:39)
Purpose of the Cover Letter (1:10)
Cover Letter Components (0:54)
Write the Introduction (1:34)
Write the Body
Write the Conclusion
Format
Optimize Your GitHub Profile
Introduction
GitHub profile important items
Good GitHub repository
Interview Part 1
Identify fixes for example “bad” profile
Identify fixes for example “bad” profile 2
Quick Fixes #1
Quick Fixes #2
Writing READMEs
Interview Part 2
Commit messages best practices
Reflect on your commit messages
Participating in open source projects
Starring interesting repositories
Develop Your Personal Brand
Why Network?
Why Use Elevator Pitches?
Personal Branding
Elevator Pitch
Pitching to a Recruiter
Use Your Elevator Pitch
Project Portfolio
Real-world projects are integral to every Bootcamp AI Nanodegree program. They become the foundation of a job-ready portfolio that helps learners advance their careers in their chosen field. The projects in the Data Engineering Nanodegree program were designed in collaboration with a group of highly talented industry professionals to ensure you develop the most in-demand skills. Every project in a Nanodegree program is human-graded by a member of Bootcamp AI’s mentor and reviewer network. These project reviews include detailed, personalized feedback on how you can improve your work. Bootcamp AI graduates consistently rate projects and project reviews as one of the best parts of their experience with Bootcamp AI.
The Project Journey
The projects take you on a journey in which you assume the role of a Data Engineer at a fictional music streaming company called “Sparkify” as it scales its data engineering in both size and sophistication. You’ll work with simulated data on listening behavior, as well as a wealth of metadata about songs and artists. You’ll start with a small amount of low-complexity data, processed and stored on a single machine. By the end, you’ll have developed a sophisticated set of data pipelines that work with massive amounts of data processed and stored in the cloud. There are five projects in the program. Below is a description of each.
Project 1 – Data Modeling
In this project, you’ll model user activity data for a music streaming app called Sparkify. The project has two parts: you’ll build the data model first as a relational model in PostgreSQL, then as a NoSQL model in Apache Cassandra. In each part, you’ll create a database, import data stored in CSV and JSON files, and design the data model to optimize queries for understanding what songs users are listening to. For PostgreSQL, you’ll also define fact and dimension tables and insert data into your new tables. For Apache Cassandra, you’ll model your data to help the data team at Sparkify answer queries about app usage, setting up your tables to optimize writes of transactional data on user sessions.
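As a rough illustration of the fact and dimension tables mentioned above, here is a minimal sketch of a star schema. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server. The table names, column names, and sample rows are hypothetical and are not taken from the project materials.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; every table, column, and
# sample value below is a hypothetical placeholder.
conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- Dimension tables describe the entities involved in an event.
    CREATE TABLE users (
        user_id    INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        level      TEXT NOT NULL
    );
    CREATE TABLE songs (
        song_id     TEXT PRIMARY KEY,
        title       TEXT NOT NULL,
        artist_name TEXT NOT NULL
    );
    -- The fact table records one row per song play and points at the dimensions.
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        start_time  TEXT NOT NULL,
        user_id     INTEGER NOT NULL REFERENCES users (user_id),
        song_id     TEXT NOT NULL REFERENCES songs (song_id)
    );
""")

conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", "free"), (2, "Lin", "paid")],
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?)",
    [("S1", "Blue", "The Keys"), ("S2", "Red", "The Keys")],
)
conn.executemany(
    "INSERT INTO songplays (start_time, user_id, song_id) VALUES (?, ?, ?)",
    [
        ("2018-11-01T10:00:00", 1, "S1"),
        ("2018-11-01T10:05:00", 2, "S1"),
        ("2018-11-01T10:09:00", 2, "S2"),
    ],
)
conn.commit()


def plays_per_song(connection):
    """Join the fact table to the songs dimension and count plays per title."""
    return connection.execute("""
        SELECT s.title, COUNT(*) AS plays
        FROM songplays AS sp
        JOIN songs AS s ON s.song_id = sp.song_id
        GROUP BY s.title
        ORDER BY plays DESC, s.title
    """).fetchall()


print(plays_per_song(conn))  # [('Blue', 2), ('Red', 1)]
```

The analytic question ("what songs are users listening to?") becomes a single join from the fact table to one dimension, which is the main reason for arranging tables this way.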
Project 2 – Cloud Data Warehousing
In this project, you’ll move to the cloud as you work with larger amounts of data. You are tasked with building an ELT pipeline that extracts Sparkify’s data from Amazon S3, AWS’s object storage service. From there, you’ll stage the data in Amazon Redshift and transform it into a set of fact and dimension tables so the Sparkify analytics team can continue finding insights into what songs their users are listening to.
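The stage-then-transform flow described above can be sketched as follows. This is only an illustration of the ELT pattern: `sqlite3` stands in for Amazon Redshift, an in-memory list stands in for files in S3, and all table names, column names, and rows are hypothetical. On Redshift the load step would typically be a `COPY` from S3 rather than row-by-row inserts.

```python
import sqlite3

# Stand-in for raw log records sitting in S3 (hypothetical sample data).
# Everything arrives as text, exactly as it would from a raw file.
RAW_EVENTS = [
    ("1", "Ada", "Blue", "2018-11-01T10:00:00"),
    ("2", "Lin", "Blue", "2018-11-01T10:05:00"),
    ("2", "Lin", "Red", "2018-11-01T10:09:00"),
]


def run_elt(raw_events):
    """Load raw rows into a staging table, then transform them with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging_events (
            user_id TEXT, first_name TEXT, song_title TEXT, ts TEXT
        );
        CREATE TABLE dim_users (user_id INTEGER PRIMARY KEY, first_name TEXT);
        CREATE TABLE fact_songplays (
            user_id INTEGER, song_title TEXT, start_time TEXT
        );
    """)

    # Extract + Load: copy the raw rows into staging without reshaping them.
    conn.executemany("INSERT INTO staging_events VALUES (?, ?, ?, ?)", raw_events)

    # Transform: reshape inside the warehouse using INSERT ... SELECT.
    conn.executescript("""
        INSERT INTO dim_users (user_id, first_name)
        SELECT DISTINCT CAST(user_id AS INTEGER), first_name
        FROM staging_events;

        INSERT INTO fact_songplays (user_id, song_title, start_time)
        SELECT CAST(user_id AS INTEGER), song_title, ts
        FROM staging_events;
    """)
    conn.commit()
    return conn


warehouse = run_elt(RAW_EVENTS)
print(warehouse.execute("SELECT COUNT(*) FROM dim_users").fetchone()[0])
print(warehouse.execute("SELECT COUNT(*) FROM fact_songplays").fetchone()[0])
```

Keeping the staging table untyped and pushing casts and de-duplication into SQL is what distinguishes this from an ETL flow, where the reshaping would happen before the data reaches the warehouse.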
Project 3 – Data Lakes with Apache Spark
In this project, you’ll build an ETL pipeline for a data lake. The data resides in S3: one directory holds JSON logs of user activity on the app, and another holds JSON metadata on the songs in the app. You will load data from S3, process the data into analytics tables using Spark, and load them back into S3. You’ll deploy this Spark process on a cluster using AWS.
Project 4 – Data Pipelines with Apache Airflow
In this project, you’ll continue your work on Sparkify’s data infrastructure by creating and automating a set of data pipelines. You’ll use Apache Airflow, a workflow orchestration tool originally developed and open-sourced by Airbnb and now maintained as an Apache Software Foundation project. You’ll configure and schedule data pipelines with Airflow, setting dependencies, triggers, and quality checks as you would in a production setting.
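To make the idea of task dependencies and quality checks concrete, here is a minimal sketch in plain Python. It does not use the Airflow API; it relies on the standard library's `graphlib.TopologicalSorter` (Python 3.9+) to order tasks, and the task names and sample rows are hypothetical. In Airflow, the same chain would usually be declared between operators with the `>>` syntax.

```python
from graphlib import TopologicalSorter


def stage_events(state):
    # Pretend to copy raw rows into a staging area (hypothetical data).
    state["staging"] = [{"user_id": 1}, {"user_id": 2}]


def load_fact(state):
    # Build the fact table from whatever was staged.
    state["fact"] = list(state["staging"])


def quality_check(state):
    # Fail loudly if the load produced no rows.
    if not state["fact"]:
        raise ValueError("quality check failed: fact table is empty")
    state["checked"] = True


TASKS = {
    "stage_events": stage_events,
    "load_fact": load_fact,
    "quality_check": quality_check,
}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDENCIES = {
    "stage_events": set(),
    "load_fact": {"stage_events"},
    "quality_check": {"load_fact"},
}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    state = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](state)
    return order, state


order, state = run_pipeline()
print(order)  # ['stage_events', 'load_fact', 'quality_check']
```

Placing the quality check as its own task at the end of the graph means a bad load stops the run with a clear error instead of silently producing an empty table.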
Project 5 – Data Engineering Capstone
The capstone project is an opportunity for you to combine what you’ve learned throughout the program into a more self-driven project. In this project, you’ll define the scope of the project and the data you’ll be working with. We’ll provide guidelines, suggestions, tips, and resources to help you be successful, but your project will be unique to you. You’ll gather data from several different data sources; transform, combine, and summarize it; and create a clean database for others to analyze.
We’re excited to see what you build!