Welcome to the Nanodegree program

Introduction to Software Engineering

In this lesson, you’ll write production-level code and practice object-oriented programming, which you can integrate into machine learning projects.

Software Engineering Practices Pt I

Software Engineering Practices Pt II

OOP

Portfolio Exercise: Upload a Package to PyPI

Web Development

Portfolio Exercise: Deploy a Data Dashboard

Introduction to Data Engineering

ETL Pipelines

Introduction to NLP

Learn Natural Language Processing, one of the fields with the most real-world applications of Deep Learning

Machine Learning Pipelines

Disaster Response Pipeline

Project 1: Disaster Response Pipeline

Concepts in Experiment Design

Statistical Considerations in Testing

A/B Testing Case Study

Portfolio Exercise: Starbucks

Introduction to Recommendation Engines

Matrix Factorization for Recommendations

Recommendation Engines

Upcoming Lesson

Sentiment Prediction RNN

Convolutional Neural Networks

Transfer Learning

Weight Initialization

Autoencoders

Job Search

Find your dream job with continuous learning and constant effort

Refine Your Entry-Level Resume

Craft Your Cover Letter

Optimize Your GitHub Profile

Develop Your Personal Brand

01. Introduction

Data Pipelines: ETL vs ELT

A data pipeline is a generic term for moving data from one place to another; for example, it could be moving data from one server to another.

ETL

An ETL pipeline is a specific, and very common, kind of data pipeline. ETL stands for Extract, Transform, Load. Imagine that you have a database containing web log data. Each entry contains the IP address of a user, a timestamp, and the link that the user clicked.

What if your company wanted to run an analysis of links clicked by city and by day? You would need another data set that maps IP addresses to cities, and you would also need to extract the day from each timestamp. With an ETL pipeline, you could run code once per day that would extract the previous day’s log data, map each IP address to a city, aggregate link clicks by city, and then load the results into a new database. That way, a data analyst or scientist would have access to a table of log data by city and day, which is more convenient than always having to run the same complex data transformations on the raw web log data.
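
To make that concrete, here is a minimal sketch of such a daily ETL job in Python with pandas. The web_logs table, its columns, and the ip_to_city() lookup are hypothetical stand-ins; a real job would point at your own log store and a proper geolocation dataset.

```python
# Minimal sketch of the daily ETL job described above (hypothetical
# table names and columns; ip_to_city() is a placeholder lookup).
import sqlite3
from datetime import date, timedelta

import pandas as pd


def ip_to_city(ip_address):
    """Placeholder geolocation lookup; a real pipeline would use a
    dataset or service such as MaxMind GeoIP."""
    return "unknown"


def run_daily_etl(conn):
    yesterday = (date.today() - timedelta(days=1)).isoformat()

    # Extract: pull only the previous day's log entries.
    logs = pd.read_sql(
        "SELECT ip_address, timestamp, link FROM web_logs "
        "WHERE date(timestamp) = ?",
        conn,
        params=(yesterday,),
    )

    # Transform: map each IP address to a city, pull the day out of
    # the timestamp, then aggregate link clicks by city and day.
    logs["city"] = logs["ip_address"].apply(ip_to_city)
    logs["day"] = pd.to_datetime(logs["timestamp"]).dt.date
    clicks = (
        logs.groupby(["city", "day"])
        .size()
        .reset_index(name="click_count")
    )

    # Load: write the aggregated results to a table that analysts
    # can query directly, without touching the raw logs.
    clicks.to_sql("clicks_by_city_day", conn, if_exists="append", index=False)


if __name__ == "__main__":
    run_daily_etl(sqlite3.connect("weblogs.db"))
```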

Before cloud computing, businesses stored their data on large, expensive, private servers. Running queries on large data sets, like raw web log data, could be expensive both economically and in terms of time. But data analysts might need to query a database multiple times even in the same day; hence, pre-aggregating the data with an ETL pipeline makes sense.

ELT

ELT (Extract, Load, Transform) pipelines have gained traction since the advent of cloud computing. Cloud computing has lowered the cost of storing data and running queries on large, raw data sets. Many of these cloud services, like Amazon Redshift, Google BigQuery, and IBM Db2, can be queried using SQL or a SQL-like language. With these tools, the data gets extracted, then loaded directly, and finally transformed at the end of the pipeline.
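
For contrast, here is a minimal sketch of the ELT pattern using BigQuery's Python client. The bucket, dataset, and table names (including the ip_city_map lookup table) are hypothetical; the point is that the raw logs are loaded untransformed and the aggregation runs inside the warehouse at query time.

```python
# Minimal sketch of the ELT pattern with BigQuery's Python client.
# Bucket, dataset, and table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# Extract + Load: copy the raw log files into the warehouse as-is,
# with no transformation on the way in.
load_job = client.load_table_from_uri(
    "gs://example-bucket/web_logs/*.csv",
    "analytics.raw_web_logs",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# Transform: the aggregation happens inside the warehouse at query
# time, joining against a (hypothetical) IP-to-city lookup table.
query = """
    SELECT m.city, DATE(l.timestamp) AS day, COUNT(*) AS click_count
    FROM analytics.raw_web_logs AS l
    JOIN analytics.ip_city_map AS m
      ON l.ip_address = m.ip_address
    GROUP BY m.city, day
"""
for row in client.query(query).result():
    print(row.city, row.day, row.click_count)
```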

However, ETL pipelines are still used even with these cloud tools. Oftentimes, it still makes sense to run ETL pipelines and store data in a more readable or intuitive format. This can help data analysts and scientists work more efficiently, as well as help an organization become more data-driven.