One of the biggest holes in how data science (and computer science) is taught is the lack of vocational courses for how to use data. Most CS programs will have a database course which covers the core trade-offs around building a database – but will only provide a cursory glance toward how to write SQL. Similarly when teaching how to manipulate data in Python there is usually only a week or two spent on using Pandas, despite it being the most common method for exploring data.

The notes that can be found herein represent a fingers-on-the-keys approach to teaching these two topics. These note present a distinctly practical approach to learning how to use these two tools. This course has been taught at the undergraduate, graduate and executive certificate level in a variety of different course structures.

This page contains the most recent version of those notes which is currently taught to students at the MS-CAPP Program at the University of Chicago. Versions of this course have also been taught at the undergraduate and executive education levels.

The data used in this course can be found in the repository here. The table of contents with links to specific chapters can be found below. Please note that this is all a work in progress and each time I teach this there are both minor and major changes that occur.

You can find a combined PDF with all notes here. Specific chapters can be found below, but links do not currently work in the chapter specific PDFs.

.

Table of Contents

Introduction and Errata Introduction
Relational Databases
Rows and Columns
  • What is a Relational Database
  • Selecting Columns
  • WHERE: Filtering rows
  • Null
  • ORDER BY and LIMIT
  • Column Numbering
  • Where are we: A Note on Scope
Chapter 1
Basic Manipulations
  • Types
  • Renaming a Column
  • Basic Mathematical Manipulations, ABS and LEAST/GREATEST
  • Queries without a FROM Clause and Singletons
  • String Functions: LEFT, RIGHT, LOWER, UPPER, LENGTH, TRIM and CONCAT
  • ROUND and Changing Types (CAST)
  • CAST and changing types
Chapter 2
Subqueries, Distinct & Case
  • Query Evaluation Order: SELECT and WHERE
  • Comparisons: BETWEEN, LIKE and ILIKE
  • CASE: Conditional Logic
  • The DISTINCT Operator
  • Subqueries (IN, ANY, ALL)
  • Correlated Subqueries
Chapter 3
Database Internals: Transactions
  • REDO / COMBINE NEXT SECTIONS
  • Table Creation and Deletion
  • Database Operations: CRUD
  • Creating Tables, Constraints and Deleting tables
  • Altering Tables
  • Inserting, Copying, Updating and Deleting
  • Transactions and ACID
  • Isolation Levels in Relational Databases
  • Why do we care (NoSQL)?
  • NoSQL
  • Transaction Implementations [TBD]
Chapter 4
Aggregations
  • Introduction to MTA data set
  • GROUP BY clause
  • Column numbering syntax
  • Aggregates and CASE Statements
  • Named Subqueries
Chapter 5
Dates and Types
  • Date Types
  • Date Functions
  • Hard GROUP BY problems
Chapter 6
Averages
  • The Trouble with Averages
  • HAVING
  • COALESCE and NVL
Chapter 7
Joins
  • Joins
  • UNION and UNION ALL
  • Best Practices when Combining Tables
  • Intermediate Joins
  • Statistical Analysis in SQL
Chapter 8
Advanced Joins
  • The Shape of Data
  • Revenue over time & Advanced Joins
Chapter 9
Analytic Functions & CTE's
  • Analytic Functions
  • Using Analytic Functions with Transaction Data
  • Common Table Expressions (``CTE")
  • CTEs with the transaction data
Chapter 10
Database Internals: Performance Evaluation
  • Normalization
  • Views
  • Information Schema
  • Performance Considerations
  • Index
  • Distributed Systems and the CAP Theorem
Chapter 11
Extensions [TBD]
  • More Advanced Joins
  • OLAP: Cube and Rollup
  • Schemas
  • Keys
  • Data Exploration Strategies
  • Query Strategies
Chapter 12
Interview Hints
  • Interview Hints
  • Example Interview #1
  • Example Interview #2
  • Example Interview #3
  • Example Interview #4
Chapter 13
Pandas
Introduction
  • What is Pandas
  • Data structures
  • Selecting Columns and Rows
  • Column Types Conversion
  • Dealing with NaN
  • Choosing the largest and smallest values
  • Manipulating Data & Method Chaining
  • Indexes: Creating and Dropping
  • Views and Copies
Chapter 14
More Manipulations and Types
  • Sorting DataFrames
  • Dealing with Duplicates
  • Using Type specific functions
  • CASE style statements and the ``isin" operator
  • Regex Pattern Matching
Chapter 15
Aggregations
  • Introduction to the MTA dataset
  • Simple Aggregations
  • GroupBy Objects
  • Advanced Index / Multiindex
  • If not indexes...
  • Indexing with aggregations, a big Gotcha
Chapter 16
Joins
  • Helpful Table / Review
  • Merging data in Pandas
  • Complex Join Conditions
  • Stacking Data
  • Lags and Leads
  • Apply, map and applymap: Advanced Transformations
Chapter 17
Window Functions
  • Window Functions in Pandas
  • Some gotchas
  • Reshaping Data: Transpose, Stack and Unstack
  • A Bunch of stuff to clean up
  • Combining with the original DataFrame
  • Moving the Window
  • Pivot / Melt
Chapter 18
Appendix
Data Dictionaries
  • Introduction
  • Iowa Fleet data
  • NY MTA Data
  • Daily Stock Data: s2010 and s2011
  • Annual Fundamental Financial information: fnd
  • Soap Transaction Data
Appendix A
Connecting SQL to Python or R
  • Connecting to any database: ODBC and JDBC
  • Connecting only to PostgreSQL
Appendix B
Assignments
Example Exams