NICO 101-0: Introduction to Programming for Big Data

Fall 2017 - Professors Luis Amaral and Adam Pah

Lectures: September 6-8 and 12-15 from 9:30am-12:00pm & 1:30pm-4:30pm in L361.

Overview: Our digital, connected, sensor-rich world is generating extraordinary amounts of data ("Big Data") that are being used for purposes as diverse as teaching a computer to win at Jeopardy or offering taxi alternatives. The skills needed to go from data to knowledge and application, which go under the name of Data Science, are in high demand in industry, government, and academia. This course provides an introduction to the foundational skills needed by data scientists. Prior knowledge of programming is not needed.

Prerequisites: None.

Restrictions: Intended primarily for undergraduate students. Other students must contact the instructor. Students will need an up-to-date laptop running Linux, OS X, or Windows 7 or higher. Chromebooks will not be permitted. Prior to the start of the course, students must install several packages and verify that they run properly on their machines.

Texts: Lecture materials are available online.

Requirements: There will be about 6 homework assignments that involve writing Python code to solve specific problems. Students’ solutions will be uploaded to a server where they will be unit tested. All students will be expected to attend lectures and complete in-class assignments.
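For students who have not seen unit testing before, the following is a minimal sketch of what a unit-tested homework solution might look like. The function name `count_vowels`, the problem itself, and the test cases are hypothetical illustrations; the actual assignments and the tests run on the course server may differ.

```python
import unittest


def count_vowels(text):
    """Return the number of vowels (a, e, i, o, u) in text, ignoring case."""
    # Hypothetical homework function, not an actual course assignment.
    return sum(1 for char in text.lower() if char in "aeiou")


class TestCountVowels(unittest.TestCase):
    """Each test method checks one expected behavior of the solution."""

    def test_mixed_case(self):
        # "Big Data" contains the vowels i, a, a.
        self.assertEqual(count_vowels("Big Data"), 3)

    def test_empty_string(self):
        # An empty string has no vowels.
        self.assertEqual(count_vowels(""), 0)
```

If this were saved in a file, the tests could be run locally with `python -m unittest` before uploading a solution.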

TOPICS:

  • Examples of problems amenable to computation
  • Overview of computer hardware & different filesystems
  • The Zen of Python: Code style & commenting
  • Using IPython notebook
  • Basic Python data types: Integers, floats, strings, & lists
  • Flow control: Loops, conditionals, exceptions
  • Input & output
  • Functions & code modularity
  • The Python standard library: string, math, sys, & so on
  • Sophisticated data types: tuples, sets, & dictionaries
  • Data visualization using matplotlib
  • Numerical computing using numpy & scipy
  • Example: Image processing using numpy
  • Retrieving data from the web using requests & splinter
  • Text analysis & intro to regular expressions
  • Example: Computing with Shakespeare
  • Computing with dates & times
  • Analyzing tabular data using pandas
  • Example: Time series analysis of stock prices
  • Numerical precision & algorithm scaling
  • Statistical analysis with statsmodels
  • Finding other resources

Visit CAESAR to register for the course.

Certificate in Integrated Data Science

Northwestern’s Graduate School offers certificates to help connect graduate students across various departments and programs. Earning a certificate not only formally recognizes a student’s focus on and achievement in a particular topic area, but also helps form a cohort of students and faculty interested in particular interdisciplinary areas. Certificates typically require five courses for completion, and the certificate will appear on your transcript.

Why get a Certificate in Integrated Data Science? Data Science is an emerging field that requires specialized training but also connects very different fields; for instance, it can connect business management with the medical sciences, or astrophysics with Earth science. To address this, the Certificate in Integrated Data Science gives both a strong background in detailed techniques, where necessary, and a broad overview of the variety of techniques appropriate to each domain.

What classes do I need to take to get a Certificate in Integrated Data Science? The Northwestern Certificate in Integrated Data Science launched by the IDEAS Traineeship requires five courses, from three different categories, listed below. Students must take at least one course from Group A, at least two courses from Group B, and at least one course from Group C. Additional certificate options are in development and are anticipated to fit within this same structure.

Visit IDEAS CIERA for more information.

Group A

  • IDS-401: Data-Driven Research in Physics, Geophysics, and Astronomy

This course integrates the domain-focused projects in Physics & Astronomy (P&A) and Earth and Planetary Sciences (EPS) and will be team-taught by one professor from P&A and one from EPS. It will cover one quarter's worth of material but will be spread over two quarters (Fall and Winter every year). It will focus on the science motivation and goals that unite three distinct research projects (LSST, aLIGO, and EarthScope) and on the principles and methods of data analysis. Spreading the course over two quarters will allow alignment and further interdisciplinary integration with IDS-421 and IDS-422.

Prerequisite: None.

Group B

  • IDS-421: Integrated Data Analytics I

also PHYS 441: Statistical Methods for Physicists and Astronomers

Data analysis in the modern age requires familiarity with many concepts and methods from statistics. This course provides an introduction to the basics as well as exposure to some of the most advanced techniques. The emphasis will be on practical problems from physics and astronomy, rather than on theory or on statistical methods from other fields. Prior knowledge of statistics is not required.

Prerequisite: None.

  • IDS-422: Integrated Data Analytics II

also EPS 329: Mathematical Inverse Methods in Earth and Environmental Sciences

This course covers the theory and application of inverse methods to gravity, magnetotelluric, seismic waveform, multilateration, and students’ data. In particular, students will learn about nonlinear, linearized, underdetermined, and mixed-determined problems and solution methods, such as regularized least-squares and neighborhood algorithms.

Prerequisite: MATH 230, STAT 232, or equivalent; MATH 240 or STAT 320-1, 320-2 recommended.

  • IDS-423: Integrated Data Analytics III

also EECS 495: Machine Learning: Foundations, Applications, and Algorithms

From robotics, speech recognition, and analytics to finance and social network analysis, machine learning has become one of the most useful sets of scientific tools of our age. With this course we want to bring interested students and researchers from a wide array of disciplines up to speed on the power and wide applicability of machine learning. The ultimate aim of the course is to equip you with all the modeling and optimization tools you’ll need in order to formulate and solve problems of interest in a machine learning framework. We hope to help build these skills through lectures and reading materials that introduce machine learning in the context of its many applications, as well as by describing in a detailed but user-friendly manner the modern techniques from nonlinear optimization used to solve them. In addition to a well-curated collection of reference materials, registered students will receive a draft of a forthcoming manuscript authored by the instructors on machine learning to use as class notes.

Prerequisite: Students should have a thorough understanding of vector calculus and linear algebra, and have a basic understanding of the Python or MATLAB/OCTAVE programming environments.

Group C

From the Department of Electrical Engineering and Computer Science:

  • Data Management and Information Processing (EECS 317)
  • Machine Learning (EECS 349)
  • Digital Image Processing (EECS 420)
  • Nonlinear Optimization (EECS 479)
  • Probabilistic Graphical Models (EECS 395/495)
  • Statistical Pattern Recognition (EECS 433)
  • Social Media Mining (EECS 510)
  • Geospatial Vision and Visualization (EECS 395/495)
  • Data Science (EECS 395/495)

From the Department of Engineering Sciences and Applied Mathematics:

  • Models in Applied Mathematics (ES_APPM 421-1)
  • Numerical Methods for Random Processes (ES_APPM 448)

From the Department of Statistics:

  • Time Series Analysis (STAT 454)
  • Applied Bayesian Inference (STAT 457)
  • Theory of Data Mining (STAT 461)

From the Department of Industrial Engineering and Management Sciences:

  • Statistical Methods for Data Mining (IEMS 304)