Distributed and Scalable Data Engineering – F’20

DSCI 6007 Distributed and Scalable Data Engineering
Fall 2020
Meeting Times and Location(s): Hybrid, Tuesdays 3:55pm – 6:45pm, BCKM 233A
Credit Hours: 3
Faculty Contact Information:
Dr. Vahid Behzadan, Assistant Professor
Email: vbehzadan@newhaven.edu
Phone: 203-479-472

COURSE SYLLABUS

This syllabus is informational in nature and is not an express or implied contract. It is subject to change due to unforeseen circumstances, as a result of any circumstance outside the University’s control, or as other needs arise. If, in the University’s sole discretion, public health conditions or any other matter affecting the health, safety, upkeep or wellbeing of our campus community or operations requires the University to make any syllabus or course changes or move to remote teaching, alternative assignments may be provided so that the learning objectives for the course, as determined by the University, can still be met. The University does not guarantee that this syllabus will not change, nor does it guarantee specific in-person, on-campus classes, activities, opportunities, or services or any other particular format, timing, or location of education, classes, activities, or services.

Course Description:

Advanced topics in “Big Data” infrastructure and architectures focusing on computing resources and programming environments to support the development of efficiently scalable high-volume distributed machine learning algorithms.

Required Text(s):

None. Data Engineering is a new and evolving field, and no standard, up-to-date book covers it completely. We will post readings for each day: video tutorials, book chapters, and blog posts.

Optional References:

Course Structure/Course Format/Course Objectives:

This course is an “active” learning environment. You’ll learn through doing. The focus will be applying concepts to data through programming.

Before class you will complete preparation materials (e.g., watch videos, read chapters, and complete workbooks). All preparation materials should be covered prior to the start of each class session. These are always required unless explicitly labeled as optional. These materials will be resources for factual knowledge. We will not be delivering traditional lectures. You are expected to be familiar with the basic concepts and technical jargon by the start of class.

In-class time is precious. We'll reserve it for discussion, presenting complex material, answering questions, and working on exercises.

Course Objectives:

By the end of this course, students will be able to:

  • Install and run a Linux virtual machine locally and in the cloud
  • Utilize *NIX command line tools to manipulate and analyze data
  • Deploy and manipulate data and working code in the cloud
  • Write complex SQL queries
  • Design a database that conforms to the third normal form (3NF)
  • Design, create, and query NoSQL databases
  • Identify embarrassingly parallelizable tasks and parallelize them
  • Describe and apply the MapReduce algorithm
  • Describe and apply Spark’s Dataset abstraction
  • Apply machine learning in a distributed architecture
  • Analyze streaming data in “real time”
  • Apply probabilistic data structures to handle high volume/velocity data
  • Build and use an information retrieval (IR) or search engine
  • Build an end-to-end distributed data-pipeline

Student Learning Outcomes:

Demonstrate achievement of course objectives in class discussion, lab assignments, and projects.

Course Requirements & Assessment:

Please see official University of New Haven Academic Policies located in the links below:

Assignments/Projects:

  • All work must be turned in via Canvas, unless otherwise specified (some lab assignments will require submission on GitHub or AWS). Please turn in whatever you have for participation credit, even if incomplete.
  • Pen-and-paper quizzes will also be required. However, submission will need to be in the form of scanned/photographed copies through Canvas.

Participation:

Active-learning techniques will be used, such as group discussions and “think-pair-share”, requiring students to work individually and/or with other students. Refusal to participate will be treated as absence from class and ultimately lead to dismissal from the class (see University Policies).

Midterm and Final Projects:

The midterm and final projects aim to evaluate the students’ ability to leverage the skills and materials covered in the lectures and labs in solving realistic problems in data engineering. The assessment of both midterm and final projects will be based on the outcome, demonstrated in a written technical report, as well as class presentations.

Grading:

Grades are based on your performance in class participation (including quizzes), lab assignments, and two class projects.

Participation (incl. quizzes)    10%
Labs                             30%
Midterm Project                  10%
Final Project                    50%
Total**                          100%
**Final grades are assigned on the following scale:

Typical Graduate Scale
Score range and its letter equivalent:
97 to 100 — A+
94 to Less than 97 — A
90 to Less than 94 — A-
87 to Less than 90 — B+
84 to Less than 87 — B
80 to Less than 84 — B-
77 to Less than 80 — C+
74 to Less than 77 — C
70 to Less than 74 — C-
Less than 70 — F
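
The component weights and the letter scale combine as a weighted average. The following is a minimal sketch in Python, assuming each component is scored out of 100 and that no rounding is applied before the letter is assigned (the syllabus specifies neither). The function names and the example scores are illustrative only.

```python
# Weights from the grading table (percent of the final grade).
WEIGHTS = {
    "participation": 10,
    "labs": 30,
    "midterm_project": 10,
    "final_project": 50,
}

# Lower bounds of the letter scale, highest first.
SCALE = [
    (97, "A+"), (94, "A"), (90, "A-"),
    (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"),
]


def final_score(scores):
    """Weighted average of component scores, each assumed to be out of 100."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a numeric score to a letter using the scale's lower bounds."""
    for lower_bound, letter in SCALE:
        if score >= lower_bound:
            return letter
    return "F"


# Hypothetical student: strong participation, weaker midterm project.
example = {
    "participation": 100,
    "labs": 90,
    "midterm_project": 80,
    "final_project": 85,
}
total = final_score(example)
print(total, letter_grade(total))
```

Because the final project carries half the weight, it dominates the result: in this example a 5-point change in the final project score moves the total by 2.5 points.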

Expectations:

Students should expect to spend at least 3 hours on academic studies outside, and in addition to, each hour of class time. There will be readings, simple questions/problems, labs, and projects.

Individual Work:

Students must work individually on assignments and projects unless specifically allowed to work in groups by the instructor. Any work taken from the internet must be cited properly (acceptance of code taken from elsewhere is at the discretion of the instructor) or will be considered plagiarism. Failure to adhere to this policy will result in penalties ranging from a zero on the assignment to a zero in the final grade. Students may also be subject to disciplinary action by the University of New Haven (see University Policies).

Course Outline/Schedule:

Day/Date    Topic/Note
8/25        Welcome to data engineering
9/1         Internet, HTTP, and HTML
9/8         Linux, Virtualization & the Cloud
9/15        Databases – Intro to NoSQL
9/22        Databases – Advanced SQL
9/29        Intro to Parallelization and MapReduce
10/6        Intro to Spark
10/13       Intro to Spark (continued)
10/20       More Spark – Midterm Project Announced
10/27       Even More Spark – Designing Big Data Systems – Midterm Project Due
11/3        Streaming – Introduction to Machine Learning in Spark
11/10       Streaming (continued) – Final Project Topic Selection
11/17       Introduction to Kubernetes
11/24       Full-Stack Deep Learning
12/8        Final Project Presentations
12/10       Final Project Report Due

Reporting Bias Incidents:

At the University of New Haven, there is an expectation that all community members are committed to creating and supporting a climate which promotes civility, mutual respect, and open-mindedness. There also exists an understanding that with the freedom of expression comes the responsibility to support community members’ right to live and work in an environment free from harassment and fear. It is expected that all members of the University community will engage in anti-bias behavior and refrain from actions that intimidate, humiliate, or demean persons or groups or that undermine their security or self-esteem. (Reporting Options).

University-wide Academic Policies:

A continually updated list of University-wide academic policies and descriptions of key university student resources can be found on Canvas. You can access them by clicking on the (?) help button.

The University-wide academic policies include (but are not limited to) the University’s attendance policy, procedures for both adding / dropping a course and course withdrawals, an explanation for the sorts of circumstances where incomplete (INC) grades could be considered by the faculty, and the academic integrity policy (among others).  Also in this location you will find information regarding the process for reporting bias and topics related to our maintaining a positive learning environment (including, but not limited to, discrimination and sexual misconduct). 

The list of key university student resources to enable learning includes (but is not limited to) the University’s Center for Student Success, Writing Center, Center for Learning Resources, and the Accessibility Resource Center.