DSCI 6007 Distributed and Scalable Data Engineering
Fall 2020
Meeting Times and Location(s): Hybrid Tuesdays 3:55pm – 6:45pm BCKM 233A
Credit Hours: 3
Faculty Contact Information:
Dr. Vahid Behzadan, Assistant Professor
Email: vbehzadan@newhaven.edu
Phone: 203-479-472
COURSE SYLLABUS
This syllabus is informational in nature and is not an express or implied contract. It is subject to change due to unforeseen circumstances, as a result of any circumstance outside the University’s control, or as other needs arise. If, in the University’s sole discretion, public health conditions or any other matter affecting the health, safety, upkeep or wellbeing of our campus community or operations requires the University to make any syllabus or course changes or move to remote teaching, alternative assignments may be provided so that the learning objectives for the course, as determined by the University, can still be met. The University does not guarantee that this syllabus will not change, nor does it guarantee specific in-person, on-campus classes, activities, opportunities, or services or any other particular format, timing, or location of education, classes, activities, or services.
Course Description:
Advanced topics in “Big Data” infrastructure and architectures focusing on computing resources and programming environments to support the development of efficiently scalable high-volume distributed machine learning algorithms.
Required Text(s):
None. Data Engineering is a new and evolving field, and there is no standard book that covers it completely and is current. We will post readings for each day. They will be video tutorials, book chapters, and blog posts.
Optional References:
- The Data Engineering Cookbook by Andreas Kretz. Open-source work in progress.
- Designing Data-Intensive Applications (DDIA) by Martin Kleppmann. Clear, concise, and practical. Right now preview edition only, a game changer when finished.
- Big Data by Nathan Marz with James Warren. Much of the technology has changed since that book was written but the basic principles are the same.
- Learning Spark. Spark is the new dominant analytics framework. This is an accessible introduction.
- Advanced Analytics with Spark. Learn how to leverage Spark to solve Data Science problems through guided projects.
- The Manga Guide to Databases. Learn databases without the tedium.
Course Structure/Course Format/Course Objectives:
This course is an “active” learning environment. You’ll learn through doing. The focus will be applying concepts to data through programming.
Before class you will complete preparation materials (e.g., watch videos, read chapters, and complete workbooks). All preparation materials should be covered prior to the start of each class session. These are always required unless explicitly labeled as optional. These materials will be resources for factual knowledge. We will not be delivering traditional lectures. You are expected to be familiar with the basic concepts and technical jargon by the start of class.
In-class time is precious, so we’ll reserve it for discussion, presenting complex material, answering questions, and working on exercises.
Course Objectives:
By the end of this course, students will be able to:
- Install and run a Linux virtual machine locally and in the cloud
- Utilize *NIX command line tools to manipulate and analyze data
- Deploy and manipulate data and working code in the cloud
- Write complex SQL queries
- Design a database that conforms to the third normal form (3NF)
- Design, create, and query NoSQL databases
- Identify embarrassingly parallelizable tasks and parallelize them
- Describe and apply the MapReduce algorithm
- Describe and apply Spark’s Dataset abstraction
- Apply machine learning in a distributed architecture
- Analyze streaming data in “real time”
- Apply probabilistic data structures to handle high volume/velocity data
- Build and use an information retrieval (IR) or search engine
- Build an end-to-end distributed data-pipeline
Student Learning Outcomes:
Demonstrate achievement of course objectives in class discussion, lab assignments, and projects.
Course Requirements & Assessment:
Please see official University of New Haven Academic Policies located in the links below:
Assignments/Projects:
- All work must be turned in via Canvas, unless otherwise specified (some lab assignments will require submission on GitHub or AWS). Please turn in whatever you have for participation credit, even if incomplete.
- Pen-and-paper quizzes will also be required. However, submission will need to be in the form of scanned/photographed copies through Canvas.
Participation:
Active-learning techniques will be used, such as group discussions and “think-pair-share”, requiring students to work individually and/or with other students. Refusal to participate will be treated as absence from class and ultimately lead to dismissal from the class (see University Policies).
Midterm and Final Projects:
The midterm and final projects aim to evaluate the students’ ability to leverage the skills and materials covered in the lectures and labs in solving realistic problems in data engineering. The assessment of both midterm and final projects will be based on the outcome, demonstrated in a written technical report, as well as class presentations.
Grading:
Grades are based on your performance in class participation (including quizzes), lab assignments, and two class projects.
Participation (incl. quizzes) | 10% |
Labs | 30% |
Midterm Project | 10% |
Final Project | 50% |
Total | 100% |
Typical Graduate Scale |
Grades Scored Between & Their Letter Equivalent |
97 to 100 — A+ |
94 to Less than 97 — A |
90 to Less than 94 — A- |
87 to Less than 90 — B+ |
84 to Less than 87 — B |
80 to Less than 84 — B- |
77 to Less than 80 — C+ |
74 to Less than 77 — C |
70 to Less than 74 — C- |
Less than 70 — F |
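As an illustration only, the weights and scale above can be combined as in the following minimal sketch. The function names, the component keys, and the sample scores are hypothetical, and the sketch assumes each component is scored from 0 to 100 with no rounding applied; the syllabus does not specify a rounding rule, so the instructor's actual calculation may differ.

```python
# Weights from the grading table (percent of the final grade).
WEIGHTS = {
    "participation": 10,
    "labs": 30,
    "midterm_project": 10,
    "final_project": 50,
}

# Lower bounds from the graduate scale, highest first.
SCALE = [
    (97, "A+"), (94, "A"), (90, "A-"),
    (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"),
]


def weighted_total(scores):
    """Combine per-component scores (each 0-100) into a 0-100 course total."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(total):
    """Map a 0-100 total to a letter; anything below 70 is an F."""
    for lower_bound, letter in SCALE:
        if total >= lower_bound:
            return letter
    return "F"


# Hypothetical student scores, for illustration only.
example = {
    "participation": 100,
    "labs": 90,
    "midterm_project": 80,
    "final_project": 85,
}
total = weighted_total(example)  # (10*100 + 30*90 + 10*80 + 50*85) / 100
print(total, letter_grade(total))  # prints: 87.5 B+
```

Because the final project carries half of the total weight, a change of 10 points on it moves the course total by 5 points, while the same change on the midterm project moves it by only 1.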
Expectations:
Students should expect to spend at least 3 hours on academic studies outside of class for each hour of class time. There will be readings, simple questions/problems, labs, and projects.
Individual Work:
Students must work individually on assignments and projects unless specifically allowed to work in groups by the instructor. Any work taken from the internet must be cited properly (acceptance of code taken from elsewhere is at the discretion of the instructor) or will be considered plagiarism. Failure to adhere to this policy will result in penalties ranging from a zero on the assignment to a zero in the final grade. Students may also be subject to disciplinary action by the University of New Haven (see University Policies).
Course Outline/Schedule:
Day/Date | Topic/Note |
8/25 | Welcome to data engineering |
9/1 | Internet, HTTP, and HTML |
9/8 | Linux, Virtualization & the Cloud |
9/15 | Databases – Intro to NoSQL |
9/22 | Databases – Advanced SQL |
9/29 | Intro to Parallelization and MapReduce |
10/6 | Intro to Spark |
10/13 | Intro to Spark (continued) |
10/20 | More Spark – Midterm Project Announced |
10/27 | Even More Spark – Designing Big Data Systems – Midterm Project Due |
11/3 | Streaming – Introduction to Machine Learning in Spark |
11/10 | Streaming (continued) – Final Project Topic Selection |
11/17 | Introduction to Kubernetes |
11/24 | Full-Stack Deep Learning |
12/8 | Final Project Presentations |
12/10 | Final Project Report Due |
Reporting Bias Incidents:
At the University of New Haven, there is an expectation that all community members are committed to creating and supporting a climate which promotes civility, mutual respect, and open-mindedness. There also exists an understanding that with the freedom of expression comes the responsibility to support community members’ right to live and work in an environment free from harassment and fear. It is expected that all members of the University community will engage in anti-bias behavior and refrain from actions that intimidate, humiliate, or demean persons or groups or that undermine their security or self-esteem. (Reporting Options).
University-wide Academic Policies:
A continually updated list of University-wide academic policies and descriptions of key university student resources can be found on Canvas. You can access them by clicking on the (?) help button.
The University-wide academic policies include (but are not limited to) the University’s attendance policy, procedures for both adding / dropping a course and course withdrawals, an explanation for the sorts of circumstances where incomplete (INC) grades could be considered by the faculty, and the academic integrity policy (among others). Also in this location you will find information regarding the process for reporting bias and topics related to our maintaining a positive learning environment (including, but not limited to, discrimination and sexual misconduct).
The list of key university student resources to enable learning includes (but is not limited to) the University’s Center for Student Success, Writing Center, Center for Learning Resources, and the Accessibility Resource Center.