DSCI 6007
Distributed and Scalable Data Engineering
Fall 2021
Meeting Times and Location(s): N/A – Online Asynchronous
Credit Hours: 3
Vahid Behzadan, Ph.D.
Faculty Contact Information:
Office Location: Maxcy 120A or Zoom
Phone: 203-479-4723 Email: vbehzadan@newhaven.edu
Office Hours: MW 12pm-1pm or by request
Department Chair: Dr. Ali Golbazi agolbazi@newhaven.edu
Course Description:
Advanced topics in “Big Data” infrastructure and architectures focusing on computing resources
and programming environments to support the development of efficiently scalable high-volume
distributed machine learning algorithms.
Required Text(s):
None. Data Engineering is a new and evolving field, and there is no standard book that covers it
completely and is current. We will post readings for each day. They will be video tutorials, book
chapters, and blog posts.
Optional References:
- The Data Engineering Cookbook by Andreas Kretz. Open-source work in progress.
- Designing Data-Intensive Applications (DDIA) by Martin Kleppmann. Clear, concise, and practical. Currently available only as a preview edition; a game changer when finished.
- Big Data by Nathan Marz with James Warren. Much of the technology has changed since that book was written but the basic principles are the same.
- Learning Spark. Spark is the new dominant analytics framework. This is an accessible introduction.
- Advanced Analytics with Spark. Learn how to leverage Spark to solve Data Science problems through guided projects.
- The Manga Guide to Databases. Learn databases without the tedium.
Other Materials/Supplies:
The class will be delivered online via Canvas. External tutorials, reading materials, and references
will be provided. Class projects require access to a computer capable of running an Ubuntu 20.04
Virtual Machine on VirtualBox. Cloud-based exercises will be on Amazon AWS via a free AWS
Academy account that will be provided to the students.
Course Structure/Course Format/Course Objectives:
This course is offered as an online asynchronous class: recorded lectures, external materials,
and projects will be posted weekly on Friday afternoons. Several TA and instructor office hours
will be held throughout the week to help with questions. The focus will be applying concepts to
data through programming.
Course Objectives:
By the end of this course, students will be able to:
- Install and run a Linux virtual machine locally and in the cloud
- Utilize *NIX command line tools to manipulate and analyze data
- Deploy and manipulate data and working code in the cloud
- Write complex SQL queries
- Design a database that conforms to the third normal form (3NF)
- Design, create, and query NoSQL databases
- Identify embarrassingly parallelizable tasks and parallelize them
- Describe and apply the MapReduce algorithm
- Describe and apply Spark’s Dataset abstraction
- Apply machine learning in a distributed architecture
- Analyze streaming data in “real time”
- Apply probabilistic data structures to handle high volume/velocity data
- Build and use an information retrieval (IR) or search engine
- Build an end-to-end distributed data-pipeline
Student Learning Outcomes:
Demonstrate achievement of course objectives in class discussion, lab assignments, and projects.
Course Requirements & Assessment:
Please see official University of New Haven Academic Policies located in the links below:
- All work must be turned in via Canvas, unless otherwise specified (some lab assignments will require submission on GitHub or AWS). Please turn in whatever you have for participation credit, even if incomplete.
- Pen-and-paper quizzes will also be required. However, submission will need to be in the form of scanned/photographed copies through Canvas.
Active-learning techniques will be used, such as group discussions and “think-pair-share”, requiring students to work individually and/or with other students. Refusal to participate will be treated as absence from class and ultimately lead to dismissal from the class (see University Policies).
Midterm and Final Projects:
The midterm and final projects aim to evaluate the students’ ability to leverage the skills and materials covered in the lectures and labs in solving realistic problems in data engineering. The assessment of both midterm and final projects will be based on the outcome, demonstrated in a written technical report, as well as class presentations.
Grades earned are based on your performance on class participation (including quizzes), lab assignments, and 2 class projects.
Participation (incl. quizzes) | 10% |
Labs | 25% |
Midterm Project | 15% |
Final Project | 50% |
Total | 100% |
Typical Undergraduate Scale |
Score Range | Letter Equivalent |
97 to 100 | A+ |
94 to less than 97 | A |
90 to less than 94 | A- |
87 to less than 90 | B+ |
84 to less than 87 | B |
80 to less than 84 | B- |
77 to less than 80 | C+ |
74 to less than 77 | C |
70 to less than 74 | C- |
67 to less than 70 | D+ |
64 to less than 67 | D |
60 to less than 64 | D- |
Less than 60 | F |
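As a rough illustration of how the weights and the letter scale above combine, the following Python sketch computes a weighted course score and maps it to a letter. It assumes each component is scored on a 0-100 scale and that no rounding is applied before the letter lookup; the function names and the example student are hypothetical, and the instructor's actual calculation in Canvas may differ.

```python
# Weights from the grading table above (percent of the final grade).
WEIGHTS = {
    "participation": 10,
    "labs": 25,
    "midterm_project": 15,
    "final_project": 50,
}

# Lower bounds from the letter scale above, highest first.
SCALE = [
    (97, "A+"), (94, "A"), (90, "A-"),
    (87, "B+"), (84, "B"), (80, "B-"),
    (77, "C+"), (74, "C"), (70, "C-"),
    (67, "D+"), (64, "D"), (60, "D-"),
]


def course_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Return the first letter whose lower bound the score meets, else F."""
    for lower_bound, letter in SCALE:
        if score >= lower_bound:
            return letter
    return "F"


# Hypothetical student, for illustration only.
example = {
    "participation": 100,
    "labs": 90,
    "midterm_project": 80,
    "final_project": 85,
}
total = course_score(example)
print(total, letter_grade(total))  # 87.0 B+
```

Because the final project carries half of the weight, a change of 10 points on it moves the course score by 5 points, while the same change in participation moves it by only 1 point.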
Students should expect to spend at least 9 hours per week on academic work for this course.
There will be readings, simple questions/problems, lab assignments and projects. Students must
work individually on assignments and projects unless specifically allowed to work in groups by
the instructor. Any work taken from the internet must be cited properly (acceptance of code taken
from elsewhere is at the discretion of the instructor) or will be considered plagiarism. Failure to
adhere to this policy will result in penalties ranging from a zero on the assignment to a zero in the
final grade. Students may also be subject to disciplinary action by the University of New Haven
(see University Policies).
Course Outline/Schedule:
Day/Date | Topic/Note |
Week 1 | Welcome to data engineering |
Week 2 | Internet, HTTP, and HTML |
Week 3 | Linux, Virtualization & the Cloud |
Week 4 | Databases – Intro to NoSQL |
Week 5 | Databases – Advanced SQL |
Week 6 | Intro to Parallelization and MapReduce |
Week 7 | Intro to Spark |
Week 8 | More Spark – Midterm Project Announced |
Week 9 | Even More Spark – Designing Big Data Systems – Midterm Project Due |
Week 10 | Streaming – Introduction to Machine Learning in Spark |
Week 11 | Streaming – Continued |
Week 12 | Introduction to Kubernetes – Final Project Topic Selection |
Week 13 | Full-Stack Deep Learning – I |
Week 14 | Full-Stack Deep Learning – II |
Week 15 | Final Project Presentations |
Finals Week | Final Project Report Due |
Diversity Statement
The University of New Haven embraces diversity and recognizes our responsibility to foster a diverse, inclusive, and welcoming environment in which all members of the Charger community of all backgrounds and identities can learn, work, and live together. We benefit from the academic, social, and cultural developments that arise from a diverse campus that is committed to equity, inclusion, belonging, and accountability.
We have a responsibility as a community and as individuals to address and remove barriers, achieve success, and sustain a culture of inclusivity, empathy, kindness, and compassion. We encourage, welcome, and embrace participation in ongoing dialogue, engagement, and education to critically examine and thoughtfully respond to the changing realities of our community.
Diversity, equity, inclusion, acceptance, and belonging enrich the Charger community and are instrumental to institutional success and fulfillment of the University mission.
Reporting Bias Incidents
At the University of New Haven, there is an expectation that all community members are committed to creating and supporting a climate which promotes civility, mutual respect, and open-mindedness. There also exists an understanding that with the freedom of expression comes the responsibility to support community members’ right to live and work in an environment free from harassment and fear. It is expected that all members of the University community will engage in anti-bias behavior and refrain from actions that intimidate, humiliate, or demean persons or groups or that undermine their security or self-esteem.
If you have an immediate safety concern for yourself or others, and/or believe someone poses an immediate threat to themselves or others, please contact University Police at 203-932-7070 or call 911. Community members can report bias-motivated incidents by completing the form at www.newhaven.edu/biasreporting. Community members are encouraged to complete this form if they are the target of bias or harassing behaviors, witness such behaviors, or gain knowledge of these behaviors occurring within the University community. All matters concerning bias and harassment will be handled by the Dean of Students Office and Human Resources Office.
