DA 350 - Advanced Methods for Data Analytics

Fall 2022

Your Professor:

Matt Lavin

My Email:

lavinm@denison.edu

My Office:

Burton D. Morgan Center 411

Office Hours

10 to 11:20 a.m. MW by appointment; drop-ins 2:30 p.m. to 3:30 p.m. on Thursdays

Our Classroom:

Burton D. Morgan Center 218

When We Meet:

1:30 p.m. - 2:50 p.m. MWF

Course Description

This course is designed to develop students' understanding of the cutting-edge methods and algorithms of data analytics and how they can be used to answer questions about real-world problems. These methods can learn from existing data to make and evaluate predictions. Students in DA 350 will encounter both supervised and unsupervised methods and will learn about topics such as dimensionality reduction, machine learning techniques, handling missing data, and prescriptive analytics.

The Fall 2022 section of DA 350 is part of a multi-step plan to develop three distinct "flavors" of upper-level data analytics coursework. We will cover all of the most important areas that any student in an Advanced Methods for Data Analytics course would cover, but we will approach them through the lens of descriptive analytics, which will include emphasizing things like modeling for interpretability, using natural language processing (NLP) methods to work with text as data, and designing and deploying computer vision systems. In the spring, Professors Wang and Bonifonte will teach additional flavors of DA 350 focused respectively on predictive and prescriptive methods. The idea of this division is to allow students to choose the flavor that best fits their program of study, and to create a system so that at least some students can take more than one flavor of advanced methods. Your participation in this section, and your feedback at the end of the term, will help us evaluate the efficacy of our approach and help us shape the Data Analytics curriculum.


Office Hours

This semester, I will be using a mix of drop-in office hours and in-person appointments via Google Calendar. For office hours by appointment, visit my appointment page, where you will see a real-time account of when I am available. My standard appointment slots will be divided into 20-minute blocks from 10 to 11:20 a.m. on Mondays and Wednesdays. Note that these appointment slots will disappear from my calendar once I've been booked. Please book appointments at least 24 hours in advance. If I ever need to cancel by-appointment office hours on a given day (say, for example, if I'm ill), I will update the calendar and email anyone with an appointment.

Drop-in office hours will be held in my office from 2:30 to 3:30 p.m. on Thursdays. For these, you will not need an appointment, but I will see students in the order they arrive, so there is no guarantee that I will have time for everyone on a given day. If your question is time sensitive, you should make an appointment. If I ever need to cancel office hours on a given drop-in day (say, for example, if I'm ill), I will e-mail the entire class.


Additional Norms and Policies


Here you will find information on required readings, important university policies, and course-specific policies like attendance and cell phone use.

Required Texts

Bruce, Bruce, and Gedeck, Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. (2nd edition, O'Reilly Media, 2020), ISBN-13: 978-1492072942
Available at the bookstore, or order a print edition online by matching the ISBN (about $15). Supplemental materials are available at https://github.com/gedeck/practical-statistics-for-data-scientists
Additional selected readings will be made available as html or pdf, and linked to the course website or shared via Canvas

Software

All projects in this course will be scripted and analyzed using Python. All of my demos will use Jupyter Notebooks as a programming environment. You are welcome to use Jupyter Notebook, JupyterLab, or any other IDE of your choice when writing code. Most assignments, however, will require you to turn in an .ipynb file along with your written project report. At this stage in the curriculum, it is expected that you can handle most if not all tasks pertaining to installation, version maintenance, path manipulations, and library installations.

Grading and Feedback

As a general rule, the expectations in this course are high, and I'm confident you can all do great work. The feedback I provide on assignments is designed to help you get there. My goal is to provide specific, relevant, and honest feedback when I grade your work. This will include constructive criticism, strategies for improvement, and guidance on how students can achieve success. I will not do "compliment sandwiches" just to begin and end on a positive remark, but this means that, when I praise your work, it's an honest (and I think more meaningful) act of praise. 

Regarding the major assignment rubric, it is adapted from the standards that the data analytics program uses for all its majors. I don't expect your work to meet the same standards as a graduating senior, but I think using the same categories on our rubric will help you keep these standards in mind as you work toward that level.

Item: Description
Assignment Process: All materials are turned in on time and in the right place. Assignment directions are followed. Required components are all present and submitted on time.
Attention to Detail: The project is well organized, flows logically, and follows all the formatting guidelines, including attention to proofreading, proper citations, and language that is appropriate to a well-informed, non-technical reader.
Research Question and Research Design: The project has a focused and well defined research question that can be addressed with computational, data-driven analysis. The focal data set and method(s) are appropriate for the research question.
Data, Visuals, and Code: The data are fully described, properly sourced, and presented in appropriate ways. Visuals (tables, charts, graphs) are used effectively to describe multiple aspects of the research project (data, methods, or results). The paper provides sufficient details and/or points to supplementary materials that make the research reproducible by a technical reader (e.g., detailed footnotes, appendices, GitHub, code, etc.).
Data Analysis Methods: The method(s) used to test the research question is justified, validated, and applied appropriately; the student appropriately describes the strengths and weaknesses of the methods used; outside sources are used to justify how the methods are used and interpreted.
Reporting and Interpretation of Results: The results are interpreted correctly and clearly address the research question; the project discusses its limitations, the extent to which it can be generalized, and expansion to further research.
Ethical Considerations: The writing thoughtfully engages any ethical considerations of using the data, methods, and implications of communicating the findings.

Grade Breakdown

Item (Percentage): Comments
Attendance and Participation (10%): Attendance will be taken every day. Late arrival counts as half an absence. Participation will be assessed using a mix of preparedness, speaking during class discussions, remaining attentive during lectures, and completing in-class assignments.
Algorithm Presentation (5%): Individual assignments
Quizzes (15%): Individual assignments
Midterm Assessment (10%): Take-home, individual assignment, cumulative to date.
Project-Based Assignments and Reports (Labs) (45%): Individual and team-based assignments
Final Assessment (15%): Take-home, individual assignment, cumulative.
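
As a rough illustration of how these weights combine, the following Python sketch computes a weighted course grade. The weights are taken from the Grade Breakdown above; the function name, the shortened "Labs" label, and the sample scores are hypothetical, and the sketch assumes each category is scored on a 0-100 scale.

```python
# Category weights (percent of the course grade) from the Grade Breakdown.
WEIGHTS = {
    "Attendance and Participation": 10,
    "Algorithm Presentation": 5,
    "Quizzes": 15,
    "Midterm Assessment": 10,
    "Labs": 45,
    "Final Assessment": 15,
}


def weighted_course_grade(scores):
    """Combine per-category scores (each 0-100) into a course grade (0-100)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must have exactly one entry per category")
    # Each category contributes its score scaled by its share of the grade.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "Attendance and Participation": 95,
    "Algorithm Presentation": 90,
    "Quizzes": 80,
    "Midterm Assessment": 85,
    "Labs": 88,
    "Final Assessment": 92,
}
print(weighted_course_grade(example_scores))
```

Because labs carry 45% of the grade, a change in the lab average moves the course grade three times as much as the same change in the quiz or final assessment average.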

Late Work

If you have a legitimate emergency such as a serious illness, a mental health emergency, or a death in the family, I will grant an appropriate extension with a new due date. The trade-off is that work turned in this way is probably not going to end up in my hands when I grade everything else, so it's going to get very sparse feedback. If you miss a deadline entirely without getting an extension, you will automatically lose 10 points off the top of your grade for each day it is late, in addition to any points you lose for the quality of the work. Retroactive and last-minute extensions will not be granted.
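
The late penalty amounts to simple arithmetic, sketched here in Python. The function name is hypothetical, and two details are assumptions that the policy does not state: that any part of a day counts as a full day late, and that a score cannot drop below zero.

```python
import math


def apply_late_penalty(quality_score, days_late, points_per_day=10):
    """Subtract 10 points per day late from a score on a 100-point scale."""
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    # Assumption: a partial day late is rounded up to a whole day.
    whole_days = math.ceil(days_late)
    # Assumption: the penalized score is floored at zero.
    return max(0, quality_score - points_per_day * whole_days)


# Hypothetical example: work that would have earned 88, turned in 2 days late.
print(apply_late_penalty(88, 2))  # 88 - 20 = 68
```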

Distractions

Cell phones should be off and put away. Laptops are okay for notes and such but, when laptops are being used, you should not be messaging, using Facebook, etc. I will check screens regularly and give you a verbal warning on your first offense. After that, I reserve the right to ask you to leave class and mark you absent if you are creating a distraction.

Being Prepared for Class

Coming to class prepared means that you have the day's reading in hand (printed or digital) and have come to class with a way to take notes (printed or digital). If you are not prepared for class, I reserve the right to grade as if you were absent for that day. Anything due on a given day is due at the start of class. Any digital submission of material is due by the time class starts on the day the hard copy is due. 

Disability Resources

If you are a student who feels you may need an accommodation based on the impact of a disability, you should contact me privately as soon as possible to discuss your specific needs. I rely on the Academic Resource Center in 020 Higley Hall to verify the need for reasonable accommodations based on documentation on file in that office.

Academic Integrity

Proposed and developed by Denison students, passed unanimously by DCGA and Denison’s faculty, the Code of Academic Integrity requires that instructors notify the Associate Provost of cases of academic dishonesty. Cases are typically heard by the Academic Integrity Board, which determines whether a violation has occurred, and, if so, its severity and the sanctions. In some circumstances the case may be handled through an Administrative Resolution Procedure. Further, the code makes students responsible for promoting a culture of integrity on campus and acting in instances in which integrity is violated.

Academic honesty, the cornerstone of teaching and learning, lays the foundation for lifelong integrity. Academic dishonesty is intellectual theft. It includes but is not limited to providing or receiving assistance in a manner not authorized by the instructor in the creation of work to be submitted for evaluation. This standard applies to all work ranging from daily homework assignments to major exams. Students must clearly cite any sources consulted--not merely for quoted phrases, but also for ideas and information that are not common knowledge. Neither ignorance nor carelessness is an acceptable defense in cases of plagiarism. It is the student’s responsibility to follow the appropriate format for citations. Students should ask their instructors for assistance in determining what sorts of materials and assistance are appropriate for assignments and for guidance in citing such materials clearly.

Our Commitment to Liberal Arts Education

Denison's mission statement articulates an explicit commitment to liberal arts education. It emphasizes active learning, which defines students as active participants in the learning process, not passive recipients. Denison seeks to foster self-determination and to demonstrate the transformative power of education. A crucial aspect of this approach is what Denison's mission statement refers to as "a concern for the whole person," which is why the university provides a "living-learning environment" based on individual needs and an overriding concern for community. This community is based on "a firm belief in human dignity and compassion unlimited by cultural, racial, sexual, religious or economic barriers, and directed toward an engagement with the central issues of our time."

In this class, we will discuss inequality directly. In many cases, you will be asked to apply quantitative reasoning skills to these subjects, which can be difficult because there is always the potential for the available data to complicate or contradict something you may feel very passionate about. In these cases, you should aspire to adopt an attitude of critical skepticism, i.e., wary of claims that are not supported by evidence but potentially willing to be persuaded by evidence if you find it compelling, and willing to give that evidence a fair hearing.

How we treat one another will be a cornerstone of these conversations. Denison's "Guiding Principles" speak of "a community in which individuals respect one another and their environment." Further, "each member of the community possesses a full range of rights and responsibilities. Foremost among these is a commitment to treat each other and the environment with mutual respect, tolerance, and civility." It's easy to treat someone this way when you like them and agree with their ideas, but the real challenge is treating those who differ from us with the same compassion and respect. However, I consider disruptive, deceitful, or hateful behavior to be breaches of these responsibilities. Bullying, trolling, hate speech, and harassment of any kind will not be tolerated.

Discrimination, Sexual Misconduct, and Sexual Assault

Essays, journals, and other coursework submitted for this class are generally considered confidential pursuant to the University’s student record policies. However, students should be aware that University employees are required by University policy to report allegations of discrimination based on sex, gender, gender identity, gender expression, sexual orientation or pregnancy to the Title IX Coordinator or a Deputy Title IX Coordinator. This includes reporting all incidents of sexual misconduct, sexual assault and suspected abuse/neglect of a minor. Further, employees are to report these incidents that occur on campus and/or that involve students at Denison University whenever the employee becomes aware of a possible incident in the course of their employment, including via coursework or advising conversations. There are others on campus to whom you may speak in confidence, including clergy and medical staff and counselors at the Wellness Center. More information on Title IX and the University’s Policy prohibiting sex discrimination, including sexual harassment, sexual misconduct, stalking and retaliation, including support resources, how to report, and prevention and education efforts, can be found at: https://denison.edu/campus/title-ix.


Assignments

Algorithm Presentation (5% of grade)

This assignment has two purposes: to practice public speaking skills and to cover a broader range of machine learning algorithms than our time constraints would otherwise allow. Students working in self-selected pairs will focus on a particular method, drawn from a list of options that I will provide. They will find and select a peer-reviewed, quantitative paper that applies this method and give a 15-minute class presentation. Your presentation will:

1. Explain the method (the high-level algorithm, the theory behind it, its strengths and weaknesses)
2. Discuss how/why the authors of the paper use the method (background, data collection, design of experiment, results)

Quizzes (15% of grade)

This course has intermittent quizzes on material from readings and lectures. Quizzes are designed to measure how well you are integrating the material. There will be five quizzes in total, each of which will take place on a Monday.

Midterm Assessment (10% of grade)

The midterm assessment will be a take-home, individual assignment cumulative up to the date of the assessment. It will be open-book and will focus on questions that test your synthesis of the course content rather than information recall or rote learning. It is due Wednesday, October 12.

Project-Based Assignments and Reports (Labs) (45% of grade)

Labs will be a mix of individual and team-based assignments. They are generally problem-focused and will require working with data, writing Python code to solve a problem or analyze a question, and explaining your work in the form of a written report. Each lab assignment will have written instructions, which will be shared via GitHub Classroom. Weeks on the calendar marked "mini lab" will not have a full lab assignment but will typically entail completing a worksheet or a short written reflection.

Final Assessment (15% of course grade)

As with the midterm, the final assessment will be a take-home, individual assignment cumulative through the entire course. It will be open-book and will focus on questions that test your synthesis of the course content rather than information recall or rote learning. It is due at the start of our scheduled exam block (9 a.m. Sunday, December 18, 2022). Note: Denison policy does not permit me to give extensions on this assignment, so any late submission will receive a 0 grade.


Weekly Calendar

Weekly Rhythm

Monday: Quiz day (all weekly readings should be done by this day), lecture, coding practice, and/or other activity
Wednesday: Student presentations, lecture, coding practice, and/or other activity
Friday: Lab day; turn in previous lab assignment by start of class

Week 1: Introducing Advanced Methods for Data Analytics

Monday, August 29, 2022

In Class: Student Introductions

Homework: Sign up for Github, Complete Course Survey

Wednesday, August 31, 2022

In Class: Discuss survey results

Homework: read Geron 111-142 (pdf on Canvas)

Recommended but Not Required: read Regression Analysis with Scikit-Learn (part 1 - Linear)

Friday, September 2, 2022

In Class: Lab 1: Linear Regression Revisited

Homework: Read Cohen et al. Regression Analysis, 151-192 (pdf on Canvas)

*Note: For all subsequent weeks, this calendar does not reflect daily due dates. Use the "Weekly Rhythm" table to align the week's material with day-to-day expectations.

Week 2: Interpretability
(Monday, September 05, 2022 - Friday, September 09, 2022)

This Week's Reading: Cohen et al. Regression Analysis, 151-192 (pdf on Canvas)

This Week's Lab: Continue Linear Regression Revisited Part I

Recommended but Not Required Readings: "Linear Regression" (https://www.statsmodels.org/dev/regression.html)

Week 3: NLP Week 1
(Monday, September 12, 2022 - Friday, September 16, 2022)

This Week's Reading: Analyzing Documents with TF-IDF (https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf)

This Week's Lab: TF-IDF (mini lab)

Recommended but Not Required Readings: Understanding and Using Common Similarity Measures for Text Analysis (https://programminghistorian.org/en/lessons/common-similarity-measures)

Notes and Reminders: No Quiz on Monday

Week 4: NLP Week 2
(Monday, September 19, 2022 - Friday, September 23, 2022)

This Week's Reading: Vajjala-Practical-NLP-119-159; Bruce et al., 237-248

This Week's Lab: Text Classification (using Logistic Regression, KNN)

Recommended but Not Required Readings: "Regression Analysis with Scikit-learn (part 2 - Logistic)" (https://programminghistorian.org/en/lessons/logistic-regression)

Week 5: NLP Week 3
(Monday, September 26, 2022 - Friday, September 30, 2022)

This Week's Reading: Bruce et al. 208-236

This Week's Lab: Entities and Phrases mini lab

Recommended but Not Required Readings: "Spacy 101" (https://spacy.io/usage/spacy-101)

Notes and Reminders: Quiz on Monday, 9/26

Week 6: NLP Week 4
(Monday, October 03, 2022 - Friday, October 07, 2022)

This Week's Reading: Bruce et al. 283-294; "Word Association Norms, Mutual Information, and Lexicography" (https://aclanthology.org/J90-1003/)

This Week's Lab: No lab assignment (we will discuss modeling word similarity)

Recommended but Not Required Readings: https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/

Week 7: NLP Week 5
(Monday, October 10, 2022 - Friday, October 14, 2022)

This Week's Reading: Vajjala-Practical-NLP-81-113

This Week's Lab: Word2Vec (mini lab)

Recommended but Not Required Readings: https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html

Notes and Reminders: Midterm due on Wednesday

Week 8: NLP Week 6
(Monday, October 17, 2022 - Friday, October 21, 2022)

This Week's Reading: "From Word Embeddings To Document Distances" (https://proceedings.mlr.press/v37/kusnerb15.pdf)

This Week's Lab: Word Mover's Distance

Recommended but Not Required Readings: https://radimrehurek.com/gensim/auto_examples/tutorials/run_wmd.html

Notes and Reminders: No class Monday

Week 9: Computer Vision Week 1
(Monday, October 24, 2022 - Friday, October 28, 2022)

This Week's Reading: Geron-Hands-On-ML-35-84

This Week's Lab: Basic Image Classification (mini lab)

Recommended but Not Required Readings: https://scikit-learn.org/stable/modules/cross_validation.html ; https://scikit-learn.org/stable/modules/grid_search.html

Notes and Reminders: Quiz on Monday, 10/24

Week 10: Computer Vision Week 2
(Monday, October 31, 2022 - Friday, November 04, 2022)

This Week's Reading: Elgendy-Vision-Systems-1-35

This Week's Lab: More Advanced Image Classification (due on Monday, 11/14)

Recommended but Not Required Readings: https://programminghistorian.org/en/lessons/computer-vision-deep-learning-pt1

Week 11: Computer Vision Week 3
(Monday, November 07, 2022 - Friday, November 11, 2022)

This Week's Reading: Elgendy-Vision-Systems-92-145

This Week's Lab: No additional lab assignment (images lab due Friday, 11/18)

Recommended but Not Required Readings: https://programminghistorian.org/en/lessons/computer-vision-deep-learning-pt2

Notes and Reminders: Take-home quiz handed out Monday, 11/7; due on Canvas Wednesday, 11/9

Week 12: Regression Redux
(Monday, November 14, 2022 - Friday, November 18, 2022)

This Week's Reading: "The effect of partisanship and political advertising on close family ties" (https://www.science.org/doi/full/10.1126/science.aaq1433)

This Week's Lab: Thanksgiving (mini lab) ... Due Friday, 12/2

Recommended but Not Required Readings: "Are politically diverse Thanksgiving dinners shorter than politically uniform ones?" (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0239988)

Week 13: Thanksgiving Break
(Monday, November 21, 2022 - Friday, November 25, 2022)

Notes and Reminders: No class

Week 14: Missing or Insufficient Data Week 1
(Monday, November 28, 2022 - Friday, December 02, 2022)

This Week's Reading: Monarch-HITL-1-48

This Week's Lab: Human in the Loop (HITL) Mini Lab

Notes and Reminders: Quiz on Monday, 11/28

Week 15: Missing or Insufficient Data Week 2
(Monday, December 05, 2022 - Friday, December 09, 2022)

This Week's Reading: Elgendy-Vision-Systems-240-282

This Week's Lab: Transfer Learning (no assignment)

Week 16: Farewells
(Monday, December 12, 2022 - Friday, December 16, 2022)

Notes and Reminders: Wrap-up (class on Monday only)