Graduate School Projects
This post describes the projects I completed during my graduate program at the University of Illinois at Urbana-Champaign, organized by course.
CS 425 Distributed Systems – Fall 2020
Language: C++
There were two projects in this course, and the second project built off the first project. Both projects were constructed within an emulator so that students did not have to worry about the nitty-gritty of network programming.
Project 1: Membership Protocol
In this project, I implemented a membership protocol for a distributed system. The protocol facilitated communication between different processes in the system so that every member of the system could maintain a list of all other members in the system.
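The course code was C++ inside the emulator and is not reproduced here. As a rough Python sketch of how a membership list can be maintained, assume each process periodically shares a table of heartbeat counters and drops members whose counter stops advancing. The heartbeat scheme, the `MembershipList` class, and the `timeout` value are all assumptions made for illustration, not details taken from the assignment.

```python
from dataclasses import dataclass, field


@dataclass
class MemberEntry:
    heartbeat: int      # latest heartbeat counter seen for this member
    last_updated: int   # local time at which that counter last increased


@dataclass
class MembershipList:
    timeout: int = 5    # hypothetical: ticks of silence before a member is dropped
    entries: dict = field(default_factory=dict)

    def merge(self, remote, now):
        """Merge a {member_id: heartbeat} table received from a peer."""
        for member_id, heartbeat in remote.items():
            known = self.entries.get(member_id)
            # Only a strictly newer heartbeat refreshes the local timestamp.
            if known is None or heartbeat > known.heartbeat:
                self.entries[member_id] = MemberEntry(heartbeat, now)

    def remove_failed(self, now):
        """Drop members whose heartbeat has not advanced within the timeout."""
        failed = [m for m, e in self.entries.items()
                  if now - e.last_updated > self.timeout]
        for member_id in failed:
            del self.entries[member_id]
        return failed


view = MembershipList(timeout=5)
view.merge({"a": 1, "b": 1}, now=0)
view.merge({"a": 4}, now=4)          # only "a" keeps sending heartbeats
print(view.remove_failed(now=7))     # "b" has been silent for 7 ticks
```

A fuller protocol would usually keep a removed member in a "suspected" state for a while so that stale tables from other peers cannot immediately re-add it; the sketch leaves that out.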
Project 2: Distributed Key-Value Store
Using the membership protocol from the previous project, I built a distributed key-value store with a ring design. The system assigned primary, secondary, and tertiary replicas for each key-value pair inserted into the database. The system was designed to be fault-tolerant. I hope I can extend this project over the summer to include multi-threading and real network protocols.
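To illustrate the ring design, here is a minimal Python sketch (not the C++ project code) of one common way to pick three replicas: hash nodes and keys onto the same ring and walk clockwise from the key. The hash function, `RING_SIZE`, and the function names are assumptions chosen for the example.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 16  # hypothetical ring size


def ring_position(name):
    """Hash a node address or key onto the ring."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for(key, nodes, count=3):
    """Return the `count` distinct nodes found clockwise from the key's position."""
    ring = sorted((ring_position(n), n) for n in nodes)
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_left(positions, ring_position(key))
    count = min(count, len(ring))
    # Wrap around the end of the sorted list to close the ring.
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


nodes = ["node-%d" % i for i in range(5)]
primary, secondary, tertiary = replicas_for("some-key", nodes)
print(primary, secondary, tertiary)
```

With this layout, when the primary fails and is removed from the member list, the former secondary and tertiary become the first two owners of the key, so only one new replica has to be populated.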
CS 498 Applied Machine Learning – Fall 2020
Language: Python (NumPy, SciPy, scikit-learn, PyTorch)
We implemented a different machine learning model in each of the 15 weeks of this course, so the following are just a selection.
Image Classification
Naturally, we classified the MNIST handwritten digits dataset. In this project, we compared classifiers on the untouched dataset, a bounding-box version of the data, and a stretched bounding-box version of the data. We used each version of the dataset to train a Naive Bayes classifier.
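The two preprocessing variants can be sketched in plain Python as follows. This is my reading of "bounding box" (crop to the rows and columns that contain ink) and "stretched" (rescale the crop to a fixed square); the threshold, the nearest-neighbour resampling, and the target size are assumptions, not the assignment's exact specification.

```python
def bounding_box(image, threshold=0):
    """Crop a 2D list of pixel intensities to the rows/columns containing ink."""
    rows = [r for r, row in enumerate(image) if any(p > threshold for p in row)]
    cols = [c for c in range(len(image[0]))
            if any(row[c] > threshold for row in image)]
    if not rows:
        # Blank image: nothing to crop, so return a copy unchanged.
        return [row[:] for row in image]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def stretch(image, size):
    """Resize to size x size using nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    return [[image[r * height // size][c * width // size] for c in range(size)]
            for r in range(size)]


digit = [
    [0, 0, 0, 0],
    [0, 5, 7, 0],
    [0, 0, 9, 0],
    [0, 0, 0, 0],
]
box = bounding_box(digit)
print(box)              # [[5, 7], [0, 9]]
print(stretch(box, 4))  # each pixel of the crop becomes a 2x2 block
```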
Neural Network with PyTorch
In this project, we built a Convolutional Neural Network to classify the CIFAR-10 Dataset. The assignment was based on PyTorch tutorials, and the goal was to tweak the example code to improve the accuracy.
My CNN had the same structure as the example, but I tweaked the kernel sizes and in/out channels of the layers via trial and error. I used Xavier normal initialization for the convolutional layers, CrossEntropyLoss as the criterion, and Adam as the optimizer.
EM Topic Model
I implemented the EM algorithm to cluster the NIPS dataset from the UCI Machine Learning repository. We clustered the dataset into 30 topics using 100 iterations of the algorithm.
High Dimension Classification
In this assignment, I built a classifier for an accelerometer dataset from the UCI Machine Learning repository. The goal was to classify the data into one of 14 activities. First, I used vector quantization and hierarchical k-means to cluster the data, and then I used a decision forest to classify the histograms generated by the clustering.
CS 484 Parallel Programming – Spring 2021
Language: C++ (OpenMP, MPI)
Project 1
In this project, I implemented matrix transpose with tiling to improve cache efficiency. I also implemented matrix multiplication with tiling and with the second matrix transposed before multiplication. The final part of the project involved comparing sequential and parallel versions of the same functions. I parallelized the functions with OpenMP.
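The loop structure behind tiling can be sketched in Python as below. The project itself was C++ with OpenMP; this version only shows how the iteration space is split into blocks, and the `tile` size is an arbitrary assumption. Python lists will not show the cache benefit that a contiguous C++ array would.

```python
def transpose_tiled(matrix, tile=32):
    """Transpose a rectangular matrix (list of lists) one tile at a time."""
    rows, cols = len(matrix), len(matrix[0])
    result = [[0] * rows for _ in range(cols)]
    for row_start in range(0, rows, tile):
        for col_start in range(0, cols, tile):
            # Work on one tile so reads and writes stay within a small block.
            for r in range(row_start, min(row_start + tile, rows)):
                for c in range(col_start, min(col_start + tile, cols)):
                    result[c][r] = matrix[r][c]
    return result


m = [[r * 5 + c for c in range(5)] for r in range(3)]
print(transpose_tiled(m, tile=2))
```

In a compiled language the two outer loops are also a natural place to distribute work across threads, since each tile writes to a disjoint part of the output.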
Project 2
The first part of the project had us examine the differences between different task scheduling specifications. The second part of the project involved specifying OpenMP pragmas to properly synchronize threads, which was necessary to implement the Gauss-Seidel algorithm for the Discrete Poisson Equation with ghost cells.
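For reference, a sequential Gauss-Seidel sweep for the discrete Poisson equation looks roughly like the Python sketch below. It assumes the convention ∇²u = f with the standard five-point stencil on a uniform grid of spacing `h`, and it treats the outermost ring of `u` as fixed boundary values (the role ghost cells play for a subdomain in the parallel version). The function names, tolerance, and sweep limit are illustrative assumptions.

```python
def gauss_seidel_sweep(u, f, h):
    """One in-place sweep over the interior; returns the largest change."""
    largest_change = 0.0
    for i in range(1, len(u) - 1):
        for j in range(1, len(u[0]) - 1):
            # Five-point stencil; already-updated neighbours are used immediately.
            updated = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1]
                              - h * h * f[i][j])
            largest_change = max(largest_change, abs(updated - u[i][j]))
            u[i][j] = updated
    return largest_change


def solve(u, f, h, tolerance=1e-8, max_sweeps=10000):
    """Sweep until the largest update falls below the tolerance."""
    for sweep in range(1, max_sweeps + 1):
        if gauss_seidel_sweep(u, f, h) < tolerance:
            return sweep
    return max_sweeps


n = 6
u = [[1.0 if i in (0, n - 1) or j in (0, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]
f = [[0.0] * n for _ in range(n)]
print(solve(u, f, h=1.0), u[2][2])
```

Because each update reads neighbours that were written earlier in the same sweep, the loop carries a dependency between iterations, which is why the threaded version needs explicit synchronization.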
CS 412 Intro to Data Mining – Spring 2021
Language: Python (Projects are language agnostic)
Pattern Mining
- Frequent Pattern Mining: We were given a dataset of categories for different businesses, and we were asked to find all frequent patterns with a minimum support of 0.01. I implemented the Apriori pattern mining algorithm to produce the correct result.
- Sequential Pattern Mining: We were given a dataset of Yelp reviews and asked to find all frequent consecutive sequential patterns with a minimum support of 0.01. I tried GSP and SPAN, but I ended up using PrefixSpan (which was first proposed by my professor!).
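As a compact illustration of the Apriori idea from the first item, the sketch below grows candidate itemsets one item at a time and prunes any candidate with an infrequent subset. It is a simplified version written for this post, and the category names are made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    size = 2
    while frequent:
        # Join step: union pairs of frequent (size-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == size}
        # Prune step: every (size-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, size - 1))}
        frequent = count(candidates)
        result.update(frequent)
        size += 1
    return {itemset: n / len(transactions) for itemset, n in result.items()}


data = [
    ["Bars", "Food"],
    ["Bars", "Food", "Nightlife"],
    ["Food"],
    ["Bars", "Nightlife"],
]
for itemset, support in sorted(apriori(data, 0.5).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```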
Clustering
- Basic Clustering: I implemented the KMeans clustering algorithm. I actually used KMeans++ to initialize the algorithm after some trial and error.
- Hierarchical Clustering: I implemented the Agglomerative clustering algorithm AGNES with different “link” functions. The different link functions would change the structure of each cluster.
- Cluster Validation: I implemented Jaccard and NMI cluster validation to understand the quality of the given cluster labels.
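The two validation measures from the last item can be written in a few lines of Python. This sketch assumes the pair-counting form of the Jaccard coefficient and normalizes mutual information by the geometric mean of the two entropies; other normalizations exist, and the assignment may have specified a different one.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def jaccard(truth, predicted):
    """Pair-counting Jaccard coefficient: TP / (TP + FN + FP)."""
    tp = fn = fp = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        tp += same_truth and same_pred
        fn += same_truth and not same_pred
        fp += same_pred and not same_truth
    return tp / (tp + fn + fp)


def nmi(truth, predicted):
    """Mutual information divided by sqrt(H(truth) * H(predicted))."""
    n = len(truth)
    joint = Counter(zip(truth, predicted))
    truth_counts, pred_counts = Counter(truth), Counter(predicted)
    mutual = sum(c / n * log(c * n / (truth_counts[t] * pred_counts[p]))
                 for (t, p), c in joint.items())

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    return mutual / sqrt(entropy(truth_counts) * entropy(pred_counts))


truth = [0, 0, 1, 1]
print(jaccard(truth, [0, 0, 0, 1]))  # 0.25
print(nmi(truth, [1, 1, 0, 0]))      # relabelled but identical clustering
```

Both functions divide by zero in degenerate cases (for example, when every point is in a single cluster), so real code would need to guard against that.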
Classification
- Decision Tree: I implemented a Decision Tree of depth 2 and used a train/test split to train the model and evaluate it on the held-out test data.
- Naive Bayes: I implemented a Naive Bayes Classifier by hand to run on some example data.
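A hand-rolled Naive Bayes classifier for categorical features fits in a short Python sketch like the one below. The add-one (Laplace) smoothing via `alpha` and the toy weather rows are assumptions for the example; they are not the data or settings from the assignment.

```python
from collections import Counter, defaultdict
from math import log


def train(rows, labels, alpha=1.0):
    """Collect class counts and per-feature value counts."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values
    for row, label in zip(rows, labels):
        for index, value in enumerate(row):
            value_counts[(label, index)][value] += 1
            values_seen[index].add(value)
    return class_counts, value_counts, values_seen, alpha


def predict(model, row):
    """Pick the class with the highest log prior plus log likelihoods."""
    class_counts, value_counts, values_seen, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = log(count / total)
        for index, value in enumerate(row):
            # Smoothed estimate of P(value | label) for this feature.
            numerator = value_counts[(label, index)][value] + alpha
            denominator = count + alpha * len(values_seen[index])
            score += log(numerator / denominator)
        scores[label] = score
    return max(scores, key=scores.get)


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(predict(model, ("sunny", "hot")))
```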
STAT 578 Advanced Bayesian Modeling – Spring 2021
Language: R (JAGS)
Project 1
The main part of the project involved modeling a dataset containing the lengths of Wikipedia articles. I implemented a mean-only normal model and a two-parameter normal model with flat noninformative priors, and I generated the report with R Markdown.
Project 2
Using JAGS, I implemented a hierarchical model for a dataset from an investigation into a possible link between a genetic trait and heart attack risk. The hierarchical model had essentially noninformative hyperpriors.
Project 3
In this assignment, I used model checking methods like the Gelman-Rubin statistic and autocorrelation plots to evaluate a Markov Chain Monte Carlo model for some 2016 US Presidential election polling data.
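For intuition, the basic form of the Gelman-Rubin statistic compares the variance within each chain to the variance between chain means. The Python sketch below computes that basic form for one parameter; the course work was done in R, and R packages may apply additional corrections, so treat this as an illustration rather than a reproduction of their output.

```python
from math import sqrt
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains of one parameter."""
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    within = mean([variance(chain) for chain in chains])  # W: average within-chain variance
    between = n * variance(chain_means)                   # B: between-chain variance
    pooled = (n - 1) / n * within + between / n           # pooled variance estimate
    return sqrt(pooled / within)


overlapping = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
separated = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]
print(gelman_rubin(overlapping))  # below 1: chains cover the same range
print(gelman_rubin(separated))    # far above 1: chains have not mixed
```

Values close to 1 suggest the chains are sampling from the same distribution, while values well above 1 indicate that they have not converged.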
Project 4
I used Markov Chain Monte Carlo to create a linear regression model for a Moore’s Law dataset. Then I checked the model for outliers with a T statistic based on the maximum standardized error.
Final Report
I used R Markdown and RJAGS to compile a report on an assigned dataset of arrests from San Francisco from 2015 to 2016. I built a logistic regression model that showed the chances of an individual being found with contraband varied by the race of the individual.
CS 513 Theory and Practice of Data Cleaning – Summer 2021
Language: Python, Datalog, SQL
Technologies: OpenRefine, SQLite
Final Project
I worked with 3 other students to clean the New York Public Library’s historical menu dataset. We used OpenRefine to manually transform and standardize the data. Then, we used SQLite to store our clean dataset and run integrity checks. We also demonstrated how our cleaning improved the quality of the data.
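An integrity check of the kind described above can be expressed as a query that returns the rows violating a constraint, so an empty result means the check passes. The Python sketch below uses the standard `sqlite3` module with a simplified, hypothetical two-table schema and made-up rows; the real dataset's tables and our actual checks are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified schema and sample rows for illustration only.
    CREATE TABLE dish (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE menu_item (id INTEGER PRIMARY KEY, dish_id INTEGER, price REAL);
    INSERT INTO dish VALUES (1, 'Consomme'), (2, 'Roast beef');
    INSERT INTO menu_item VALUES (10, 1, 0.25), (11, 2, -1.0), (12, 99, 0.40);
""")


def violations(connection, query):
    """Return the ids of rows that break the constraint expressed by `query`."""
    return [row[0] for row in connection.execute(query)]


# Referential integrity: every menu item should point at an existing dish.
orphans = violations(conn, """
    SELECT menu_item.id FROM menu_item
    LEFT JOIN dish ON dish.id = menu_item.dish_id
    WHERE dish.id IS NULL ORDER BY menu_item.id
""")

# Domain constraint: prices should never be negative.
negative = violations(conn, "SELECT id FROM menu_item WHERE price < 0 ORDER BY id")

print(orphans, negative)
```

Running the same queries before and after cleaning gives a simple count of violations that can be used to show how much the data quality improved.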