Juan D. Correa - Software Developer and Linux System Administrator
astropema@gmail.com

Personal Research Notebooks Collection

This web page contains over twenty technical studies I recently conducted across diverse domains, including deep learning, statistical inference, natural language processing, image classification, and reinforcement learning. These projects were developed as part of a broader effort to integrate practical machine learning applications with philosophical, educational, and real-world exploration of data.

Each notebook reflects a different line of inquiry—ranging from modeling financial churn and predicting unemployment rates to simulating lunar landers and performing audio digit recognition using MFCC spectrograms. All were run on local Linux hardware and documented carefully for learning, reproducibility, and future deployment.


“Lunar Lander is a genre of video games loosely based on the 1969 landing of the Apollo Lunar Module on the Moon. In Lunar Lander games, players control a spacecraft as it falls toward the surface of the Moon or other astronomical body, using thrusters to slow the ship's descent and control its horizontal motion to reach a safe landing area. Crashing into obstacles, hitting the surface at too high a velocity, or running out of fuel all result in failure. In some games in the genre, the ship's orientation must be adjusted as well as its horizontal and vertical velocities.” — Wikipedia
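Below is a minimal sketch of how such an environment can be exercised programmatically, assuming the Gymnasium implementation of LunarLander; the environment id, seed, and random policy are illustrative and not taken from the notebook, which presumably trains an agent rather than acting randomly.

    # Sketch: roll out a random policy in Gymnasium's LunarLander environment.
    # Requires the Box2D extra: pip install "gymnasium[box2d]"
    import gymnasium as gym

    env = gym.make("LunarLander-v2")
    obs, info = env.reset(seed=0)

    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()               # random thruster choice (0-3)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated

    print(f"Episode return with a random policy: {total_reward:.1f}")
    env.close()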



If the notebook doesn't display properly, you can open it in a new tab.



“An inverted pendulum is a pendulum that has its center of mass above its pivot point. It is unstable and falls over without additional help. It can be suspended stably in this inverted position by using a control system to monitor the angle of the pole and move the pivot point horizontally back under the center of mass when it starts to fall over, keeping it balanced. The inverted pendulum is a classic problem in dynamics and control theory and is used as a benchmark for testing control strategies. It is often implemented with the pivot point mounted on a cart that can move horizontally under control of an electronic servo system as shown in the photo; this is called a cart and pole apparatus.” — Wikipedia
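As a rough illustration of the control idea in the quote, the sketch below balances Gymnasium's CartPole environment with a naive rule that pushes the cart toward the side the pole is leaning; this hand-written controller is an assumption for illustration, not the approach used in the notebook.

    # Sketch: hand-written bang-bang controller for the cart-and-pole benchmark.
    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    steps = 0
    done = False
    while not done:
        pole_angle = obs[2]                  # radians; positive means leaning right
        action = 1 if pole_angle > 0 else 0  # 1 = push cart right, 0 = push left
        obs, reward, terminated, truncated, info = env.step(action)
        steps += 1
        done = terminated or truncated

    print(f"Pole stayed upright for {steps} steps")
    env.close()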



If the notebook doesn't display properly, you can open it in a new tab.



This project aims to build and evaluate a recommendation system using the Amazon product ratings dataset. We explore multiple models, including KNN and SVD, and evaluate their performance using metrics such as RMSE, precision, recall, and F1 score. Hyperparameter tuning and cross-validation are performed to optimize the models.

Objectives:
• Explore and preprocess the Amazon product ratings dataset.
• Implement and evaluate different recommendation models (KNN, SVD).
• Perform hyperparameter tuning and cross-validation to optimize the models.
• Compare the performance of different models and provide recommendations.
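A minimal sketch of the workflow in the objectives above, using the Surprise library; the file name, column names, and rating scale are assumptions, since the notebook's exact preprocessing is not reproduced here.

    # Sketch: compare SVD and item-based KNN on a ratings table with Surprise.
    import pandas as pd
    from surprise import Dataset, KNNBasic, Reader, SVD
    from surprise.model_selection import GridSearchCV, cross_validate

    # Hypothetical ratings file with user_id, product_id, and rating columns.
    ratings = pd.read_csv("amazon_ratings.csv")
    data = Dataset.load_from_df(
        ratings[["user_id", "product_id", "rating"]], Reader(rating_scale=(1, 5))
    )

    # Cross-validated comparison of the two model families.
    for algo in (SVD(), KNNBasic(sim_options={"user_based": False})):
        cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True)

    # Hyperparameter tuning for SVD.
    grid = GridSearchCV(SVD, {"n_factors": [50, 100], "reg_all": [0.02, 0.1]},
                        measures=["rmse"], cv=3)
    grid.fit(data)
    print(grid.best_params["rmse"], grid.best_score["rmse"])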


If the PDF doesn't display properly, you can open it in a new tab.



Yelp was founded in 2004 to help people find great local businesses. Today, its website and mobile application publish crowd-sourced reviews of local businesses, along with certain metadata that can help in customer decision-making.

The Yelp dataset is a large collection of user reviews, business metadata, check-ins, user social data, and tips across 10 cities in 4 countries. The original dataset is ~11GB, but in this case study, a smaller subset is used due to hardware limitations. Yelp uses automated systems to recommend the most helpful and reliable reviews to users.
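One practical way to work within those hardware limits is to stream the review file in chunks and keep only a sample; the sketch below assumes the standard JSON-lines layout of the Yelp academic dataset and an illustrative file name.

    # Sketch: sample Yelp reviews without loading the full ~11GB file at once.
    import pandas as pd

    chunks = pd.read_json("yelp_academic_dataset_review.json", lines=True,
                          chunksize=100_000)

    samples = []
    for chunk in chunks:
        samples.append(chunk.sample(frac=0.01, random_state=42))  # keep ~1% per chunk
        if sum(len(s) for s in samples) > 50_000:
            break

    reviews = pd.concat(samples, ignore_index=True)
    print(reviews[["stars", "useful", "text"]].head())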


If the notebook doesn't display properly, you can open it in a new tab.



Our goal is to track the location (and velocity) of a moving object, such as a ball, in 3-dimensional space. Gravity is allowed to act on the ball, and the initial position and velocities are assumed to be known.

We use noisy location estimates from a simulated sensor to estimate the true location and velocity of the object in 3D space.
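A minimal NumPy sketch of this setup is shown below, with gravity entering the prediction step as a known control input and a sensor that reports noisy positions; the time step and noise levels are illustrative, not the values used in the study.

    # Sketch: linear Kalman filter tracking 3D position and velocity under gravity.
    import numpy as np

    dt = 0.1
    g = np.array([0.0, 0.0, -9.81])

    F = np.eye(6)
    F[0:3, 3:6] = dt * np.eye(3)                     # position integrates velocity
    B = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])   # gravity as control input
    H = np.hstack([np.eye(3), np.zeros((3, 3))])               # sensor sees position only

    Q = 1e-3 * np.eye(6)                             # process noise (illustrative)
    R = 0.5 * np.eye(3)                              # measurement noise (illustrative)

    x = np.array([0.0, 0.0, 100.0, 5.0, 0.0, 0.0])   # known initial position and velocity
    P = np.eye(6)

    def kalman_step(x, P, z):
        # Predict under the known dynamics, then correct with the noisy measurement z.
        x = F @ x + B @ g
        P = F @ P @ F.T + Q
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        return x + K @ y, (np.eye(6) - K @ H) @ P

    x, P = kalman_step(x, P, np.array([0.6, 0.1, 99.9]))   # one simulated measurement
    print("estimated position:", x[:3], "estimated velocity:", x[3:])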


If the PDF doesn't display properly, you can open it in a new tab.



Our goal is to track the location and velocity of a moving object, such as a car, in 2-dimensional space. The only information available is the initial location and velocity, along with a series of noisy velocity measurements as the object moves.

The key assumption is that the true velocity of the object remains constant — although the constant value itself is unknown.
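Because the velocity is measured directly and assumed constant, a simple estimator is a running average of the noisy readings, with position obtained by dead reckoning from the known starting point. The sketch below uses simulated data with an illustrative noise level.

    # Sketch: estimate a constant but unknown velocity from noisy measurements,
    # then propagate the known initial position with the averaged estimate.
    import numpy as np

    rng = np.random.default_rng(0)
    true_v = np.array([12.0, 3.0])                        # m/s, unknown to the estimator
    dt, n = 1.0, 50
    meas_v = true_v + rng.normal(scale=2.0, size=(n, 2))  # noisy velocity sensor

    v_est = meas_v.cumsum(axis=0) / np.arange(1, n + 1)[:, None]   # running mean
    pos0 = np.array([0.0, 0.0])                                    # known start point
    pos_est = pos0 + v_est[-1] * n * dt                            # dead reckoning

    print("velocity estimate:", v_est[-1], "position estimate after 50 s:", pos_est)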


If the PDF doesn't display properly, you can open it in a new tab.



The Hidden Markov Model (HMM) is a probabilistic model for sequential data generated by a process whose internal state cannot be observed directly. It assumes the hidden state evolves as a Markov chain, and that each observation is drawn from a probability distribution determined by the current hidden state.

The goal of HMM inference is to recover information about that hidden Markov chain, its state sequence and transition structure, from the observations alone.

Specifically, given a hidden state process X and an observation process Y, the model assumes that the distribution of Y at each time step depends only on the current value of X, and not on earlier states or earlier observations.
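A compact sketch of fitting such a model with the hmmlearn library is shown below; the two-state setup and Gaussian emissions are assumptions for illustration rather than the notebook's actual configuration.

    # Sketch: fit a 2-state Gaussian HMM and decode the most likely hidden states.
    import numpy as np
    from hmmlearn import hmm

    rng = np.random.default_rng(0)
    # Simulated observations: two regimes with different means.
    obs = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)]).reshape(-1, 1)

    model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
    model.fit(obs)                          # Baum-Welch (EM) parameter estimation

    states = model.predict(obs)             # Viterbi decoding of the state sequence
    print("learned state means:", model.means_.ravel())
    print("first 10 decoded states:", states[:10])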


If the notebook doesn't display properly, you can open it in a new tab.




Embedded HTML Preview:

If the HTML doesn't display properly, you can open it in a new tab.



Thera Bank has recently experienced a sharp decline in its number of credit card users. Credit cards generate significant revenue for banks through various fees: annual, transfer, cash advance, late payment, and international transaction fees.

A decrease in users poses a financial risk. To address this, the bank aims to analyze customer data to identify which users are likely to churn, and understand the reasons behind it.

As a Data Scientist at Thera Bank, your job is to develop a classification model to predict user churn and help improve customer retention strategies.
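A minimal scikit-learn sketch of such a model is shown below; the file name, target column, and the choice of logistic regression (whose coefficients help with the "understand the reasons" part of the brief) are illustrative assumptions.

    # Sketch: churn classifier with preprocessing and model in one pipeline.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("thera_bank_churn.csv")          # hypothetical file name
    y = df["attrition_flag"]                          # hypothetical target column
    X = df.drop(columns=["attrition_flag"])

    cat_cols = X.select_dtypes(include="object").columns
    num_cols = X.select_dtypes(exclude="object").columns

    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])
    clf = Pipeline([("prep", pre), ("model", LogisticRegression(max_iter=1000))])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                              random_state=42)
    clf.fit(X_tr, y_tr)
    print(classification_report(y_te, clf.predict(X_te)))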


If the PDF doesn't display properly, you can open it in a new tab.



Over the past decades, major advances have been made in the field of audio recognition, especially using Deep Learning techniques. A common approach involves converting audio signals into spectrograms for visual processing and recognition.

Raw audio is typically represented as a waveform, which can be extremely data-heavy, even for short clips. This makes storage and computation expensive, especially at high sampling rates.

A more compact alternative is to use spectrogram-style features, specifically MFCCs (Mel-Frequency Cepstral Coefficients). These are derived by taking the Short-Time Fourier Transform of the signal, mapping it onto the mel scale, and applying a discrete cosine transform, which yields a 2D representation: time on the X-axis and cepstral coefficients on the Y-axis. This allows the audio to be treated as image-like data for model training.
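The sketch below shows one common way to compute such MFCC features with librosa; the file name, sampling rate, and fixed output width are illustrative choices.

    # Sketch: turn a short audio clip into a fixed-size MFCC array for a classifier.
    import librosa
    import numpy as np

    y, sr = librosa.load("digit_sample.wav", sr=8000)    # hypothetical spoken-digit clip
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

    # Normalize, then pad or crop to a fixed width so every clip has the same shape.
    mfcc = (mfcc - mfcc.mean()) / (mfcc.std() + 1e-8)
    fixed = np.zeros((13, 40))
    width = min(40, mfcc.shape[1])
    fixed[:, :width] = mfcc[:, :width]

    print("feature shape fed to the model:", fixed.shape)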


If the PDF doesn't display properly, you can open it in a new tab.



Image classification has become more accessible with the rise of deep learning, large datasets, and powerful compute resources. Convolutional Neural Networks (CNNs) are among the most popular and effective techniques used for image classification tasks.

Clicks is a stock photography company where photographers worldwide upload food-related images daily. Due to the high upload volume, manually labeling these images is time-consuming and inefficient, making it a perfect use case for automation through deep learning.
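A minimal Keras sketch of the kind of CNN this task calls for; the input size, class count, and layer sizes are assumptions rather than the notebook's actual architecture.

    # Sketch: small convolutional classifier for food images.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    num_classes = 10                         # illustrative number of food categories

    model = models.Sequential([
        layers.Input(shape=(128, 128, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()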


If the PDF doesn't display properly, you can open it in a new tab.



This study builds a binary image classifier to distinguish Pituitary Tumor MRI scans from "No Tumor" scans. The dataset, sourced from Kaggle, originally contains 3,264 images across four tumor classes, but this project focuses only on two: Pituitary Tumor and No Tumor.

A subset of 1,000 images is used — 830 for training and 170 for testing. The training set includes 395 No Tumor and 435 Pituitary Tumor scans. Data augmentation is applied to improve model generalization and reduce overfitting.

The project also demonstrates how performance can be enhanced by importing a pre-trained model and applying transfer learning techniques.
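A hedged sketch of that transfer-learning step with Keras is shown below; MobileNetV2 as the pre-trained base and the specific augmentation settings are assumptions for illustration.

    # Sketch: binary tumor / no-tumor classifier on a frozen ImageNet base,
    # with simple augmentation applied to the training images.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights="imagenet")
    base.trainable = False                           # keep pre-trained features frozen

    augment = models.Sequential([
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ])

    inputs = layers.Input(shape=(224, 224, 3))
    x = augment(inputs)
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    x = base(x, training=False)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)   # binary output

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.summary()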


If the PDF doesn't display properly, you can open it in a new tab.



Agriculture remains one of the most labor-intensive industries, particularly when it comes to identifying and monitoring plant growth. Despite advancements in agricultural technology, manual recognition of different plants and weeds continues to demand significant effort.

This project explores how Artificial Intelligence and Deep Learning can automate the classification of plant seedlings, reducing the need for manual labor while improving accuracy and efficiency.

The goal is to enable faster identification of plant types, leading to improved crop yields, smarter resource allocation, and more sustainable agricultural practices over time.


If the PDF doesn't display properly, you can open it in a new tab.



Recognizing objects in natural scenes is a key challenge in deep learning. One prominent example is digit recognition in images captured from street-level environments — an essential capability for real-world AI applications.

The SVHN dataset contains over 600,000 labeled digits extracted from real-world photographs of street address numbers. It is widely used for training and evaluating convolutional neural networks in image recognition tasks.

Notably, Google has leveraged SVHN-trained networks to enhance map quality by transcribing address numbers from visual data, improving geolocation precision for buildings.


If the PDF doesn't display properly, you can open it in a new tab.



The footwear market is highly competitive, with brands like Adidas and Nike leading the space. As these giants battle for market dominance, analyzing their product segmentation can reveal valuable insights.

As a Data Scientist at a market research firm, your task is to analyze shoe product data across both men’s and women’s categories. The goal is to group products based on shared characteristics, helping to highlight key similarities and differences in their offerings.

These insights can inform strategic decisions around branding, product positioning, and market targeting.


If the PDF doesn't display properly, you can open it in a new tab.



Active investing in the asset management industry aims to beat the stock market’s average returns. Portfolio managers typically track a particular index and attempt to outperform it by constructing superior portfolios.

Portfolio construction involves selecting stocks with a higher probability of delivering better returns than the tracking index, such as the S&P 500.

In this project, we use Network Analysis to select a basket of stocks and create two portfolios. We simulate portfolio performance over a year and compare it to the S&P 500 index as a benchmark.
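A rough NetworkX sketch of the idea; the price file, the correlation threshold, and the use of degree centrality to favor weakly connected stocks are assumptions for illustration.

    # Sketch: build a correlation graph of stocks and shortlist weakly connected names.
    import networkx as nx
    import pandas as pd

    # Hypothetical CSV of daily closing prices, one column per ticker.
    prices = pd.read_csv("sp500_prices.csv", index_col=0, parse_dates=True)
    returns = prices.pct_change().dropna()
    corr = returns.corr()

    G = nx.Graph()
    G.add_nodes_from(corr.columns)
    for i, a in enumerate(corr.columns):
        for b in corr.columns[i + 1:]:
            if corr.loc[a, b] > 0.5:                 # connect strongly correlated pairs
                G.add_edge(a, b, weight=corr.loc[a, b])

    centrality = nx.degree_centrality(G)
    # Stocks with low centrality are less entangled with the rest of the market.
    basket = sorted(centrality, key=centrality.get)[:20]
    print("candidate basket:", basket)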


If the PDF doesn't display properly, you can open it in a new tab.



Problem Statement:

80% of all the visitors to Lavista Museum end up buying souvenirs from the souvenir shop at the Museum. On the coming Sunday, if a random sample of 10 visitors is picked:

1. Find the probability that every visitor will end up buying from the souvenir shop.
2. Find the probability that a maximum of 7 visitors will buy souvenirs from the souvenir shop.
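Both quantities follow directly from a Binomial(n = 10, p = 0.8) model, as the quick check below shows.

    # The number of buyers among 10 visitors is Binomial(n=10, p=0.8).
    from scipy.stats import binom

    n, p = 10, 0.8
    p_all_buy = binom.pmf(10, n, p)      # every visitor buys: 0.8**10 ≈ 0.107
    p_at_most_7 = binom.cdf(7, n, p)     # at most 7 visitors buy ≈ 0.322

    print(f"P(all 10 buy)    = {p_all_buy:.4f}")
    print(f"P(at most 7 buy) = {p_at_most_7:.4f}")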


If the PDF doesn't display properly, you can open it in a new tab.



In this problem, we study a time-varying criminal network that is repeatedly disrupted by law enforcement. The data for this study originates from the CAVIAR.zip archive.

Background:
The CAVIAR project documents a two-year investigation from 1994 to 1996. The operation involved cooperation between the Montréal Police and the Royal Canadian Mounted Police. Several individuals were tracked and arrested as part of the study, providing insights into the behavior and structure of organized criminal networks under pressure.


If the notebook doesn't display properly, you can open it in a new tab.



This project explores the fundamentals of neural networks through simple and complex function approximations using both TensorFlow and PyTorch.

• A TensorFlow-based linear model was used to learn a quadratic function.
• A PyTorch-based feedforward neural network tackled more complex non-linear relationships.

These examples demonstrated core neural network concepts such as:
  • Forward propagation: Passing input through layers to compute outputs.
  • Activation functions: Adding non-linearity to capture complex patterns.
  • Weight and bias analysis: Observing how the model learns internal representations.
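A condensed sketch of the PyTorch side of the concepts listed above; the target function, layer widths, and training settings are illustrative rather than the notebook's exact choices.

    # Sketch: small feedforward network learning a non-linear 1D function.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.linspace(-3, 3, 200).unsqueeze(1)
    y = torch.sin(x) + 0.1 * torch.randn_like(x)      # illustrative noisy target

    model = nn.Sequential(
        nn.Linear(1, 32), nn.ReLU(),                  # forward propagation through
        nn.Linear(32, 32), nn.ReLU(),                 # non-linear hidden layers
        nn.Linear(32, 1),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()

    for epoch in range(500):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    print("final training loss:", loss.item())
    print("first-layer weights:", model[0].weight.shape)   # inspect learned parameters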


If the notebook doesn't display properly, you can open it in a new tab.



This project analyzes global meteorite landings using clustering and regression techniques.

Geographic Spread:
Meteorites are widely distributed across temperate and equatorial regions, with limited presence in polar and oceanic zones due to accessibility limitations.

Summary Statistics:
The average natural-log-transformed mass is 3.40 (about 30 grams, since e^3.40 ≈ 30). Cluster 1 contains smaller meteorites, while Cluster 2 includes the largest and primarily iron-based types.

Analysis Steps:
The dataset was cleaned, and mass values were transformed using a logarithmic scale. Regression models (linear, random forest, gradient boosting) were applied to predict mass. Latitude and year emerged as the most important features.
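A hedged sketch of that modeling step; the file name and column names are assumptions about a cleaned version of the meteorite-landings table.

    # Sketch: predict natural-log mass of meteorites from location and year.
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("meteorite_landings.csv")        # hypothetical cleaned file
    df = df.dropna(subset=["mass_g", "reclat", "reclong", "year"])
    df = df[df["mass_g"] > 0]
    df["log_mass"] = np.log(df["mass_g"])             # natural-log transform

    X = df[["reclat", "reclong", "year"]]
    y = df["log_mass"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, y_tr)
    print("R^2:", r2_score(y_te, model.predict(X_te)))
    print(dict(zip(X.columns, model.feature_importances_.round(3))))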

Clustering:
K-Means clustering revealed four distinct groups, each with characteristic mass and geographic profiles. DBSCAN identified one high-density cluster and several outliers. Cluster 2 contains large iron-based meteorites; Cluster 1 shows broader geographic diversity with smaller mass.

Visualizations:
Scatter plots, heatmaps, and boxplots were used to explore geographic distribution, mass, and classification types.

Key Takeaways:
Meteorite discoveries are biased toward land areas, and larger meteorites tend to be geographically and compositionally distinct. The models achieved moderate performance (R² ~ 0.6). Including additional features like environmental data may improve predictions.

Future Work:
Improvements could include tuning clustering parameters, expanding temporal trend analysis, and integrating external datasets such as meteor shower timelines or terrain data.


If the notebook doesn't display properly, you can open it in a new tab.



Hospital management is a vital area that gained a lot of attention during the COVID-19 pandemic. Inefficient distribution of resources like beds and ventilators can lead to serious complications. However, this can be mitigated by predicting the Length of Stay (LOS) of a patient before admission. Once this is determined, hospitals can better plan treatment, resource use, and staffing to reduce LOS and improve recovery rates.

HealthPlus Hospital has been facing losses of both revenue and lives due to inefficient management. The absence of a reliable system to allocate beds, equipment, and staff has worsened the issue.

In this first study, we begin exploring how LOS prediction can enable smarter treatment pipelines and optimize hospital infrastructure. We introduce the problem scope, data structure, and prepare initial exploratory analysis to understand the underlying variables.


If the notebook doesn't display properly, you can open it in a new tab.


In Part 2, we build on the earlier foundation by engineering relevant features and applying classification algorithms to predict Length of Stay (LOS) categories. We test a variety of models including decision trees, XGBoost, and ensemble learning approaches.

Cross-validation and metric evaluation (precision, recall, F1 score) guide the selection of optimal models. Through this, we begin to expose hidden patterns in patient admission records that strongly correlate with predicted LOS durations.
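A brief sketch of that comparison; the synthetic data below is only a placeholder for the engineered patient features and LOS-category labels.

    # Sketch: compare candidate classifiers for LOS categories by cross-validated F1.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # Placeholder features/labels standing in for the engineered hospital data.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                               n_classes=4, n_clusters_per_class=1, random_state=42)

    candidates = {
        "decision_tree": DecisionTreeClassifier(max_depth=8, random_state=42),
        "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
        "xgboost": XGBClassifier(n_estimators=300, learning_rate=0.1,
                                 eval_metric="mlogloss"),
    }

    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="f1_weighted")
        print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")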

This analysis provides insight into how data-driven models can help streamline hospital operations and forecast patient resource demands.


If the notebook doesn't display properly, you can open it in a new tab.


The final installment, Part 3, focuses on deploying the predictive LOS model and visualizing its performance in simulated hospital scenarios. We explore interpretability techniques, such as feature importance plots and SHAP values, to explain how predictions are formed.
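A sketch of the SHAP step on a fitted tree ensemble; the model and data here are placeholders standing in for the trained LOS classifier.

    # Sketch: explain a tree-ensemble's predictions with SHAP values.
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Placeholder data and model in place of the real pipeline from Part 2.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:100])   # per-sample, per-feature contributions

    # Global view: which features push predictions up or down the most.
    shap.summary_plot(shap_values, X[:100])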

Our analysis demonstrates how such a system could help HealthPlus Hospital proactively manage patient flow, staff workload, and resource availability — ultimately improving efficiency and care quality.

By concluding this three-part study, we offer HealthPlus a robust, reproducible data science pipeline for LOS prediction, supported by transparent results and actionable insight.


If the notebook doesn't display properly, you can open it in a new tab.



A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use forecasting to estimate weekly, monthly, quarterly, and annual sales totals.

Accurate forecasting adds value across the organization — helping departments coordinate future planning, procurement, and budgeting. It guides decision-making around regional sales strategies and enables supply chains to align with expected demand.

This study explores historical sales data from a karting product company to build a forecasting model. Key goals include visualizing sales patterns, identifying influential drivers, and training models capable of predicting future performance by region or category.

The end result supports risk reduction, territory coverage planning, and the development of benchmarks to monitor trends and future deviations.
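One possible baseline for such a forecaster is sketched below, assuming monthly order data with order_date and sales columns; the Holt-Winters model from statsmodels is an illustrative choice, not necessarily the one used in the notebook.

    # Sketch: monthly sales baseline forecast with Holt-Winters exponential smoothing.
    import pandas as pd
    from statsmodels.tsa.holtwinters import ExponentialSmoothing

    orders = pd.read_csv("karting_sales.csv", parse_dates=["order_date"])  # hypothetical
    monthly = orders.set_index("order_date")["sales"].resample("MS").sum()

    model = ExponentialSmoothing(monthly, trend="add", seasonal="add",
                                 seasonal_periods=12).fit()
    forecast = model.forecast(6)                    # next six months
    print(forecast.round(1))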


If the notebook doesn't display properly, you can open it in a new tab.



There is a huge demand for used cars in the Indian market today. As new car sales have slowed in recent years, the pre-owned car market has grown steadily, now surpassing the new car market in volume.

In 2018–19, approximately 4 million second-hand cars were sold, compared to 3.6 million new cars. This trend suggests a shift in demand toward pre-owned vehicles, including among owners who replace their existing cars with used ones rather than purchasing new ones.

Cars4U is a budding tech start-up seeking to establish itself in this growing market. Unlike the deterministic pricing of new vehicles (regulated by OEMs), used car prices vary dramatically due to fluctuating supply, demand, and vehicle condition.

This project explores data-driven pricing strategies for used vehicles, helping Cars4U gain a competitive advantage in a volatile and dynamic marketplace.


If the notebook doesn't display properly, you can open it in a new tab.



ExtraaLearn is an early-stage startup offering programs on cutting-edge technologies to help students and professionals upskill or reskill.

With a high volume of leads generated daily, a key challenge is identifying which leads are most likely to convert into paying customers. Efficient resource allocation depends on understanding this conversion likelihood.

As a data scientist at ExtraaLearn, your tasks are:
  • Analyze and model lead conversion likelihood using machine learning techniques.
  • Identify key features and patterns that influence conversion.
  • Create a data-driven profile of leads most likely to convert.
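One way the tasks above might be sketched in code; the file name, target column, and random-forest choice are assumptions, with feature importances addressing the "key features" objective.

    # Sketch: lead-conversion classifier plus a ranked list of influential features.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    leads = pd.read_csv("extraalearn_leads.csv")    # hypothetical file
    y = leads["status"]                             # hypothetical target: 1 = converted
    X = pd.get_dummies(leads.drop(columns=["status"]), drop_first=True)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                              random_state=1)
    model = RandomForestClassifier(n_estimators=300, random_state=1).fit(X_tr, y_tr)

    print("ROC AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))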


If the notebook doesn't display properly, you can open it in a new tab.



In this case study, we aim to construct a linear regression model that explains the relationship between a car's mileage (measured in mpg) and its various attributes such as engine size, weight, number of cylinders, and horsepower.

The objective is to gain insights into which features most significantly impact fuel efficiency and to build a predictive model to estimate mpg for future vehicles.
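A small statsmodels sketch of such a model; the column names follow the classic auto-mpg dataset and are an assumption here.

    # Sketch: OLS regression of mpg on a few vehicle attributes.
    import pandas as pd
    import statsmodels.api as sm

    cars = pd.read_csv("auto_mpg.csv")              # hypothetical cleaned file
    X = sm.add_constant(cars[["weight", "horsepower", "displacement", "cylinders"]])
    y = cars["mpg"]

    model = sm.OLS(y, X, missing="drop").fit()
    print(model.summary())                          # coefficients, p-values, R^2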


If the PDF doesn't display properly, you can open it in a new tab.



Latent Dirichlet Allocation (LDA) is a generative statistical model commonly used in natural language processing to uncover hidden thematic structures in a large corpus of documents.

The fundamental idea is that a topic can be viewed as a distribution over words, and each document is a distribution over topics. By learning the latent topic proportions, we can cluster and classify documents even without prior labeling.

This project explores the application of LDA using a real dataset of human-written text, focusing on scalable topic modeling, variational inference, and visualization of topic distributions.
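A minimal scikit-learn sketch of that pipeline; the tiny corpus, topic count, and vectorizer settings are illustrative only.

    # Sketch: fit LDA on a small corpus and print the top words per topic.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "the spacecraft fired its thrusters during descent",
        "the bank reported quarterly revenue from credit card fees",
        "the lander touched down softly on the lunar surface",
        "interest rates and loan fees drove the bank's profits",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {top}")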


If the PDF doesn't display properly, you can open it in a new tab.



This study analyzes a genomic sequence fragment from Caulobacter crescentus, a bacterium known for its highly structured cell cycle.

The input data is a 300KB DNA sequence using only four characters (A, C, G, and T), with no separators. While the sequence appears random, underlying structure exists. Statistical and machine learning tools are used to uncover it.

The approach includes:
  • Splitting the sequence into non-overlapping 300-character substrings.
  • Counting all 1-, 2-, 3-, and 4-mers (k-mers) within each substring.
  • Reducing dimensionality with PCA to visualize internal patterns.
  • Applying K-Means clustering to differentiate meaningful sequences (potential genes) from statistical noise.

This project highlights how unsupervised learning techniques like PCA and clustering can aid in deciphering unlabelled biological data.
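The sketch below mirrors those steps on a synthetic stand-in sequence; the substring length and k-mer size follow the description above, while the cluster count and the random input string are illustrative assumptions.

    # Sketch: k-mer counting, PCA, and K-Means on fixed-length DNA substrings.
    import random
    from itertools import product

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    random.seed(0)
    sequence = "".join(random.choice("ACGT") for _ in range(30_000))  # stand-in DNA

    window = 300
    chunks = [sequence[i:i + window]
              for i in range(0, len(sequence) - window + 1, window)]

    # Count overlapping 3-mers in each substring (1-, 2-, and 4-mers work the same way).
    kmers = ["".join(p) for p in product("ACGT", repeat=3)]

    def kmer_counts(s, k=3):
        counts = dict.fromkeys(kmers, 0)
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
        return [counts[m] for m in kmers]

    X = np.array([kmer_counts(c) for c in chunks])

    coords = PCA(n_components=2).fit_transform(X)     # 2D view of the 64-dim counts
    labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

    print("projected shape:", coords.shape, "cluster sizes:", np.bincount(labels))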


If the notebook doesn't display properly, you can open it in a new tab.



Understanding socio-economic indicators is essential for evaluating national development, policy impact, and global inequality. While GDP is often used as a proxy for economic health, it does not fully capture societal well-being.

This project explores a rich dataset containing multiple socio-economic attributes from countries worldwide. Through clustering and visualization techniques, the study aims to uncover natural groupings of nations based on development patterns, education, health, income distribution, and access to infrastructure.

These clusters can guide targeted development strategies and provide insight into global economic structures beyond GDP.


If the notebook doesn't display properly, you can open it in a new tab.



In this case study, we explore Principal Component Analysis (PCA) using an air pollution dataset that captures 13 months of data on major pollutants and meteorological variables in a metropolitan city.

PCA is applied to reduce the dimensionality of the dataset while preserving as much variability as possible, revealing latent patterns and underlying correlations between pollutant levels and environmental conditions.
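A short sketch of that step; the synthetic table below is only a placeholder for the pollutant and weather measurements.

    # Sketch: standardize the variables, fit PCA, and inspect explained variance.
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Placeholder data with correlated columns, standing in for the real dataset.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(400, 3))
    df = pd.DataFrame(np.hstack([base, base @ rng.normal(size=(3, 5))]),
                      columns=[f"var_{i}" for i in range(8)])

    scaled = StandardScaler().fit_transform(df)       # PCA is sensitive to scale
    pca = PCA().fit(scaled)

    cumvar = np.cumsum(pca.explained_variance_ratio_)
    print("components needed for 90% variance:", int(np.argmax(cumvar >= 0.9)) + 1)
    print("first component loadings:",
          dict(zip(df.columns, pca.components_[0].round(2))))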


If the notebook doesn't display properly, you can open it in a new tab.



Hypothesis testing is a statistical method used to determine if there's enough evidence to support a claim about a population parameter. It involves comparing observed data against a null hypothesis and an alternative hypothesis, drawing conclusions based on calculated test statistics such as p-values and confidence intervals.
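As a concrete illustration, here is a two-sample t-test with SciPy on simulated data; the 5% significance level is the usual convention, not a requirement.

    # Sketch: two-sample t-test comparing the means of two groups.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    group_a = rng.normal(loc=100, scale=15, size=50)
    group_b = rng.normal(loc=108, scale=15, size=50)

    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("Reject the null hypothesis of equal means at the 5% level.")
    else:
        print("Fail to reject the null hypothesis at the 5% level.")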


If the notebook doesn't display properly, you can open it in a new tab.



This notebook offers a brief but practical exploration of popular statistical libraries available in Python. It covers foundational functionality, key differences between libraries like `NumPy`, `SciPy`, `StatsModels`, and `Pingouin`, and illustrates how these tools support descriptive and inferential statistics in modern data workflows.
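For example, the same t-test can be run with SciPy and with Pingouin, which additionally reports effect size, confidence interval, and power in a single DataFrame; the data here are simulated for illustration.

    # Sketch: one test, two libraries, different levels of detail in the output.
    import numpy as np
    import pingouin as pg
    from scipy import stats

    rng = np.random.default_rng(1)
    x = rng.normal(5.0, 1.0, 40)
    y = rng.normal(5.5, 1.0, 40)

    print(stats.ttest_ind(x, y))    # SciPy: statistic and p-value
    print(pg.ttest(x, y))           # Pingouin: T, p-val, CI95%, cohen-d, power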


If the notebook doesn't display properly, you can open it in a new tab.