Association Pattern Rule Mining – Tai Duc Phan Portfolio

Welcome to my in-depth exploration of linear regression. While you’ve likely encountered this ubiquitous technique before, I’m dedicating a series to it for one reason.

My understanding of linear regression evolved from something trivial – just finding a line to fit data – to a fundamental tool after diving into econometrics. There’s surprising depth and meaning to uncover, making it a foundation for more advanced regression methods.

I hope you’ll join me on this journey to uncover the power and nuances of linear regression.

And just so you know, I will base my code, formulas and all based on the book “Introduction to Linear Regression Analysis” – Third Edition from the Wiley Series in Probability and Statistics. Here on my website, I only state my findings on the dataset using the techniques, for the code and further formulas notations, I will attach the link to my ipynb files or Kaggle.

Dataset Loading

For our study, we use the Life Expectancy (WHO) dataset on Kaggle: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who. Here is a lists of the datasets attributes:

Country: The country for which the data is being reported (object).
Year: The year the data is being reported for (int64).
Status: Whether the country is considered ‘Developed’ or ‘Developing’ (object).
Life expectancy: The average life expectancy in years (float64).
Adult Mortality: The rate of adult mortality per 1000 people (float64).
infant deaths: The number of infant deaths per 1000 people (int64).
Alcohol: The alcohol consumption rate per capita (float64).
percentage expenditure: The percentage of government expenditure on health as a percentage of GDP (float64).
Hepatitis B: The percentage of children aged 1 year old immunized against Hepatitis B (float64).
Measles: The number of reported Measles cases per 1000 people (int64).
BMI: The average Body Mass Index of the entire population (float64).
under-five deaths: The number of deaths of children under 5 years old per 1000 live births (int64).
Polio: The percentage of children aged 1 year old immunized against Polio (float64).
Total expenditure: Government expenditure on health as a percentage of total government expenditure (float64).
Diphtheria: The percentage of children aged 1 year old immunized against Diphtheria (float64).
HIV/AIDS: Deaths per 1000 live births due to HIV/AIDS (float64).
GDP: The Gross Domestic Product per capita (float64).
Population: The total population of the country (float64).
thinness 1-19 years: The percentage of the population aged 1-19 years who are considered thin (float64).
thinness 5-9 years: The percentage of the population aged 5-9 years who are considered thin (float64).
Income composition of resources: Human Development Index in terms of income composition of resources (float64).
Schooling: The average number of years of schooling received (float64).

However, we are not going to use the Country and Year variable, due to which we will drop these two variables:

Since the chapter is about simple linear regression, we will only perform a regression analysis to see the relationship between Life Expectancy and Schooling, i.e, whether there is any relationship between these two variables. First, we draw a scatterplot to see whether there is any correlation. The visualization suggests that there be some linear relationship. But we will dive further.

Least-squares estimation of the parameters

Using derivatives to find global extrema, we will see that, if we consider y to be the life expectancy and x to be the number of schooling years received, and consider the relationship between x and y to be a linear equation, I found out an approximation equation:

LifeExpectancy (in years) = 2.289*Schooling (in years) + 41.55

Afterwards, calculating the residual mean square, I obtained that the variance of my prediction for each y based on x would be approximately 36.43, which mean the standard error of regression would be around 6 years, which is not terribly bad considering we are accounting for number of schooling years received only. Of course, we are speaking under the assumption that the variance of prediction is similar for each of the value x in the data range. In order to be sure, it would be wise to conduct the next section.

Hypothesis testing on the slope and intercept

Using the t-tests, with the assumption that the estimated slope is a linear combination of the response values in the datasets, and the the errors are normally distributed, I rejected the null hypothesis, that is there is no linear relationship between Life expectancy and Number of schooling years. The same goes when I tried the F-test for ANOVA analysis instead.

Interval estimation in simple linear regression

In the end, using the assumption of normal error distributions, I see that the interval for my beta_1 value, which is the coefficient of Number of schooling years in the regression equation is (2.185456348228645, 2.394116692685223) with confidence level 95%, this means our quality of regression, under the conditions that the assumptions are true, is pretty nice.

Also, the variance of the estimation of response for each value of x with confidence level 95% is the interval

(34.06976268136923, 39.058074347577495)

And finally, the coefficient of determination of the regression model is roughly 0.5, which means approximately half of the variance in the response values.

Welp, that is all there is to it, when we are just basing on one variables. Here is the link to my files: https://www.kaggle.com/code/ticphan/simple-linear-regression

In this blog post, we’ll delve into the fascinating world of retail transaction analysis, where we uncover hidden patterns in customer buying behaviors. This project, available on Github, utilizes association rule mining techniques to extract valuable insights from real-world data. Link to my project on GitHub: https://github.com/Taiducphan1234/Retail-Transactions-Analysis-and-Association-Pattern-Rule-Mining

Unearthing Buying Habits

Our primary goal was to identify frequently purchased items together (frequent itemsets) and understand overall customer purchasing patterns. We achieved this by employing association rule mining, a powerful technique that discovers relationships between different items in a customer’s basket.

The Datasets: A Glimpse into Customer Baskets

For our analysis, we leveraged two datasets:

Synthetic Retail Transactions Dataset: This simulated dataset provided a controlled environment to experiment with various algorithms and compare their performance.

Figure 1: A view of the Retail Transactions Dataset

Groceries Purchase Dataset: This real-world dataset, containing actual customer purchases, allowed us to evaluate the effectiveness of the discovered rules in a more practical setting.

Figure 2: A view of the Groceries Purchase Dataset

The Algorithm Arsenal: Unveiling Frequent Itemsets

To crack the code of customer buying patterns, we employed a trio of association rule mining algorithms:

Apriori: A classic and well-established algorithm for frequent itemset mining.
FP-Growth: A more efficient alternative to Apriori, particularly for dealing with large datasets.
Enumeration Tree: A less commonly used but potentially efficient approach for frequent pattern discovery.

Data Visualization: Making the Invisible Visible

To effectively communicate the patterns and trends unearthed from the data, we utilized the following data visualization libraries:

Plotly: An interactive library for creating visually appealing and informative graphs.
Matplotlib: A fundamental library for generating a wide variety of plots.
Seaborn: A library built on top of Matplotlib that offers a high-level interface for creating statistical graphics.

A Deeper Look: Beyond the Surface

Before diving into the world of association rule mining, we conducted a thorough exploratory analysis of both datasets. This initial phase aimed to:

Gain a comprehensive understanding of the data’s characteristics.
Identify any potential issues or inconsistencies.
Extract insights into customer buying behavior through techniques like data summarization and visualizations.

A Treasure Trove of Findings

This project yielded a rich harvest of results, including:

Exploratory Data Analysis: We unraveled the underlying structure of the datasets, uncovering trends and patterns in customer purchases.
Frequent Itemset Mining: We identified frequently bought together items in both the synthetic and real-world datasets. (Consider including an image here showcasing some of the discovered frequent itemsets)
Evaluation of Discovered Rules: We assessed the effectiveness of the discovered patterns using metrics like lift and support on the groceries purchase dataset.
Algorithmic Performance Comparison: We compared the efficiency and effectiveness of the different association rule mining algorithms employed.

The Road Ahead: Further Exploration

This project lays a strong foundation for further exploration in the realm of retail transaction analysis and customer behavior. Here are some exciting possibilities for future endeavors:

Delving deeper into more sophisticated association rule mining algorithms for potentially even richer insights.
Implementing innovative techniques to visualize the discovered patterns in even greater detail, allowing for a more intuitive understanding of customer behavior.
Integrating the findings from this project into a real-world recommendation system, enabling retailers to personalize product suggestions for their customers.

By leveraging the power of association rule mining and data analysis, we can unlock valuable insights into customer behavior, ultimately leading to more informed business decisions and enhanced customer experiences.

Get Involved!

Head over to the project’s Github repository to explore the code, delve deeper into the analysis, and I am so delighted to receive any feedback and recommendations on renovations.

Contact me

+49 176 3430 8250
phantaiduc0807asd@gmail.com

← Back