The mission of the "Financial Inclusion in Africa" project is to leverage machine learning to predict and enhance financial inclusion across Kenya, Rwanda, Tanzania, and Uganda. By identifying key factors that influence bank account ownership and usage, this project aims to provide actionable insights for policymakers, financial institutions, and development organizations.
The goal is to support efforts in designing targeted interventions, policies, and strategies that can effectively address financial exclusion, promote economic empowerment, and foster sustainable development in these regions.
Country Representation
Raw data shows a strong association between owning a mobile phone and having a bank account.
STEPS
Data Preparation and Import: Load and clean the dataset.
Initial Exploration: Understand data structure and key characteristics.
Target Variable Analysis: Investigate bank account ownership distribution.
Country Representation: Analyze respondent distribution across countries.
UniqueID Analysis: Check whether values in the 'uniqueid' column are unique and whether the column carries any useful information.
Age Distribution: Study respondents' age distribution.
Household Size: Compare average household sizes by country.
Household Head Relationship: Analyze respondents' relationships with household heads.
Cell Phone Access: Examine cell phone ownership.
Education Level: Analyze education levels.
Job Type: Explore job type distribution.
Correlation Analysis: Identify feature correlations.
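The exploration steps above can be sketched with a handful of pandas calls. This is a minimal illustration on a made-up sample, not the project's actual notebook; every column name other than `uniqueid` and `year` is an assumption about the dataset.

```python
import pandas as pd

# Tiny made-up sample; column names are assumptions, not confirmed by the source
df = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Rwanda", "Tanzania", "Uganda", "Rwanda"],
    "bank_account": ["Yes", "No", "No", "No", "Yes", "No"],
    "cellphone_access": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
    "household_size": [3, 5, 4, 6, 2, 4],
    "age_of_respondent": [24, 41, 35, 58, 30, 19],
})

# Target variable analysis: share of respondents with and without an account
target_share = df["bank_account"].value_counts(normalize=True)

# Country representation: number of respondents per country
country_counts = df["country"].value_counts()

# Household size: average household size by country
avg_household = df.groupby("country")["household_size"].mean()

# Cell phone access vs. bank account: row-normalised cross-tabulation
phone_vs_account = pd.crosstab(
    df["cellphone_access"], df["bank_account"], normalize="index"
)

# Age distribution: summary statistics
age_summary = df["age_of_respondent"].describe()

print(target_share, country_counts, avg_household,
      phone_vs_account, age_summary, sep="\n\n")
```

The row-normalised cross-tabulation is one way to check the phone and bank account relationship noted earlier: each row shows the share of account holders among respondents with and without cell phone access.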
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def wrangle(file_path):
    """Custom function used to import and prepare the data for model building."""
    df = pd.read_csv(file_path)

    # Drop columns which are not useful in the model
    df = df.drop(columns=['year', 'uniqueid'])

    # Identify columns with categorical (string) values for encoding
    categorical_columns = [x for x in df.columns if isinstance(df[x].iloc[0], str)]
    print(categorical_columns)
    print(f"Our dataframe has {len(categorical_columns)} categorical columns")

    # Label-encode only the categorical columns, with a fresh encoder per column
    for column in categorical_columns:
        df[column] = LabelEncoder().fit_transform(df[column])

    return df
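Note that `wrangle` returns only the encoded frame, so the mapping from integer codes back to the original labels is not available afterwards. If that mapping is needed (for example, to label plots), one option is to keep a fitted encoder per column. The sketch below assumes a hypothetical `encoders` dictionary and made-up column names and values; it is not part of the original function.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample; the column names and values are illustrative assumptions
df = pd.DataFrame({
    "bank_account": ["Yes", "No", "No", "Yes"],
    "education_level": ["Primary", "Secondary", "Primary", "Tertiary"],
    "age_of_respondent": [24, 41, 35, 58],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for column in df.columns:
    # Only string-valued columns are encoded; numeric columns are left alone
    if isinstance(df[column].iloc[0], str):
        encoders[column] = LabelEncoder()
        df[column] = encoders[column].fit_transform(df[column])

# LabelEncoder assigns integer codes in sorted order of the class labels
print(list(encoders["bank_account"].classes_))

# Recover the original labels from the integer codes
decoded = encoders["education_level"].inverse_transform(df["education_level"])
print(list(decoded))
```

Because the codes follow sorted label order, they carry no meaningful ranking. Tree-based models such as random forests usually tolerate this, but it is worth keeping in mind when interpreting the encoded values.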
STEPS
Data Wrangling and Preprocessing: Prepare the data for modeling, including handling missing values and encoding categorical variables.
Feature Matrix and Target Vector Splitting: Separate the data into feature matrix and target vector.
Handling Imbalance: Address class imbalance through techniques like oversampling.
Baseline Model: Develop a baseline machine learning model for initial predictions.
Model Iteration and Hyperparameter Tuning: Improve the model using pipelines and hyperparameter tuning.
Performance Evaluation: Evaluate model performance using metrics such as accuracy, precision, recall, and F1 score.
Results Communication: Visualize the results with confusion matrices and feature importance plots.
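The modeling steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the project's actual code: the data is synthetic (`make_classification`), oversampling is done with `sklearn.utils.resample` (the original may have used a different tool), and the pipeline steps and hyperparameter grid are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample

# Synthetic stand-in for the encoded survey data (roughly 85% negative class)
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.85, 0.15], random_state=42,
)

# Feature matrix / target vector, split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Handling imbalance: randomly oversample the minority class (training set only)
minority = y_train == 1
X_min_up, y_min_up = resample(
    X_train[minority], y_train[minority],
    replace=True, n_samples=int((~minority).sum()), random_state=42,
)
X_train_over = np.vstack([X_train[~minority], X_min_up])
y_train_over = np.concatenate([y_train[~minority], y_min_up])

# Baseline: accuracy of always predicting the majority class on the test set
baseline_acc = max(np.mean(y_test == 0), np.mean(y_test == 1))

# Model iteration: pipeline plus a small grid search (placeholder grid)
pipeline = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))
param_grid = {
    "randomforestclassifier__n_estimators": [25, 50],
    "randomforestclassifier__max_depth": [5, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train_over, y_train_over)

# Performance evaluation on the untouched test set
y_pred = search.predict(X_test)
print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"test accuracy:     {accuracy_score(y_test, y_pred):.4f}")
print(f"precision:         {precision_score(y_test, y_pred, zero_division=0):.4f}")
print(f"recall:            {recall_score(y_test, y_pred, zero_division=0):.4f}")
print(f"F1 score:          {f1_score(y_test, y_pred, zero_division=0):.4f}")
print(confusion_matrix(y_test, y_pred))
```

One caveat with this sketch: oversampling before cross-validation lets duplicated minority rows appear in both the training and validation folds, which can make cross-validation scores optimistic. Resampling inside each fold avoids this, and the test set should stay untouched in either case.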
The random forest model achieved a test accuracy of 89.12%: on new, unseen data, it correctly predicted whether an individual has a bank account 89.12% of the time.
For context, the baseline accuracy, which is typically what a naive approach such as always predicting the majority class achieves, was 85.81%. The random forest therefore improves on the baseline by 3.31 percentage points. Because the classes are imbalanced, accuracy alone can flatter a model, so precision, recall, and F1 score should be read alongside it.
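To make the comparison concrete, the short calculation below works through the two figures quoted above. It also shows how a majority-class baseline is usually computed, using toy labels rather than the real data; whether the project's baseline was computed exactly this way is an assumption.

```python
from collections import Counter

# Figures quoted in the text above
baseline_acc = 0.8581
test_acc = 0.8912

# Absolute gain in percentage points
gain_pp = (test_acc - baseline_acc) * 100

# Share of the baseline's errors that the model removes
error_reduction = (test_acc - baseline_acc) / (1 - baseline_acc)

print(f"gain: {gain_pp:.2f} percentage points")              # gain: 3.31 percentage points
print(f"relative error reduction: {error_reduction:.1%}")    # relative error reduction: 23.3%

# Majority-class baseline on toy labels (not the real data)
labels = ["No"] * 6 + ["Yes"] * 1
majority_share = Counter(labels).most_common(1)[0][1] / len(labels)
print(f"toy baseline accuracy: {majority_share:.4f}")
```

Seen this way, the model removes roughly a quarter of the errors a majority-class predictor would make, which is a more informative framing than the raw accuracy gap when the baseline is already high.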
The feature importances highlight the significance of different features in predicting the likelihood of financial inclusion. This information is crucial for stakeholders seeking to understand the driving factors behind financial inclusion and make informed decisions.
Here is a brief explanation of the 3 most significant feature importances:
Education Level: The education level of individuals has the highest importance in predicting financial inclusion. This suggests that higher education may positively influence an individual's likelihood of having a bank account.
Job Type: The type of job held by individuals is the second most important factor. Certain job types may offer more financial stability or access to banking services, leading to higher rates of financial inclusion.
Age of Respondent: The age of respondents also plays a significant role. Younger individuals may be more likely to adopt banking services compared to older individuals.
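A ranking like the one above is typically read from the fitted forest's `feature_importances_` attribute. The sketch below uses random, made-up data in which the target depends only on `education_level`, so the resulting ordering illustrates the mechanics and says nothing about the real dataset; the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Made-up encoded features; names mirror the text, values are random
X = pd.DataFrame({
    "education_level": rng.integers(0, 6, n),
    "job_type": rng.integers(0, 10, n),
    "age_of_respondent": rng.integers(16, 90, n),
})

# Toy target driven only by education_level, so it should rank first here
y = (X["education_level"] >= 3).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, labelled by feature and sorted in descending order
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances sum to 1 and can favor features with many distinct values, such as age. scikit-learn's `permutation_importance` (in `sklearn.inspection`) is a common cross-check when the ranking informs decisions.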
Programming Language: Python
Libraries: Pandas, NumPy, Scikit-learn, Matplotlib, Seaborn
Data Processing: Data cleaning, preprocessing, and transformation
Modeling Techniques: RandomForestClassifier(), oversampling for imbalance handling, pipeline creation, hyperparameter tuning
Evaluation Metrics: Accuracy, precision, recall, F1 score, confusion matrix, feature importance