Nqobile Msibi

Data analyst/Data engineer

Big Data, Data, Technology, Cryptocurrency

Bio

I am an aspiring data professional coming out of IT with a passion for data-driven storytelling and problem solving. I am a critical thinker with a strong agenda.

I provide data solutions and valuable insights. I recently completed extensive training to enhance my data analytics skills. I am a Power BI specialist, creating reports, dashboards and models to identify key insights. I also have strong Python skills for data engineering and analytics projects.

With years of SQL experience, I can write complex queries to extract and transform relational data. With hands-on Apache Airflow experience, I can create and maintain scalable ETL/ELT pipelines in Python, and develop, operate and manage complex data operations.

Background

iStudent Academy

City & Guilds ICT Systems and Principles

Wizeline

Data Engineering Bootcamp

Certifications

Microsoft Certified: Azure Data Fundamentals - Feb 2023

Specialization: BI Foundations with SQL, ETL and Data Warehousing IBM (Coursera) - Aug 2023

Specialization: IBM Data Warehouse Engineer IBM (Coursera) - Nov 2023

Microsoft Certified: Endpoint Administrator Associate - Jan 2023

Microsoft Certified: Power BI Data Analyst - Mar 2023

Datacamp: Data Analyst Associate - Sep 2023

Specialization: Business Analytics University of Pennsylvania (Coursera) - Jan 2024

Expertise

Excel
Tableau
PowerBI
Google Cloud
Azure
Python

Airflow Pipeline

project description

The project's goal is to construct an end-to-end data pipeline that feeds information into a data warehouse's fact table (fact_movie_analytics). Two CSV files (movie_review.csv and log_reviews.csv) and a PostgreSQL table (user_purchase) are used as the data sources.

Data transformations and ETL processes will be scheduled and orchestrated using Apache Airflow. The data from the three sources will be processed and combined using Spark. After transformation, the data will be loaded into the data warehouse's fact and dimension tables.
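As a rough illustration of the orchestration order described above, here is a minimal plain-Python sketch. It is not the Airflow DAG from the repository and does not use Airflow or Spark; it only shows how task dependencies determine run order. All task names are hypothetical, and loading dimension tables before the fact table is an assumption.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
# Task names are hypothetical; they mirror the sources named in the description.
dependencies = {
    "extract_movie_review_csv": set(),
    "extract_log_reviews_csv": set(),
    "extract_user_purchase_table": set(),
    "transform_and_combine": {
        "extract_movie_review_csv",
        "extract_log_reviews_csv",
        "extract_user_purchase_table",
    },
    # Assumption: dimension tables are loaded before the fact table.
    "load_dimension_tables": {"transform_and_combine"},
    "load_fact_movie_analytics": {"load_dimension_tables"},
}

# static_order() yields tasks so that every task comes after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

for step, task in enumerate(run_order, start=1):
    print(step, task)
```

An orchestrator such as Airflow resolves the same kind of dependency graph, with scheduling, retries and monitoring on top.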

Business Problem

The business problem stems from the company's need to create user profiles and insights from the user data it gathers. Currently, this data resides in separate sources and formats across the organization. The business requires a scalable, automated data pipeline that can:

Ingest user data from disparate sources

Clean, transform and process the data

Structure it into a data warehouse

Enable analysts to easily perform SQL queries for insights
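The four steps above can be sketched end to end in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's code: it uses the standard library's sqlite3 as a stand-in for the data warehouse, and the CSV columns, the sample rows and the columns of fact_movie_analytics are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for movie_review.csv (real columns may differ).
MOVIE_REVIEW_CSV = """user_id,review
1, Great film
2,
1,Not bad
3,Too long
"""


def ingest(csv_text):
    """Step 1: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(rows):
    """Step 2: trim whitespace, cast ids, drop rows with an empty review."""
    cleaned = []
    for row in rows:
        review = (row["review"] or "").strip()
        if review:
            cleaned.append({"user_id": int(row["user_id"]), "review": review})
    return cleaned


def load(conn, rows):
    """Step 3: aggregate per user and store the result in a fact table."""
    conn.execute(
        "CREATE TABLE fact_movie_analytics ("
        "user_id INTEGER PRIMARY KEY, review_count INTEGER)"
    )
    counts = {}
    for row in rows:
        counts[row["user_id"]] = counts.get(row["user_id"], 0) + 1
    conn.executemany(
        "INSERT INTO fact_movie_analytics VALUES (?, ?)", sorted(counts.items())
    )


conn = sqlite3.connect(":memory:")
load(conn, clean(ingest(MOVIE_REVIEW_CSV)))

# Step 4: analysts query the warehouse with plain SQL.
top_reviewers = conn.execute(
    "SELECT user_id, review_count FROM fact_movie_analytics "
    "ORDER BY review_count DESC, user_id"
).fetchall()
print(top_reviewers)
```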

github link:

https://github.com/NqobileMsibi/ETL-with-airflow.git

Google Play Store Dataset

project description

This project uses data analysis tools to look for trends and patterns. I set out to use the Google Play Store dataset to find out why some applications do better than others, receiving good reviews and thousands of downloads. The project outlines trends that suggest ways to improve app performance on the Google Play Store.

The business problem

Using a dataset of 10,000 Google Play Store apps, analyse key metrics such as number of installations, ratings and categories to understand what makes an app highly successful, so that the number of installations and user engagement can be improved.

Task

Data import and cleaning:

Importing the data set into Excel
Identifying and correcting data errors, discrepancies and inconsistencies
Removing duplicate or irrelevant data
Restructuring and organizing the data for analysis

Data exploration:

Examining the data to identify trends, correlations and outliers
Using filters, formulas and pivot tables to extract relevant information from the data
Calculating key metrics and performing comparisons across different groups

Data visualization:

Using Tableau to transform the raw data into insightful dashboards and interactive visualizations
Creating charts and graphs to illustrate patterns and relationships within the data in an intuitive, easy-to-interpret format

Reporting and analysis:

Preparing a comprehensive report summarizing my analysis process, key findings and business implications
Identifying areas of opportunity or underperformance based on the data
Recommending strategic, data-driven solutions to improve processes, performance and outcomes

Actions

Data cleaning:

Removed 483 duplicate rows
Used the filter tool to remove the empty rows
Converted the price column to a numeric type, since it was stored as a string
Changed the size from M to MB and K to KB
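
The cleaning itself was done in Excel. For comparison, the same four steps could be sketched in Python roughly as follows. The sample rows and function names are hypothetical, and the input formats ("$4.99", "19M", "201k") are assumptions about how the raw values look.

```python
def clean_price(raw):
    """Convert a price string such as '$4.99' or '0' to a float."""
    return float(raw.replace("$", "").strip() or 0)


def normalize_size(raw):
    """Rewrite the size suffix: 'M' becomes 'MB' and 'k'/'K' becomes 'KB'."""
    value = raw.strip()
    if value.endswith("M"):
        return value[:-1] + "MB"
    if value.lower().endswith("k"):
        return value[:-1] + "KB"
    return value  # other values are left unchanged


def clean_rows(rows):
    """Drop empty and duplicate rows, then fix the Price and Size columns."""
    seen = set()
    cleaned = []
    for row in rows:
        if not any(value.strip() for value in row.values()):
            continue  # empty row
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        cleaned.append(
            {
                **row,
                "Price": clean_price(row["Price"]),
                "Size": normalize_size(row["Size"]),
            }
        )
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"App": "Example App A", "Price": "$4.99", "Size": "19M"},
    {"App": "Example App A", "Price": "$4.99", "Size": "19M"},
    {"App": "", "Price": "", "Size": ""},
    {"App": "Example App B", "Price": "0", "Size": "201k"},
]

cleaned_rows = clean_rows(sample_rows)
for row in cleaned_rows:
    print(row)
```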

Data exploration:

Using pivot tables, found a lot of NaN values in the ratings column, and both "Varies with device" entries and null values in the size column
Grouped the ratings for ease of use, as the data would otherwise be inaccurate
Analyzed the data by calculating metrics like averages, percentages
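
The exploration was done with Excel pivot tables. As a hedged illustration, the same checks (counting missing ratings, averaging the valid ones, and grouping ratings into bands) might look like this in Python. The sample ratings and the band boundaries are hypothetical, not the ones used in the project.

```python
import math
from collections import Counter
from statistics import mean

# Hypothetical ratings; NaN marks a missing value.
ratings = [4.1, 3.9, float("nan"), 4.7, 4.5, float("nan"), 2.8]

# Separate valid ratings from missing ones.
valid = [r for r in ratings if not math.isnan(r)]
missing_count = len(ratings) - len(valid)
missing_pct = 100 * missing_count / len(ratings)

# Average over valid ratings only, so NaN does not distort the result.
average_rating = mean(valid)


def rating_band(rating):
    """Assign a rating to a band (the boundaries are an assumption)."""
    if rating < 3:
        return "below 3"
    if rating < 4:
        return "3 to 4"
    return "4 to 5"


band_counts = Counter(rating_band(r) for r in valid)

print(f"missing: {missing_count} ({missing_pct:.1f}%)")
print(f"average rating: {average_rating:.2f}")
print(dict(band_counts))
```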

Data visualization:

In Tableau, I created charts and graphs to visualize the data, spot trends and identify patterns.

Results

Based on the analysis, the most popular category is Family, with 1826 apps accounting for 19% of all apps. Its average rating of 4.2 out of 5 indicates that it is a successful category.
With over 5000 apps rated between 4 and 5, the Everyone content rating is the most favourable, which shows that a large number of apps cater to all ages.
There is a massive gap in download rates: free applications account for 92.2% of downloads, compared to 7.8% for paid apps. This might suggest that users are more likely to try free apps first, which may or may not eventually lead them to get a premium or ad-free version of the app.

dataset

https://www.kaggle.com/datasets/lava18/google-play-store-apps

portfolio

https://docs.google.com/document/d/1IuISY_ln6ObUZwb32IEBY6ubZ83IHf37THSl0JhSsq4/edit?usp=sharing

Tableau

https://public.tableau.com/views/Google_16898027146250/Sheet6?:language=en-US&:display_count=n&:origin=viz_share_link

ETL Project

project description

Personal loans are a lucrative revenue stream for banks. The typical interest rate of a two-year loan in the United Kingdom is around 10%. This might not sound like a lot, but in September 2021 alone UK consumers borrowed around £1.5 billion, which would mean approximately £300 million in interest generated by banks over two years!
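
The £300 million figure can be checked with a few lines of Python. The sketch below assumes simple (non-compounding) interest, which is the reading that reproduces the number in the text; the function names are hypothetical.

```python
def simple_interest(principal, annual_rate_percent, years):
    """Interest earned without compounding."""
    return principal * annual_rate_percent * years / 100


def compound_interest(principal, annual_rate_percent, years):
    """Interest earned with yearly compounding, shown for comparison."""
    return principal * ((1 + annual_rate_percent / 100) ** years - 1)


borrowed = 1_500_000_000  # £1.5 billion borrowed in September 2021
rate = 10                 # typical interest rate, in percent per year
term = 2                  # loan length in years

simple_total = simple_interest(borrowed, rate, term)
compound_total = compound_interest(borrowed, rate, term)

print(f"simple interest:   £{simple_total:,.0f}")
print(f"compound interest: £{compound_total:,.0f}")
```

With yearly compounding the total would be somewhat higher, so the £300 million in the text is best read as a simple-interest approximation.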

You have been asked to work with a bank to clean and store the data they collected as part of a recent marketing campaign, which aimed to get customers to take out a personal loan. They plan to conduct more marketing campaigns going forward so would like you to set up a PostgreSQL database to store this campaign's data, designing the schema in a way that would allow data from future campaigns to be easily imported.
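
One way to make future campaigns easy to import is to keep campaigns in their own table and reference them by key. The sketch below is only an illustration of that idea: it uses sqlite3 from the Python standard library as a stand-in for PostgreSQL, and every table name, column name and sample row is hypothetical rather than taken from the project's actual schema.

```python
import sqlite3

# Hypothetical schema: clients and campaigns are stored once, and each
# contact row links one client to one campaign.
SCHEMA = """
CREATE TABLE client (
    client_id INTEGER PRIMARY KEY,
    age       INTEGER,
    job       TEXT
);
CREATE TABLE campaign (
    campaign_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE campaign_contact (
    contact_id  INTEGER PRIMARY KEY,
    campaign_id INTEGER NOT NULL REFERENCES campaign(campaign_id),
    client_id   INTEGER NOT NULL REFERENCES client(client_id),
    took_loan   INTEGER NOT NULL CHECK (took_loan IN (0, 1))
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up rows: a later campaign only adds rows, with no schema change.
conn.execute("INSERT INTO client VALUES (1, 41, 'technician')")
conn.executemany(
    "INSERT INTO campaign VALUES (?, ?)",
    [(1, "campaign 1"), (2, "campaign 2")],
)
conn.executemany(
    "INSERT INTO campaign_contact (campaign_id, client_id, took_loan) "
    "VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 1, 1)],
)

loans_per_campaign = conn.execute(
    "SELECT c.name, SUM(cc.took_loan) "
    "FROM campaign AS c "
    "JOIN campaign_contact AS cc ON cc.campaign_id = c.campaign_id "
    "GROUP BY c.campaign_id, c.name "
    "ORDER BY c.campaign_id"
).fetchall()
print(loans_per_campaign)
```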

Github link:

https://github.com/NqobileMsibi/Bank-Marketing-Database.git

database model:

https://app.quickdatabasediagrams.com/#/d/gZ5ubT

Contact Details

ADDRESS

Edleen, Kempton Park, South Africa

Phone

081 568 2799

nqobilemsibi83@gmail.com