

Data Analyst day-to-day: Real Life Case Study


Many people want to become data analysts and data scientists. However, few of those actually working in the field explain exactly what to expect. Most videos on YouTube are very vague and made using datasets from Kaggle, which in many cases are clean (in general structure) and only require filling in missing values.
At work, the data is ugly. In the following, we will discuss why...

Who should Read it...

Anyone looking to start a career as a Data Analyst or a similar role.

How to Read it

Fully. Or watch the video on YouTube.

Dataset

The dataset used in the following can be considered representative of what to expect from a client. It is a real-life template, with all the values changed for privacy purposes.

(Screenshot: the raw client template in wide format)
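
The screenshot of the template is not reproduced here, but based on the code that follows, a hypothetical miniature of its layout can be built directly (the "Store" and "Region" columns and all values are invented for illustration; only "Itemcode" and the "Period_2020" / "Month 01" headers appear in the original):

import pandas as pd

# Two header rows, as in the real file: a period level and a month level.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Store"),     # hypothetical index column
    ("Unnamed: 1_level_0", "Region"),    # hypothetical index column
    ("Unnamed: 2_level_0", "Itemcode"),
    ("Period_2020", "Month 01"),
    ("Period_2020", "Month 02"),
])
toy = pd.DataFrame(
    [["Store A", "North", 2556, 1200.0, 1350.0],
     ["Store B", "South", 2556, 980.0, None]],
    columns=columns)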

Now the issue with this data is that it is wide, so it isn't possible to visualize it in Python unless some transformations are applied. The ultimate goal is:
  1. To extract the year from "Period_2020", then extract the 01 from "Month 01", so we can create a reference date. 
  2. To list the transactions vertically (long format).

Using Python...

  1. Importing the needed packages.
    • import pandas as pd
      from dateutil.relativedelta import relativedelta # explicit import instead of a wildcard
      import datetime
      import matplotlib.pyplot as plt # for visualization
  2. Reading the Excel sheet with the correct header.
    • file_name = "../Datasets/ToyDataset.xlsx"
      data = pd.read_excel(file_name, header=[1,2]) # the two header levels sit on the sheet's 2nd and 3rd rows
  3. Renaming the columns while conditioning on the 'Unnamed' columns.
    • # flattens ("Period_2020", "Month 01") into "Period_2020_Month 01"; index columns keep their own name
      data.columns = [x[1] if "Unnamed" in x[0] else '{}_{}'.format(x[0], x[1]) for x in data.columns]
  4. Dividing the dataset into two parts: the first (up to "Itemcode") acts as an index and will not be modified; the second (the revenue values) is stacked into long format.
    • part1 = data.loc[:,:"Itemcode"].copy()
      part2 = data.loc[:,"Period_2020_Month 01":].stack().reset_index().copy() # stack() pivots the month columns into rows
      part2.columns = ["index","date","value"]
  5. Merging the two parts again.
    • full_data = pd.merge(part1,part2,how="inner",left_index=True,right_on="index")
  6. Creating the reference date.
    • full_data["Year"] = 2020
      full_data["Month"] = full_data["date"].str[-2:].astype(int) # "...Month 01" -> 1
      # relativedelta(day=31) snaps each date to the last day of its month
      full_data['reference_date'] = [relativedelta(day = 31) + datetime.datetime(year,month,1) 
                                     for year,month in zip(full_data["Year"],full_data["Month"])]
      full_data["reference_date"] = pd.to_datetime(full_data["reference_date"])
  7. Cleaning the final output.
    • full_data.drop(columns=["index","Year","Month","date"],inplace=True)
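
As an aside, the same reshape can be done in one step with pandas' melt instead of stack plus merge. A minimal sketch, assuming the flattened column names from step 3 (note that, unlike stack(), melt keeps rows whose value is NaN):

id_cols = list(data.columns[:data.columns.get_loc("Itemcode") + 1]) # everything up to "Itemcode"
long_df = data.melt(id_vars=id_cols, var_name="date", value_name="value")
long_df = long_df.dropna(subset=["value"]) # match stack()'s default of dropping missing values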

Input versus the output...

Now that the main work is done, we can see how the two datasets differ:
(Screenshot: the output, first 5 rows shown)

(Screenshot: the input, first 5 rows shown)

As you can see, transforming from wide to long duplicates the first three columns; however, 'reference_date' becomes part of the new unique identifier.
In addition, with this new format it becomes easier to choose one 'Itemcode' and create a graph showing the revenues over time made by that product across all the stores.
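
That uniqueness is easy to verify programmatically. A quick sanity check, as a sketch (the "Store" and "Region" column names are assumed, since only "Itemcode" appears in the code above):

# the identifying columns plus reference_date should uniquely identify each row
key_cols = ["Store", "Region", "Itemcode", "reference_date"] # "Store" and "Region" are assumed names
assert not full_data.duplicated(subset=key_cols).any(), "duplicate keys found"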

Visualizing the results...

subset = full_data[full_data["Itemcode"] == 2556] # keep a single product
one_item = subset.groupby("reference_date")["value"].sum().reset_index() # total revenue per month across all stores
plt.plot(one_item["reference_date"],one_item["value"],"-o")
plt.xticks(rotation=30) # tilt the date labels for readability
plt.show()
Using these few lines of code, we select one item and sum all its revenues by 'reference_date'. We end up with the following graph.
(Figure: monthly revenue over time for Itemcode 2556)

Discussion

At school, at university, or even on sites like Kaggle and UCI, datasets are usually clean and come in good formats. The data scientist/analyst doesn't have to invest much time in cleaning the data or trying to read it correctly. In real life, however, you will sometimes receive four Excel sheets per month for weekly sales, all following the same template. The best way to process them is to create a function/script that you run on every new dataset to transform it into the correct shape. The problem is that not every client will send the same template, which means that for every client you have, you will need a function for their template...
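To make that concrete, here is a minimal sketch of such a function for the template in this post, reusing the imports and steps above (the year is passed as a parameter, which is an assumption; a more robust version would parse it from the headers):

def load_client_template(file_name, year=2020):
    """Read one Excel file in this client's template and return it in long format with a month-end reference_date."""
    data = pd.read_excel(file_name, header=[1,2])
    data.columns = [x[1] if "Unnamed" in x[0] else '{}_{}'.format(x[0], x[1]) for x in data.columns]
    part1 = data.loc[:,:"Itemcode"].copy()
    part2 = data.loc[:,"Period_{}_Month 01".format(year):].stack().reset_index()
    part2.columns = ["index","date","value"]
    full = pd.merge(part1,part2,how="inner",left_index=True,right_on="index")
    months = full["date"].str[-2:].astype(int)
    full["reference_date"] = pd.to_datetime(
        [relativedelta(day=31) + datetime.datetime(year, m, 1) for m in months])
    return full.drop(columns=["index","date"])

Each new monthly file then becomes a one-line call, e.g. full_data = load_client_template("../Datasets/NewMonth.xlsx") (file name hypothetical).
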
The bigger problem is that the people filling in these datasets usually don't know anything about coding. All they care about (in many cases) is having a 'pretty' Excel file. Unfortunately, in many cases, the prettier the Excel sheet, the harder it is to deal with as a data analyst.
