Many people approaching data science and data analysis ask which programming language to learn. The two most frequent recommendations are R and Python. However, recommending Python without much context is unhelpful and can lead aspiring data analysts or scientists to waste time learning tools they do not need. Each field has its own tools and priorities: the following explains what you need to start your journey as a data analyst with Python.
Why learn Python in 2021?
How to learn Python in 2021? (For beginners)
Calculator
The first step in learning Python is to use it as a calculator. Try some basic calculations to familiarize yourself with the operators and how they are defined in Python (e.g. exponentiation is written '**', not '^', which is the bitwise XOR operator in Python). You can find a list of operators on W3Schools.
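As a small illustration of the calculator idea, the sketch below runs through the common arithmetic operators. The numbers and variable names are arbitrary choices made for this example.

```python
# Basic arithmetic operators, stored in variables so the results can be inspected
addition = 7 + 2         # 9
subtraction = 7 - 2      # 5
multiplication = 7 * 2   # 14
division = 7 / 2         # true division always returns a float: 3.5
floor_division = 7 // 2  # division rounded down: 3
remainder = 7 % 2        # modulo: 1
power = 7 ** 2           # exponentiation: 49

# Careful: ^ is bitwise XOR in Python, not a power operator
xor = 7 ^ 2              # 0b111 XOR 0b010 = 0b101 = 5

print(division, floor_division, remainder, power, xor)
```

Typing the same expressions one at a time in an interactive session or a notebook cell is a quick way to see each result on its own.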
Creating variables
Creating variables is the most important thing to learn in Python. As a matter of fact, almost everything you do in Python revolves around defining variables.
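A minimal sketch of defining and reusing variables could look like this. The names price, quantity and total are made up for the example.

```python
# A variable is a name bound to a value
price = 19.99
quantity = 3

# Variables can be combined in expressions and the result stored in a new variable
total = price * quantity

# A name can be rebound to a new value at any time
quantity = quantity + 1
new_total = price * quantity

print(total, new_total)
```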
Data Types
Data types are important to cover. It is essential to understand how and why data types can or cannot work together. As part of this topic, you can learn the different ways to define numbers (integers, floats and complex numbers) and how casting works (e.g. converting a number to a string or vice versa). Finally, you will explore how Booleans work and why they matter, especially for creating masks in data analysis.
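The sketch below touches each of these points with plain Python: the numeric types, casting, a type mismatch, and a Boolean mask. The values are invented for illustration, and the mask is built with ordinary lists here rather than with a data analysis library.

```python
count = 42       # int
ratio = 0.75     # float
z = 2 + 3j       # complex
label = "42"     # str

# Casting: converting between types
as_number = int(label)    # string -> integer
as_text = str(count)      # integer -> string
as_float = float("3.5")   # string -> float

# Some types cannot work together: adding a str and an int raises TypeError
try:
    result = label + count
except TypeError as exc:
    error_message = str(exc)

# Comparisons produce Booleans
is_large = count > 10

# A mask is a sequence of Booleans used to keep or drop values
values = [5, 12, 8, 20]
mask = [v > 10 for v in values]
selected = [v for v, keep in zip(values, mask) if keep]

print(type(count), type(ratio), type(z), type(label))
print(mask, selected)
```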
Data structures
Lists, tuples, sets, dictionaries and arrays. Learning these at a high level is crucial: without them, you cannot write even beginner-level code. In-depth knowledge of these concepts is beneficial for intermediate to advanced programming, but it requires a bigger time investment. It is not worth trying to master data structures from the beginning; instead, revisit them multiple times along the way.
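At a high level, the built-in structures can be sketched as follows. The example data is invented, and arrays are left out on the assumption that in data analysis they usually come from NumPy rather than from core Python.

```python
# List: ordered, mutable, allows duplicates
fruits = ["apple", "banana", "apple"]
fruits.append("cherry")

# Tuple: ordered, immutable
point = (3, 4)

# Set: unordered, keeps only unique items
unique_fruits = set(["apple", "banana", "apple"])

# Dictionary: maps keys to values
prices = {"apple": 0.5, "banana": 0.25}
prices["cherry"] = 2.0          # add a new key
apple_price = prices["apple"]   # look up a value by key

print(fruits, point, unique_fruits, prices)
```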
Loops (for/while) and conditions (if/else)
Loops are also essential in Python. You need to know how to iterate over your data. The most important looping skill to have is the list comprehension. Once you have covered list comprehensions, you can be assured that you are on the right track.
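Here is a short sketch of a for loop with a condition, a while loop, and the equivalent list comprehensions. The temperature values and the 15-degree threshold are arbitrary choices for the example.

```python
temperatures_c = [12.5, 18.0, 21.3, 9.8]

# for loop combined with if/else
labels = []
for t in temperatures_c:
    if t >= 15:
        labels.append("warm")
    else:
        labels.append("cold")

# while loop: repeats until the condition becomes False
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1

# List comprehensions: build a new list in a single expression
temperatures_f = [t * 9 / 5 + 32 for t in temperatures_c]   # transform every item
warm_only = [t for t in temperatures_c if t >= 15]          # keep only some items

print(labels, steps, warm_only)
```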
Writing functions
Learning Python is largely about writing concise, helpful functions to automate your tasks. With everything covered so far, you should now be able to write very basic functions that perform simple tasks.
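As an illustration, the two small functions below combine variables, data types, lists and formatting. The function names mean and describe are hypothetical and were chosen only for this sketch.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def describe(values, label="values"):
    """Return a one-line text summary (min, max, mean) of a list of numbers."""
    return f"{label}: min={min(values)}, max={max(values)}, mean={mean(values):.2f}"


summary = describe([2, 4, 9], label="scores")
print(summary)
```

Keeping each function focused on one task makes it easy to reuse it later, for example inside another function as describe does with mean.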
Learn how to find help and answers to errors
Google and Stack Overflow are the two best resources for finding answers to your coding-related questions. In fact, googling is a very valuable skill. The better you are at googling, the easier your coding experience will get over time. Try watching a few videos on YouTube to improve your googling skills; you will be thankful for it.
Best books and resources to learn Python
- Programming with Mosh (YouTube)
- Automate the Boring Stuff with Python (Book)
- W3Schools
- DataCamp
What does 'Python for data analysis' mean?
Personally, when I say Python for DA, I mean learning to use notebooks (Jupyter Notebook or JupyterLab). Notebooks are very helpful for performing analysis and showcasing results easily. However, using notebooks to write scripts or anything related to software development is strongly discouraged.
Here are my top reasons to use notebooks:
- Easy to use
- Good for writing algorithms in chunks
- Good for running small code chunks
- Good for visualization
- Good for displaying data (tabular data in particular)
What are the essential packages and libraries for data analysis?
Four packages/libraries are the most used in data analysis: pandas, NumPy, Matplotlib and seaborn. Pandas and NumPy help with cleaning and manipulating datasets. Matplotlib and seaborn help with visualizing the data. These four packages are the pillars of data analysis in Python, and most everyday tasks rely on at least one of them.
Pandas for data analysis
Pandas (conventionally imported as pd) is a foundational package for data analysis. Using pandas, you will learn how to read data from files (Excel, CSV…). You will also learn how to manipulate and transform datasets. Pandas is mainly used for tabular data, so if you are familiar with Excel transformations, pandas will serve you well! Here are some important pandas functions:
- Reading files (pandas.read_excel, pandas.read_csv…)
- Pivoting (pandas.pivot, pandas.pivot_table)
- Merging and joining (pandas.merge, pandas.DataFrame.join)
- Applying a function (pandas.DataFrame.apply)
- DataFrames for tabular data (pandas.DataFrame)
- …
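The functions listed above can be tried on a tiny dataset, as in the sketch below. The store and sales data are invented, and the CSV is read from an in-memory string (via io.StringIO) only so that the example runs without a file on disk; with a real file you would pass its path to pandas.read_csv.

```python
import io

import pandas as pd

# Reading data: an in-memory CSV stands in for a real file here
csv_text = io.StringIO(
    "store,month,sales\n"
    "A,Jan,100\n"
    "A,Feb,120\n"
    "B,Jan,80\n"
    "B,Feb,95\n"
)
sales = pd.read_csv(csv_text)

# A DataFrame built directly from a dictionary
managers = pd.DataFrame({"store": ["A", "B"], "manager": ["Ana", "Ben"]})

# Pivoting: one row per store, one column per month
wide = sales.pivot_table(index="store", columns="month", values="sales", aggfunc="sum")

# Merging: attach the manager of each store to every sales row
merged = pd.merge(sales, managers, on="store", how="left")

# Applying a function to every value of a column
merged["sales_k"] = merged["sales"].apply(lambda x: x / 1000)

print(wide)
print(merged)
```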
NumPy for data analysis
NumPy is the core package for numerical computing in Python. Its vectorized operations make computations on matrices and vectors (known as arrays in NumPy lingo) fast, which benefits dataframes and linear algebra applications. NumPy also has additional 'add-ons' that make it more efficient (however, these tools are beyond the scope of a beginner's learning). To get a sense of NumPy's potential:
- Creating matrices and vectors (numpy.array)
- Creating distributions and random numbers (numpy.random.normal…)
- Dot products and more (numpy.dot)
- ...
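The sketch below tries each of the items above on small arrays. The values, the seed and the sample size are arbitrary choices for the example.

```python
import numpy as np

# Creating a vector and a matrix
vector = np.array([1.0, 2.0, 3.0])
matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])

# Vectorization: the operation applies to every element without a Python loop
doubled = vector * 2

# Dot products: matrix-vector and vector-vector
product = np.dot(matrix, vector)
inner = np.dot(vector, vector)

# Random numbers drawn from a normal distribution (seeded for repeatability)
np.random.seed(0)
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)

print(doubled, product, inner)
print(samples.mean(), samples.std())
```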
Matplotlib and Seaborn for data analysis
Matplotlib and seaborn are useful for creating visuals. The syntax is simple and intuitive. The main difference between the two is that seaborn, which is built on top of Matplotlib, offers a higher-level interface for statistical visualizations. For example, it is easier to draw multiple categories at once for comparison. In addition, seaborn ships with more polished themes that improve the overall quality of visuals.
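To give a feel for the syntax, here is a minimal Matplotlib sketch that draws two categories on the same axes. The sales figures are invented, and the "Agg" backend line is an assumption for running the code as a plain script without a display; inside a notebook it can be dropped.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; not needed in a notebook
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar"]
sales_by_store = {"A": [100, 120, 90], "B": [80, 95, 110]}

fig, ax = plt.subplots()

# With Matplotlib, each category is drawn with its own plot call
for store, sales in sales_by_store.items():
    ax.plot(months, sales, marker="o", label=f"Store {store}")

ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly sales by store")
ax.legend()
```

With seaborn, a comparable chart can be drawn from a long-format DataFrame in a single call by passing the category column to the hue argument, which is the kind of convenience described above.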
A common workflow in pandas for data analysis
The first step in any data analysis project is preparing the data for the actual analysis. Data is usually very messy, which means a large amount of time is dedicated to cleaning. The following is a basic workflow for each dataset that you work with:
- Importing the data correctly (headers are correct, index is correct…)
- Checking for missing values to understand how good or bad the data is
- Dropping duplicates to avoid issues along the analysis
- Performing transformations (wide to long, long to wide…)
- Verifying that the values of your transformed data are still good and nothing is missing
- Checking that all the values are of the right type
- Plotting a few graphs for a final check
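The workflow above can be sketched on a tiny, deliberately messy dataset. The data and column names are invented, the CSV is read from an in-memory string so the example runs without a file, and the final plotting step is left out to keep the sketch short.

```python
import io

import pandas as pd

# 1. Import the data (an in-memory CSV stands in for a real file)
raw = io.StringIO(
    "id,city,jan,feb\n"
    "1,Paris,10,12\n"
    "2,Rome,,7\n"
    "2,Rome,,7\n"
    "3,Oslo,5,6\n"
)
df = pd.read_csv(raw)

# 2. Check for missing values in each column
missing_per_column = df.isna().sum()

# 3. Drop exact duplicate rows
df = df.drop_duplicates()

# 4. Transform from wide to long: one row per (id, city, month)
long = df.melt(id_vars=["id", "city"], var_name="month", value_name="sales")

# 5. Verify the transformed data: count what is still missing
missing_after = long["sales"].isna().sum()

# 6. Check that the values have the right type
long["sales"] = long["sales"].astype("float64")

print(missing_per_column)
print(long)
print(long.dtypes)
```

How to handle the remaining missing value (drop it, fill it, or keep it) depends on the analysis, so the sketch only reports it.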