Pandas Introduction

In this article, we’ll discuss pandas, one of the most popular Python data analysis libraries.

Data analysis is a process of cleaning, exploring, organizing, describing, and visualizing data.

Pandas is mainly used for cleaning and exploring the data.

The two primary data structures in pandas are Series and DataFrame, both built on top of NumPy.

PANDAS SERIES:

A pandas Series is a one-dimensional array that contains a sequence of values. It’s like a NumPy array, but every value has an associated index label. Like a NumPy array, a Series has a single data type; if we mix values of different types, pandas stores them under the generic object dtype.

Initializing Pandas Series:

We can initialize a pandas Series in multiple ways. The most common way is to use Python containers such as lists or dictionaries.
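As a minimal sketch, here is a Series built from a plain Python list. The values are made up for illustration.

```python
import pandas as pd

# Illustrative values only; any list of numbers would do.
s = pd.Series([10, 20, 30, 40])

# Printing a Series shows the index on the left and the values on the right.
print(s)
```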

The output consists of two columns. The first is the index, while the second is the values.

dtype: int64 denotes that the data type of the values in the Series is int64.

Since we did not explicitly specify an index, a default integer index from 0 to N-1 is created, where N is the length of the data.

The following code will create a Series with custom index labels.
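One way to do this, sketched with made-up labels and values, is to pass the index argument to the constructor.

```python
import pandas as pd

# The labels "a", "b", "c" are placeholders chosen for this example.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s)
```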

Let’s see how to create a Series using a Python dictionary.

When using a dictionary, the keys of the dictionary are used as the index labels.
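A short sketch of the dictionary case follows. The subject names and marks are invented for the example.

```python
import pandas as pd

# Hypothetical data: the keys become index labels, the values become the data.
marks = {"math": 90, "physics": 85, "chemistry": 78}
s = pd.Series(marks)

print(s)
```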

You can also create a Series using NumPy functions.
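For instance, assuming we want five evenly spaced numbers between 0 and 1, we might pass the result of np.linspace straight to the constructor.

```python
import numpy as np
import pandas as pd

# np.linspace(0, 1, 5) produces five evenly spaced floats from 0 to 1 inclusive.
s = pd.Series(np.linspace(0, 1, 5))

print(s)
```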

We can examine the index of a Series through its index attribute, and retrieve the values through its values attribute.
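A small sketch of both attributes, using a made-up Series:

```python
import pandas as pd

# Placeholder labels and values for illustration.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s.index)   # an Index object holding the labels
print(s.values)  # a NumPy array holding the data
```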

Peek into Data:

The head() method lets us peek at the beginning of the data. By default, it returns the first five rows.

We can also specify the number of rows. For example, the following code will print the first three rows of the data.

Similarly, we can peek at the last five rows of our data using the tail() method.
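The three calls described above can be sketched on a made-up Series of ten values:

```python
import pandas as pd

# Ten illustrative values: 10, 11, ..., 19.
s = pd.Series(range(10, 20))

first_five = s.head()    # default: first five rows
first_three = s.head(3)  # explicit row count
last_five = s.tail()     # default: last five rows

print(first_three)
```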

To count the number of elements in the Series, we can use the size attribute, which returns an integer, or the shape attribute, which returns a one-element tuple.

Alternatively, we can also use the len function.
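Here is a quick comparison of the three options on an illustrative Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

n_size = s.size    # plain integer
n_shape = s.shape  # tuple with a single entry
n_len = len(s)     # built-in len works too

print(n_size, n_shape, n_len)
```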

The data in the pandas series can be accessed in two ways: either through their index labels or through their numerical position.

The following code demonstrates how to access the value using index labels.

We can also access multiple row values by passing a list of index labels.
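A minimal sketch of label-based and position-based access, with placeholder labels. Here iloc is used for the numerical position.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s["b"]        # single value by index label
several = s[["a", "c"]]  # list of labels returns a smaller Series
by_position = s.iloc[0]  # first element by numerical position

print(by_label, by_position)
print(several)
```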

PANDAS DATAFRAME:

A DataFrame is a two-dimensional data structure made up of rows and columns.

Unlike a Series, which has a single index with a label for each element, a DataFrame has two indexes: a row index and a column index.

We can create a DataFrame in multiple ways. The following code creates a DataFrame using a dictionary of lists.
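As a sketch, each dictionary key becomes a column name and each list becomes that column's values. The state names and numbers below are placeholders, not real figures.

```python
import pandas as pd

# Hypothetical data; all lists must have the same length.
df = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [100, 200, 300],
    "schools": [10, 20, 30],
})

print(df)
```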

Now let’s see how we can create a DataFrame using a numpy array.
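A minimal sketch, assuming a small 3x2 integer array and column names "x" and "y" picked for the example:

```python
import numpy as np
import pandas as pd

# np.arange(6).reshape(3, 2) gives rows [0, 1], [2, 3], [4, 5].
arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr, columns=["x", "y"])

print(df)
```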

In practice, we rarely build a DataFrame from Python containers. More often, we use the pandas data-reading functions to create one.

Pandas has a wide range of data readers for various formats, from CSV through SQL databases to pickle files.

The following code will read a CSV file and save it as a pandas DataFrame object.

You can download the dataset here.
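Since the dataset file itself is not reproduced here, the sketch below reads a tiny in-memory CSV with made-up rows instead. With a real file, we would pass its path to pd.read_csv in place of the StringIO object.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real dataset file.
csv_text = "state,population,schools\nState A,100,10\nState B,200,20\n"

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data)
```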

We can quickly peek at the beginning of the data by using head().

To get the number of rows and columns, we can use the shape attribute.

The shape attribute returns a tuple in which the first value is the number of rows, and the second number is the number of columns.

The column names and their data types can be accessed through the columns and dtypes attributes.

Similarly, the row index is accessible via the index attribute.
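These attributes can be sketched on a small DataFrame with placeholder data:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [100, 200, 300],
    "schools": [10, 20, 30],
})

n_rows, n_cols = df.shape  # (rows, columns)
print(n_rows, n_cols)
print(df.columns)          # column labels
print(df.dtypes)           # one data type per column
print(df.index)            # row labels
```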

Getting info about your data:

We can get more information about our data by using the info method.

It reports the number of rows, the name and type of each column, the non-null count per column, and the memory usage of the DataFrame.

To get a summary of basic statistics such as percentiles, mean, and standard deviation, we can use the describe method.
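Both methods can be tried on a small made-up DataFrame. Note that info prints its report directly, while describe returns a new DataFrame of statistics covering the numeric columns by default.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [100, 200, 300],
    "schools": [10, 20, 30],
})

df.info()              # prints row count, column types, memory usage

stats = df.describe()  # count, mean, std, min, percentiles, max
print(stats)
```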

Change Index:

By default, the index is a sequence of integers that starts at 0 and ends at n-1.

However, you can change the index to be one of your columns by using set_index.

In the following code, we use the set_index method to set our DataFrame’s index to the state column.

Note that the original ‘data’ object is changed when inplace=True.

We can use reset_index() to reset the index to its default values.

Note that we didn’t set inplace=True, so the operation doesn’t affect the original data object.
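The difference between the two calls can be sketched as follows, using placeholder rows and assuming a column named state:

```python
import pandas as pd

# Hypothetical data for illustration.
data = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [100, 200, 300],
})

# inplace=True modifies `data` itself and returns None.
data.set_index("state", inplace=True)

# Without inplace=True, reset_index returns a new DataFrame
# and leaves `data` untouched.
restored = data.reset_index()

print(data.index.name)
print(restored)
```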

Subsetting rows and columns:

If we want to access the data in individual columns, we can use the [ ] operator.

To specify multiple columns, we need to pass in a Python list.

We can subset rows in various ways. For label-based indexing we can use loc, and for position-based indexing we can use iloc.

If you want to get the numerical location of the index, you can use the index.get_loc() method.

We can also get the value of a particular cell. For example, let’s find out how many schools there are in Maharashtra.

Alternatively, we can also use iloc if we need to use the numerical position.
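The subsetting operations in this section can be sketched together on a small DataFrame. The row labels and numbers are placeholders rather than values from the real dataset.

```python
import pandas as pd

# Hypothetical data, indexed by a made-up state label.
df = pd.DataFrame(
    {"population": [100, 200, 300], "schools": [10, 20, 30]},
    index=["State A", "State B", "State C"],
)

one_col = df["population"]                # single column -> Series
two_cols = df[["population", "schools"]]  # list of columns -> DataFrame

row_by_label = df.loc["State B"]          # label-based row lookup
row_by_pos = df.iloc[1]                   # position-based row lookup

pos = df.index.get_loc("State B")         # numerical position of a label

cell = df.loc["State B", "schools"]       # single cell by labels
same_cell = df.iloc[pos, 1]               # same cell by positions

print(pos, cell, same_cell)
```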

Renaming column names:

We can use the .rename() method to rename the column names.

We’ll remove the parentheses from the population and literacy rate columns and replace the white space in the school column with an underscore.

We can also rename indexes in the same way.
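The dataset's exact column names are not shown here, so the sketch below assumes hypothetical names such as "(population)" and "school count". The rename method takes a mapping from old names to new names and returns a new DataFrame.

```python
import pandas as pd

# Hypothetical column names and values, chosen only to illustrate rename().
df = pd.DataFrame(
    {
        "(population)": [100, 200],
        "(literacy_rate)": [70.0, 80.0],
        "school count": [10, 20],
    },
    index=["State A", "State B"],
)

# Rename columns: old name -> new name.
renamed = df.rename(columns={
    "(population)": "population",
    "(literacy_rate)": "literacy_rate",
    "school count": "school_count",
})

# Index labels can be renamed the same way through the index argument.
renamed = renamed.rename(index={"State A": "A"})

print(renamed)
```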

Iterating through rows and columns:

By using the iterrows() method, we can iterate through the rows.

Another way to iterate over rows is to use itertuples(). It yields each row as a named tuple.

To iterate over columns, we can use items(), which replaces the older iteritems() method removed in pandas 2.0. It iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
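The three iteration styles can be sketched on a tiny made-up DataFrame. The row method is spelled itertuples(), and the column loop uses items(), since iteritems() is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {"population": [100, 200], "schools": [10, 20]},
    index=["A", "B"],
)

# iterrows() yields (index label, row as a Series).
rows_seen = [(label, int(row["population"])) for label, row in df.iterrows()]

# itertuples() yields named tuples; the index is available as .Index.
tuples_seen = [(row.Index, int(row.schools)) for row in df.itertuples()]

# items() yields (column name, column as a Series).
columns_seen = [(name, column.tolist()) for name, column in df.items()]

print(rows_seen)
print(tuples_seen)
print(columns_seen)
```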

Creating and Deleting a column:

Let’s see how to add a new column to a pandas DataFrame.

We can create a new column by using the [ ] operator.

We’ll create a new column pop_rank which will contain the rank of the population.

We can also use the assign() method to add a new column to a pandas DataFrame. It returns a new object with all original columns in addition to new ones.

We can remove a column by using the drop() method. The following code drops the lit_rank column.
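Here is a sketch of all three steps on placeholder data. Ranking in descending order, so that the largest value gets rank 1, is an assumption made for this example.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "state": ["State A", "State B", "State C"],
    "population": [300, 100, 200],
    "literacy_rate": [70.0, 90.0, 80.0],
})

# The [ ] operator adds the column to df itself.
df["pop_rank"] = df["population"].rank(ascending=False)

# assign() returns a new DataFrame and leaves df unchanged.
with_lit = df.assign(lit_rank=df["literacy_rate"].rank(ascending=False))

# drop() also returns a new DataFrame unless inplace=True is passed.
without_lit = with_lit.drop(columns="lit_rank")

print(with_lit)
```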

Rearrange Columns:

We can change the order of the columns in a pandas DataFrame easily. We just need to pass the list of column names in the desired order.
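A minimal sketch with placeholder column names:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "state": ["State A", "State B"],
    "population": [100, 200],
    "schools": [10, 20],
})

# Selecting with a list of names returns the columns in that order.
df = df[["schools", "state", "population"]]

print(df)
```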

Save DataFrame as CSV:

Once we are done, we can save the DataFrame as a CSV file. To demonstrate saving data to a CSV file, we’ll save the df object to a new file named data_new.csv.
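The sketch below writes a small made-up DataFrame to data_new.csv inside a temporary directory and reads it back to confirm the round trip. The temporary directory and index=False, which leaves the row index out of the file, are choices made for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "state": ["State A", "State B"],
    "population": [100, 200],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_new.csv")

    # index=False skips writing the row index as an extra column.
    df.to_csv(path, index=False)

    # Read the file back to check what was written.
    round_trip = pd.read_csv(path)

print(round_trip)
```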

SUMMARY:

In this article, we learned about some of the basics of pandas, one of the most popular data analysis libraries for Python.

There are two main data structures in pandas: Series and DataFrame.

Series: A pandas Series is a one-dimensional array that contains a sequence of values. It’s similar to a NumPy array, with an index label attached to each value.

DataFrame: A DataFrame is a two-dimensional data structure made up of rows and columns.

With a series of examples, we also learned some of the standard operations of pandas for data analysis.
