Pandas Tutorial

- 6 mins

This is a small summary of pandas commands; it is where I keep my pandas snippets for when I need them.

In case a deeper dive into the subject is wanted, don't hesitate to check out this much better tutorial. Let's jump in:

Import pandas

import pandas as pd

Write/Read DF as .h5:

my_df.to_hdf(path_to_folder+'/my_df_saving_name.h5', 'my_df_saving_name')
my_df = pd.read_hdf(path_to_folder+'/my_df_saving_name.h5')
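
HDF support requires the PyTables package (tables) to be installed, and a single .h5 file can hold several DataFrames under different keys; when more than one object is stored, read_hdf needs the key spelled out. A minimal sketch, assuming the key matches the one used above:

my_df = pd.read_hdf(path_to_folder+'/my_df_saving_name.h5', key='my_df_saving_name')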

Write/Read DF as .json:

my_df.to_json(path_or_buf=path_to_folder+'/my_df_saving_name.json')
my_df = pd.read_json(path_or_buf=path_to_folder+'/my_df_saving_name.json')

Add a row to DF (note the use of `ignore_index=True`, which tells the DataFrame to renumber the index as a simple running enumeration, as in a plain list).

my_df = my_df.append({'dir_name': dir_name, 'frame_name': frame_name,'false_positive': num_fp}, ignore_index=True)
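
Note that DataFrame.append was deprecated and removed in pandas 2.0; on newer versions the same row addition can be done with pd.concat. A minimal sketch under that assumption:

# pandas >= 2.0: build a one-row DataFrame and concatenate it
new_row = pd.DataFrame([{'dir_name': dir_name, 'frame_name': frame_name, 'false_positive': num_fp}])
my_df = pd.concat([my_df, new_row], ignore_index=True)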

Add a new column to DF:

my_df['false_negative'] = pd.Series(false_negative_list)

# note that len(false_negative_list) has to be equal to my_df.shape[0]
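
The same column can also be added with assign, which returns a new DF (the list length still has to match the number of rows):

my_df = my_df.assign(false_negative=false_negative_list)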

Access a certain column of the DF:

my_series = my_df['dir_name']
my_series = my_df.dir_name
my_series = my_df.loc[:,'dir_name']
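
Columns can also be selected by position rather than by name using iloc:

my_series = my_df.iloc[:, 0]  # first column, all rows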

A Series can be turned into a NumPy array using:

nparray = my_series.to_numpy()
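
To get a plain Python list instead of an array, tolist can be used:

my_list = my_series.tolist()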

Create an iterator over the rows of a DF:

df_iterator = my_df.iterrows()

Collapse duplicate rows and aggregate their values (group by a column and apply a function per group):

my_df = my_df.groupby('dir_name').agg({'false_negative':'mean','false_positive':'mean'})

# note that all aggregated column values must be numeric; if that is not the case (for example
# for false_positive), use: my_df['false_positive'] = pd.to_numeric(my_df['false_positive'])
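
After groupby, the grouping column ('dir_name' here) becomes the index; to keep it as a regular column, either of these works:

my_df = my_df.groupby('dir_name', as_index=False).agg({'false_negative':'mean','false_positive':'mean'})
# or equivalently:
# my_df = my_df.groupby('dir_name').agg({'false_negative':'mean','false_positive':'mean'}).reset_index()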

Drop duplicates (taking only the first value):

my_df_row_per_dir = my_df.drop_duplicates(subset=['dir_name'],keep='first')

Drop columns (drop returns a new DF unless inplace=True is passed):

my_df = my_df.drop(columns=['false_negative'])

Keep only the rows of the DF that match a condition (boolean filtering):

my_df_zero_fp = my_df[my_df['false_positive']==0]
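
Several conditions can be combined with & (and) and | (or), each condition wrapped in parentheses:

my_df_clean = my_df[(my_df['false_positive'] == 0) & (my_df['false_negative'] == 0)]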

Count occurrences in a certain column:

my_df['false_positive'].value_counts()

# this will produce a table in which each row holds a value (of 'false_positive') and its number of occurrences, for example:

false_positive value    number of occurrences
5                       1218837
1                        288189
10                       167364
6                        118085
17                        68663
14                        47808
18                        45225
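
value_counts can also report relative frequencies instead of raw counts:

my_df['false_positive'].value_counts(normalize=True)  # fractions that sum to 1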

Sort DF by column:

my_df.sort_values(by='false_positive',ascending=False)

View top / bottom:

my_df.head(3)
my_df.tail(3)
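
A random peek is also possible:

my_df.sample(3)  # three random rows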

Running over rows of the DF

Sometimes we wish to run over all rows of a specific DF; here is an example of how to work through them (iterrows yields an (index, row) pair, where the row is a Series):

for df_row_idx, df_row in df.iterrows():
    # do something with df_row, which holds a single row as a Series
    pass
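
When plain values are enough, itertuples is usually faster than iterrows; each yielded item is a namedtuple whose fields are the column names:

for row in df.itertuples():
    print(row.dir_name, row.false_positive)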

Small helpers for saving metadata along with DF to .h5

def h5store(filename, df, **kwargs):
    # store the DF under the key 'mydata' and attach any keyword arguments as metadata
    store = pd.HDFStore(filename)
    store.put('mydata', df)
    store.get_storer('mydata').attrs.metadata = kwargs
    store.close()

def h5load(store):
    # read the DF and its attached metadata back from an open HDFStore
    data = store['mydata']
    metadata = store.get_storer('mydata').attrs.metadata
    return data, metadata

metadata = {'City':'Tel-Aviv'}
h5store(filename, my_df, **metadata)

with pd.HDFStore(filename) as store:
    data, metadata = h5load(store)

Merge columns into a longer DF (exploding several columns at once requires pandas >= 1.3)

import pandas as pd
df = pd.read_csv('')

df = df.fillna('None')
df['A'] = df[['insight1','insight2','insight3','insight4']].values.tolist()
df['B'] = df[['insight_type1','insight_type2','insight_type3','insight_type4']].values.tolist()
df['C'] = df[['insight1_type_is_correct','insight2_type_is_correct','insight3_type_is_correct','insight4_type_is_correct']].values.tolist()

df = df.explode(['A','B','C'])
df = df.rename(columns={'A':'insight','B':'insight_type','C':'insight_type_is_correct'})
df = df.loc[df.insight != 'None']
df = df.loc[df.insight_type_is_correct != 'None']
df.insight_type_is_correct = df.insight_type_is_correct.apply(lambda x: 1 if x=='yes' else 0)

for i in range(len(df)):
    print(f'\ninsight: {df.iloc[i].insight}\ninsight_type: {df.iloc[i].insight_type}\ncorrect?: {df.iloc[i].insight_type_is_correct}')

Split to train/test/val

import numpy as np
df = pd.read_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_for_appen_insight_classification.csv')
np.random.seed(0)
train_msk = np.random.rand(len(df)) < 0.9
train_df = df[train_msk]
test_val_df = df[~train_msk]
test_msk = np.random.rand(len(test_val_df)) < 0.5
test_df = test_val_df[test_msk]
val_df = test_val_df[~test_msk]
train_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_train.csv')
val_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_val.csv')
test_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_test.csv')
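
A quick sanity check on the resulting split proportions (roughly 90/5/5 given the thresholds above):

total = len(df)
print(f'train: {len(train_df)/total:.2%}, val: {len(val_df)/total:.2%}, test: {len(test_df)/total:.2%}')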