Pandas Tutorial
- 6 minsThis is a small summary of pandas commands, this is where I keep my pandas snippets for a case of need.
In case a deeper dive in to the subject is wanted, don’t hasitate to check out this much better tutorial. Let’s jump in:
Import pandas
import pandas as pd
Write/Read DF as .h5:
my_df.to_hdf(path_to_folder+'/my_df_saving_name.h5', 'my_df_saving_name')
my_df = pd.read_hdf(path_to_folder+'/my_df_saving_name.h5')
Write/Read DF as .json:
my_df.to_json(path_or_buf=path_to_folder+'/my_df_saving_name.json')
my_df = pd.to_json(path_or_buf=path_to_folder+'/my_df_saving_name.json')
Add a row to DF (note the use of ` ignore_index=True` which tells the DataFrame to set the index as row enumeration, as in a simple list).
my_df = my_df.append({'dir_name': dir_name, 'frame_name': frame_name,'false_positive': num_fp}, ignore_index=True)
Add a new column to DF:
my_df['false_negative'] = pd.Series(false_negative_list)
# note that len(false_negative_list) has to be equal to my_df.shape[0]
Access a certain column of the DF:
my_series = my_df['dir_name']
my_series = my_df.dir_name
my_series = my_df.loc[:,'dir_name']
A series can be turned in to a list using:
nparray = series.to_numpy()
Create iterator of rows of DF:
df_iterator = my_df.iterrows()
Collapse rows and apply manipulation over duplicates:
my_df = my_df.groupby('dir_name').agg({'false_negative':'mean','false_positive':'mean'})
# note that all coulmn values must be numeric, in case not true (for example for false_positive), can use: my_df['false_positive'] = pd.to_numeric(my_df['false_positive'])
Drop duplicates (taking only the first value):
my_df_row_per_dir = my_df.drop_duplicates(subset=['dir_name'],keep='first')
my_df.drop(columns=['false_negative'])
Keep certain values of DF:
my_df_zero_fp = my_df[my_df['false_positive']==0]
Count occurrences in a certain column:
my_df['false_positive'].value_counts()
# this will produce a table in which each row holds a value (of 'false_positive') and the number of occurences, for example:
false positive value - number of occurences
5 1218837
1 288189
10 167364
6 118085
17 68663
14 47808
18 45225
Sort DF by column:
my_df.sort_values(by='false_positive',ascending=False)
View top / bottom:
my_df.head(3)
my_df.bottom(3)
Running over rows of the DF
Sometimes we wish to run over all objects in a specific DF, here is an example of working through
for df_row_num in df.iterrows():
df_row = df.loc[df_row_num]
~do something using row as a one liner DF
Small helpers for saving metadata along with DF to .h5
def h5store(filename, df, **kwargs):
store = pd.HDFStore(filename)
store.put('mydata', df)
store.get_storer('mydata').attrs.metadata = kwargs
store.close()
def h5load(store):
data = store['mydata']
metadata = store.get_storer('mydata').attrs.metadata
return data, metadata
metadata = {'City':'Tel-Aviv'}
h5store(filename_to_save, my_df, **metadata)
with pd.HDFStore(filename) as store:
data, metadata = h5load(store)
Merge columns to a longer df
import pandas as pd
df = pd.read_csv('')
df = df.fillna('None')
df['A'] = df[['insight1','insight2','insight3','insight4']].values.tolist()
df['B'] = df[['insight_type1','insight_type2','insight_type3','insight_type4']].values.tolist()
df['C'] = df[['insight1_type_is_correct','insight2_type_is_correct','insight3_type_is_correct','insight4_type_is_correct']].values.tolist()
df = df.explode(['A','B','C'])
df = df.rename(columns={'A':'insight','B':'insight_type','C':'insight_type_is_correct'})
df = df.loc[df.insight != 'None']
df = df.loc[df.insight_type_is_correct != 'None']
df.insight_type_is_correct = df.insight_type_is_correct.apply(lambda x: 1 if x=='yes' else 0)
for i in range(len(df)):
print(f'\ninsight: {df.iloc[i].insight}\ninsight_type: {df.iloc[i].insight_type}\ncorrect?: {df.iloc[i].insight_type_is_correct}')
Split to train/test/val
import numpy as np
df = pd.read_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_for_appen_insight_classification.csv')
np.random.seed(0)
train_msk = np.random.rand(len(df)) < 0.9
train_df = df[train_msk]
test_val_df = df[~train_msk]
test_msk = np.random.rand(len(test_val_df)) < 0.5
test_df = test_val_df[test_msk]
val_df = test_val_df[~test_msk]
train_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_train.csv')
val_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_val.csv')
test_df.to_csv('/dccstor/yotamperlitz/statista/website_crawl/free/dataframes/statista100k_test.csv')