Introduction to Python Pandas

Part Six

Sorting Data and Frequency Analysis

In [1]:
import pandas as pd
df = pd.read_csv('../datasets/bikes.csv')
df.head(3)
Out[1]:
trip_id usertype gender starttime stoptime tripduration from_station_name latitude_start longitude_start dpcapacity_start to_station_name latitude_end longitude_end dpcapacity_end temperature visibility wind_speed precipitation events
0 7147 Subscriber Male 2013-06-28 19:01:00 2013-06-28 19:17:00 993 Lake Shore Dr & Monroe St 41.881050 -87.616970 11.0 Michigan Ave & Oak St 41.90096 -87.623777 15.0 73.9 10.0 12.7 -9999.0 mostlycloudy
1 7524 Subscriber Male 2013-06-28 22:53:00 2013-06-28 23:03:00 623 Clinton St & Washington Blvd 41.883380 -87.641170 31.0 Wells St & Walton St 41.89993 -87.634430 19.0 69.1 10.0 6.9 -9999.0 partlycloudy
2 10927 Subscriber Male 2013-06-30 14:43:00 2013-06-30 15:01:00 1040 Sheffield Ave & Kingsbury St 41.909592 -87.653497 15.0 Dearborn St & Monroe St 41.88132 -87.629521 23.0 73.0 10.0 16.1 -9999.0 mostlycloudy
In [2]:
df = df.sort_values(by="tripduration", ascending=False)
df.head(5)
Out[2]:
trip_id usertype gender starttime stoptime tripduration from_station_name latitude_start longitude_start dpcapacity_start to_station_name latitude_end longitude_end dpcapacity_end temperature visibility wind_speed precipitation events
20998 8385656 Subscriber Male 2015-12-01 19:02:00 2015-12-02 18:59:00 86188 Sheffield Ave & Kingsbury St 41.909592 -87.653497 15.0 Dearborn St & Erie St 41.893992 -87.629318 23.0 39.0 10.0 8.1 -9999.00 cloudy
29440 11229184 Subscriber Female 2016-08-09 16:56:38 2016-08-10 16:40:40 85442 State St & 33rd St 41.834734 -87.625813 11.0 Fort Dearborn Dr & 31st St 41.838556 -87.608218 15.0 87.1 10.0 5.8 -9999.00 partlycloudy
21305 8474696 Subscriber Male 2015-12-14 15:31:00 2015-12-15 14:57:00 84353 Broadway & Sheridan Rd 41.952833 -87.649993 15.0 Broadway & Wilson Ave 41.965485 -87.657238 19.0 48.0 10.0 12.7 0.00 cloudy
2712 1416670 Subscriber Female 2014-04-15 15:56:00 2014-04-16 14:09:00 79988 Clinton St & Washington Blvd 41.883380 -87.641170 31.0 May St & Taylor St 41.869482 -87.655486 15.0 35.1 10.0 6.9 -9999.00 cloudy
21300 8474002 Subscriber Male 2015-12-14 13:10:00 2015-12-15 10:32:00 76910 Kedzie Ave & Lake St 41.884603 -87.706304 19.0 California Ave & Lake St 41.884454 -87.696298 15.0 48.0 10.0 13.8 0.01 cloudy
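
The row labels in the output above (20998, 29440, ...) are the original index labels, which sort_values keeps attached to their rows. The sketch below uses a small, made-up DataFrame (the `trips` data is hypothetical, not taken from bikes.csv) to show sorting by more than one column and renumbering the rows afterwards:

```python
import pandas as pd

# Hypothetical toy data standing in for two columns of bikes.csv
trips = pd.DataFrame({
    "usertype": ["Subscriber", "Customer", "Subscriber", "Customer"],
    "tripduration": [993, 623, 1040, 623],
})

# Sort by user type (A to Z), then by duration (longest first) within each type
ordered = trips.sort_values(by=["usertype", "tripduration"],
                            ascending=[True, False])

# sort_values keeps the original row labels; renumber them from 0
ordered = ordered.reset_index(drop=True)
print(ordered)
```

Passing a list to `ascending` sets the direction for each sort key separately; a single `True` or `False` applies to all of them.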

Frequency Analysis

Using the value_counts method, we can check how many times each value occurs in a series. The following example counts how many women and how many men are in the dataset:

In [3]:
frekv = df['gender'].value_counts()
frekv
Out[3]:
Male      37654
Female    12435
Name: gender, dtype: int64
In [4]:
frekv.index # returns the observed values (as an Index)
Out[4]:
Index(['Male', 'Female'], dtype='object')
In [5]:
frekv.values # returns the frequencies for the values in the Index above
Out[5]:
array([37654, 12435])
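
By default value_counts skips missing values and returns absolute counts. As a minimal sketch on a made-up series (the `gender` data here is hypothetical), the `normalize` and `dropna` parameters change that behaviour:

```python
import pandas as pd

# Hypothetical toy series with one missing value
gender = pd.Series(["Male", "Female", "Male", None, "Male"])

counts = gender.value_counts()                    # absolute counts, missing values skipped
shares = gender.value_counts(normalize=True)      # relative frequencies that sum to 1
with_missing = gender.value_counts(dropna=False)  # missing values counted as their own entry

print(counts)
print(shares)
print(with_missing)
```

Relative frequencies are often easier to compare across datasets of different sizes than raw counts.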

Example: Frequency of weather conditions

In [6]:
frekv = df['events'].value_counts()
frekv
Out[6]:
partlycloudy    16998
mostlycloudy    15096
cloudy          12075
clear            2818
rain             1828
snow              466
hazy              348
tstorms           318
fog               122
sleet              16
unknown             4
Name: events, dtype: int64

Gender structure of the group shown as a pie chart

In [7]:
import matplotlib.pyplot as plt

frekv = df['gender'].value_counts()
plt.figure(figsize=(6,6))
plt.pie(frekv.values, labels=frekv.index)
plt.title("Gender structure of the group")
plt.show()
plt.close()
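
A pie chart is usually easier to read when each slice also shows its share. The sketch below assumes made-up counts (the `frekv` values are hypothetical) and uses the `autopct` parameter of pie to print percentages; the Agg backend line is only there so the example runs without a display and can be dropped inside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for df['gender'].value_counts()
frekv = pd.Series({"Male": 3, "Female": 1})

fig, ax = plt.subplots(figsize=(6, 6))
# With autopct set, pie returns the wedges, the label texts and the percentage texts
wedges, texts, autotexts = ax.pie(frekv.values, labels=frekv.index,
                                  autopct="%1.1f%%")
ax.set_title("Gender structure of the group")

percent_labels = [t.get_text() for t in autotexts]
print(percent_labels)
plt.close(fig)
```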