The pandas groupby function

pandas is a Python package built for data analysis. I personally recommend the e-book Python for Data Analysis.

import pandas as pd

ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
                     'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
            'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
            'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
            'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)

print(df)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
2   Devils     2  2014     863
3   Devils     3  2015     673
4    Kings     3  2014     741
5    kings     4  2015     812
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
9   Royals     4  2014     701
10  Royals     1  2015     804
11  Riders     2  2017     690
print(df.groupby('Team'))
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x10be6c470>
print(df.groupby('Team').groups)  # show which row labels fall into each group
{'Devils': Int64Index([2, 3], dtype='int64'), 'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
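Note that 'Kings' and 'kings' come out as two separate groups, because group keys are compared case-sensitively. If the lowercase spelling is just a typo (an assumption; the data does not say), a minimal sketch of grouping on a normalized key could look like this. The small frame and the name team_key are made up for illustration.

```python
import pandas as pd

# Small illustrative frame, not the full ipl_data table
df = pd.DataFrame({
    'Team': ['Kings', 'kings', 'Kings', 'Riders'],
    'Points': [741, 812, 756, 876],
})

# Group on a title-cased copy of the key; df['Team'] itself is left unchanged
team_key = df['Team'].str.title()
group_names = sorted(df.groupby(team_key).groups)
print(group_names)
```

Passing a Series as the key keeps the original column intact, which is handy when the raw spelling still matters elsewhere.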
grouped = df.groupby('Team')
# iterate over the groups; each item is a (group name, sub-DataFrame) pair
for name_team, group in grouped:
    print(name_team)
    print(group)
Devils
     Team  Rank  Year  Points
2  Devils     2  2014     863
3  Devils     3  2015     673
Kings
    Team  Rank  Year  Points
4  Kings     3  2014     741
6  Kings     1  2016     756
7  Kings     1  2017     788
Riders
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
8   Riders     2  2016     694
11  Riders     2  2017     690
Royals
      Team  Rank  Year  Points
9   Royals     4  2014     701
10  Royals     1  2015     804
kings
    Team  Rank  Year  Points
5  kings     4  2015     812
print(grouped.get_group('Kings'))
    Team  Rank  Year  Points
4  Kings     3  2014     741
6  Kings     1  2016     756
7  Kings     1  2017     788
import numpy as np
print(grouped['Points'].agg(['mean', 'sum']))  # string names give the same result as np.mean / np.sum
              mean   sum
Team                    
Devils  768.000000  1536
Kings   761.666667  2285
Riders  762.250000  3049
Royals  752.500000  1505
kings   812.000000   812
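If the result columns need more descriptive names, agg also accepts keyword arguments (named aggregation). A minimal sketch on a small subset of the data; the column names mean_points and total_points are made up for illustration.

```python
import pandas as pd

# Subset of the table above: only the Devils and Royals rows
df = pd.DataFrame({
    'Team': ['Devils', 'Devils', 'Royals', 'Royals'],
    'Points': [863, 673, 701, 804],
})

# Each keyword becomes an output column; the value names the aggregation
summary = df.groupby('Team')['Points'].agg(mean_points='mean', total_points='sum')
print(summary)
```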
# show the size of each group
print(grouped.agg(np.size))
        Rank  Year  Points
Team                      
Devils     2     2       2
Kings      3     3       3
Riders     4     4       4
Royals     2     2       2
kings      1     1       1
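The table above repeats the same count in every column. When only one number per group is needed, the size() method of the groupby object returns a single Series. A short sketch on a small illustrative frame:

```python
import pandas as pd

# Small illustrative frame, not the full ipl_data table
df = pd.DataFrame({
    'Team': ['Devils', 'Devils', 'Kings', 'Kings', 'Kings', 'kings'],
    'Points': [863, 673, 741, 756, 788, 812],
})

# size() counts the rows in each group and returns one value per group
sizes = df.groupby('Team').size()
print(sizes)
```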
# keep only the teams that have at least 3 rows
# (named `filtered` so the built-in filter() is not shadowed)
filtered = df.groupby('Team').filter(lambda x: len(x) >= 3)
print(filtered)
      Team  Rank  Year  Points
0   Riders     1  2014     876
1   Riders     2  2015     789
4    Kings     3  2014     741
6    Kings     1  2016     756
7    Kings     1  2017     788
8   Riders     2  2016     694
11  Riders     2  2017     690
# Summary
# df.groupby splits an object into groups and then applies a function to each group: aggregation, transformation, filtering, and so on.
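Aggregation and filtering are shown above, but transformation is not. Here is a minimal sketch of transform on a small subset of the data; the column name Diff is made up for illustration. Unlike agg, transform returns a result with the same index as the original rows, so it can be assigned straight back as a column.

```python
import pandas as pd

# Subset of the table above: only the Devils and Royals rows
df = pd.DataFrame({
    'Team': ['Devils', 'Devils', 'Royals', 'Royals'],
    'Points': [863, 673, 701, 804],
})

# Broadcast each team's mean back onto every row of that team
team_mean = df.groupby('Team')['Points'].transform('mean')

# Difference between each row's points and its own team's mean
df['Diff'] = df['Points'] - team_mean
print(df)
```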