Pandas Pivot Table

The pivot_table() function in Pandas lets us create spreadsheet-style pivot tables, which makes it easier to group and analyze data.

Figure: Working of the pivot table operation in Pandas

Let's look at an example.

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77]}
df = pd.DataFrame(data)

print("Original DataFrame\n", df)
print()

# pivot the dataframe
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')

print("Reshaped DataFrame\n", pivot_df)

Output

Original DataFrame
          Date         City  Temperature
0  2023-01-01     New York           32
1  2023-01-01  Los Angeles           75
2  2023-01-02     New York           30
3  2023-01-02  Los Angeles           77

Reshaped DataFrame
 City        Los Angeles  New York
Date                             
2023-01-01           75        32
2023-01-02           77        30

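In this example, we reshaped the DataFrame with Date as the index, City as the columns, and Temperature as the values.

The resulting pivot_df has one row per date and one column per city, and each cell holds the temperature for that date and city. Because every Date/City pair occurs only once here, the default 'mean' aggregation leaves the values unchanged.

In this way, the pivot_table() operation reshapes the data into a layout that is easier to read and analyze further.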

pivot_table() Syntax

The syntax of pivot_table() in Pandas is:

df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, dropna=True)

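Here,

index: the column(s) to use as row labels
columns: the column(s) whose values become the new column labels
values: the column(s) that supply the values of the new DataFrame
aggfunc: the function used for aggregation, 'mean' by default
fill_value: the value used to replace missing values
dropna: whether to exclude columns whose entries are all NaN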
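
As a rough illustration of how these parameters work together, the sketch below uses made-up readings (not taken from the examples in this tutorial) in which one Date/City pair appears twice and another is missing entirely. The choice of aggfunc='max' and fill_value=0 is only for demonstration.

```python
import pandas as pd

# hypothetical readings: New York is measured twice on 2023-01-01,
# and Los Angeles has no reading on 2023-01-02
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
        'City': ['New York', 'New York', 'Los Angeles', 'New York'],
        'Temperature': [30, 34, 75, 31]}
df = pd.DataFrame(data)

# index -> row labels, columns -> column labels, values -> cell values
# aggfunc='max' keeps the highest reading when a Date/City pair repeats
# fill_value=0 fills the Date/City pairs that have no reading at all
pivot_df = df.pivot_table(values='Temperature', index='Date', columns='City',
                          aggfunc='max', fill_value=0)

print(pivot_df)
```

Here the 2023-01-01 cell for New York holds 34 (the larger of 30 and 34), and the 2023-01-02 cell for Los Angeles is filled with 0.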

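Example: pivot_table() With Multiple Values

If we omit the values parameter in pivot_table(), it uses all remaining columns (other than those given as index and columns) as the values of the pivot table. With the default 'mean' aggregation, those columns should be numeric.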

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77],
        'Humidity': [80, 10, 85, 5]}

df = pd.DataFrame(data)

print('Original DataFrame')
print(df)
print()

# pivot the dataframe
pivot_df = df.pivot_table(index='Date', columns='City')

print('Reshaped DataFrame')
print(pivot_df)

Output

Original DataFrame
         Date         City  Temperature  Humidity
0  2023-01-01     New York           32        80
1  2023-01-01  Los Angeles           75        10
2  2023-01-02     New York           30        85
3  2023-01-02  Los Angeles           77         5

Reshaped DataFrame
              Humidity          Temperature         
City       Los Angeles New York Los Angeles New York
Date                                                
2023-01-01          10       80          75       32
2023-01-02           5       85          77       30

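In this example, we created a pivot table for multiple values, namely Temperature and Humidity. The columns of the result have two levels: the value name on top and the city below it.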
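
A pivot table with several value columns can be sliced by its column levels. The following is a minimal sketch, assuming the same Temperature and Humidity data as above. It lists the value columns explicitly instead of relying on the default, and the names temperature_df and ny_humidity are introduced only for illustration.

```python
import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77],
        'Humidity': [80, 10, 85, 5]}
df = pd.DataFrame(data)

# name the value columns explicitly
pivot_df = df.pivot_table(index='Date', columns='City',
                          values=['Temperature', 'Humidity'])

# the column labels are (value name, City) pairs
print(pivot_df.columns.tolist())

# select every city for one value column
temperature_df = pivot_df['Temperature']
print(temperature_df)

# select a single (value name, City) column
ny_humidity = pivot_df[('Humidity', 'New York')]
print(ny_humidity)
```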

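pivot_table() With Aggregate Functions

We can use the aggfunc parameter to apply different aggregate functions in pivot_table(). The value of aggfunc can be set to functions such as 'sum', 'mean', 'count', 'max' or 'min'.

Let's look at an example.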

import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77, 33, 78],
        'Humidity': [80, 10, 85, 5, 81, 7]}

df = pd.DataFrame(data)

# calculate mean temperature for each city using pivot_table()
mean_temperature = df.pivot_table(index='City', values='Temperature', aggfunc='mean')

print(mean_temperature)

Output

             Temperature
City                    
Los Angeles    76.666667
New York       31.666667

In the above example, we calculated the mean temperature of each city by passing aggfunc='mean' to pivot_table().
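
aggfunc also accepts a list of functions or a dictionary that maps value columns to functions. The sketch below reuses the data from the example above. The names temp_range and summary are introduced only for illustration.

```python
import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, 75, 30, 77, 33, 78],
        'Humidity': [80, 10, 85, 5, 81, 7]}
df = pd.DataFrame(data)

# a list of functions gives one group of columns per function
temp_range = df.pivot_table(index='City', values='Temperature',
                            aggfunc=['min', 'max'])
print(temp_range)
print()

# a dictionary applies a different function to each value column
summary = df.pivot_table(index='City', values=['Temperature', 'Humidity'],
                         aggfunc={'Temperature': 'mean', 'Humidity': 'max'})
print(summary)
```

With a list, the function name becomes the top column level, so the lowest New York temperature is found under ('min', 'Temperature').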

Pivot Table With MultiIndex

We can use the pivot_table() function to create a pivot table with a MultiIndex.

Let's look at an example.

import pandas as pd

# create a dataframe
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles','Delhi', 'Chennai', 'Delhi', 'Chennai'],
        'Country': ['USA', 'USA', 'USA', 'USA', 'India', 'India', 'India', 'India'],
        'Temperature': [32, 75, 30, 77, 75, 80, 78, 79]}
df = pd.DataFrame(data)

print("Original DataFrame\n", df)
print()

# create a pivot table with multiindex
pivot_df = df.pivot_table(index=['Country', 'City'], columns='Date', values='Temperature')

print("Reshaped DataFrame\n", pivot_df)

Output

Original DataFrame
          Date         City Country  Temperature
0  2023-01-01     New York     USA           32
1  2023-01-01  Los Angeles     USA           75
2  2023-01-02     New York     USA           30
3  2023-01-02  Los Angeles     USA           77
4  2023-01-01        Delhi   India           75
5  2023-01-01      Chennai   India           80
6  2023-01-02        Delhi   India           78
7  2023-01-02      Chennai   India           79

Reshaped DataFrame
 Date                 2023-01-01  2023-01-02
Country City                               
India   Chennai              80          79
        Delhi                75          78
USA     Los Angeles          75          77
        New York             32          30

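In this example, we created a pivot table with a MultiIndex by passing a list of columns as the index parameter.

A MultiIndex has several index levels that are related hierarchically, like parent and child. Here, Country is the parent (outer) level and City is the child (inner) level.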
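
Rows of a pivot table with a MultiIndex can be selected by the outer level, by a full (Country, City) pair, or by the inner level alone. The following is a minimal sketch using the same data as above. The names india_df, delhi and chennai_df are introduced only for illustration.

```python
import pandas as pd

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'Delhi', 'Chennai', 'Delhi', 'Chennai'],
        'Country': ['USA', 'USA', 'USA', 'USA', 'India', 'India', 'India', 'India'],
        'Temperature': [32, 75, 30, 77, 75, 80, 78, 79]}
df = pd.DataFrame(data)

pivot_df = df.pivot_table(index=['Country', 'City'], columns='Date', values='Temperature')

# select all cities of one country through the outer level
india_df = pivot_df.loc['India']
print(india_df)
print()

# select a single row with a (Country, City) tuple
delhi = pivot_df.loc[('India', 'Delhi')]
print(delhi)
print()

# select by the inner level alone with xs()
chennai_df = pivot_df.xs('Chennai', level='City')
print(chennai_df)
```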

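Handling Missing Values With pivot_table()

Sometimes, missing values appear in a pivot table when we reshape data with pivot_table(). Such missing (NaN) values can be handled within the pivot_table() operation through the fill_value and dropna parameters.

The dropna parameter specifies whether to drop columns whose entries are all NaN. Its default value is True.

Let's look at an example.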

import pandas as pd
import numpy as np

# Creating the DataFrame
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles', 'Chicago'],
        'Temperature': [32, 75, 30, 77, np.nan, 76, np.nan]}
df = pd.DataFrame(data)

# create a pivot table
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')

print("\nDefault Pivot Table\n", pivot_df)

# create a pivot table with dropna=False
pivot_df_dropna = df.pivot_table(index='Date', columns='City', values='Temperature', dropna=False)

print("\nPivot Table with dropna=False:\n", pivot_df_dropna)

Output

Default Pivot Table
 City        Los Angeles  New York
Date                             
2023-01-01         75.0      32.0
2023-01-02         77.0      30.0
2023-01-03         76.0       NaN

Pivot Table with dropna=False:
 City        Chicago  Los Angeles  New York
Date                                      
2023-01-01      NaN         75.0      32.0
2023-01-02      NaN         77.0      30.0
2023-01-03      NaN         76.0       NaN

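In this example, we used the dropna parameter to control what happens to columns that consist entirely of NaN entries. By default, dropna is True, so the Chicago column is dropped automatically.

Note that the New York column is kept even though it contains one NaN value. This is because dropna only drops columns in which every entry is NaN.

The fill_value parameter, on the other hand, replaces every NaN value in the pivot table with the specified value. For example,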

import pandas as pd
import numpy as np

# Creating the DataFrame
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, np.nan, 30, 77, np.nan, 76]}
df = pd.DataFrame(data)

# create a pivot table
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')

print("\nDefault Pivot Table\n", pivot_df)

# create a pivot table with fill_value=0
pivot_df_filled = df.pivot_table(index='Date', columns='City', values='Temperature', fill_value=0)

print("\nPivot Table with fill_value=0:\n", pivot_df_filled)

Output

Default Pivot Table
 City        Los Angeles  New York
Date                             
2023-01-01          NaN      32.0
2023-01-02         77.0      30.0
2023-01-03         76.0       NaN

Pivot Table with fill_value=0:
 City        Los Angeles  New York
Date                             
2023-01-01            0        32
2023-01-02           77        30
2023-01-03           76         0

In this example, we used the fill_value=0 argument to replace the NaN values with 0.
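
For comparison, the missing cells can also be filled after pivoting by calling fillna() on the result. The sketch below, which assumes the same data as above, suggests that both routes give the same values here. The names filled_in_pivot and filled_afterwards are introduced only for illustration.

```python
import pandas as pd
import numpy as np

data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03', '2023-01-03'],
        'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
        'Temperature': [32, np.nan, 30, 77, np.nan, 76]}
df = pd.DataFrame(data)

# fill the missing cells as part of the pivot_table() call
filled_in_pivot = df.pivot_table(index='Date', columns='City',
                                 values='Temperature', fill_value=0)

# fill the missing cells afterwards with fillna()
filled_afterwards = df.pivot_table(index='Date', columns='City',
                                   values='Temperature').fillna(0)

# compare the two results cell by cell
print((filled_in_pivot == filled_afterwards).all().all())
```

The comparison looks only at the values. Whether they are displayed as integers or floats may differ between the two results and between pandas versions.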

pivot() vs pivot_table()

The pivot() and pivot_table() functions perform similar operations, but there are some key differences between them.

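Basis	pivot()	pivot_table()
Aggregation	Does not aggregate data.	Aggregates data (sum, mean, count, etc.).
Duplicate entries	Raises an error when an index/column pair occurs more than once.	Handles repeated index/column pairs by aggregating them.
MultiIndex	Older pandas versions accept only a single index column; newer versions also accept a list.	Accepts multiple index columns for more complex data.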
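
The difference in how duplicates are handled can be sketched as follows. The data is made up for illustration: New York has two readings on the same date, so the Date/City pair is no longer unique. The flag pivot_failed is a hypothetical helper name.

```python
import pandas as pd

# hypothetical data: New York is measured twice on 2023-01-01
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-01'],
        'City': ['New York', 'New York', 'Los Angeles'],
        'Temperature': [30, 34, 75]}
df = pd.DataFrame(data)

# pivot() has no way to combine the two New York readings
try:
    df.pivot(index='Date', columns='City', values='Temperature')
    pivot_failed = False
except ValueError as error:
    pivot_failed = True
    print("pivot() failed:", error)

# pivot_table() aggregates the duplicates, using the mean by default
pivot_df = df.pivot_table(index='Date', columns='City', values='Temperature')
print(pivot_df)
```

With pivot_table(), the New York cell holds 32, the mean of the two readings 30 and 34.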
