20、Hive 实战 - Hive的函数之窗口函数

1. 窗口函数概述

1.1. 什么是窗口函数

  • 简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。
  • 开窗函数一般就是说的是over()函数,其窗口是由一个 OVER 子句 定义的多行记录
  • 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

1.2. 窗口函数的功能

<窗口函数> over (partition by <用于分组的列名> order by <用于排序的列名> [rows between ?? and ???])

  • 同时具有分组和排序的功能
  • 不减少原表的行数

1.3. 窗口函数的种类和位置

  • 专用窗口函数,包括rank, dense_rank, row_number等专用窗口函数
  • 聚合函数,如sum(). avg(), count(), max(), min()等,rows between…and…
  • 因为窗口函数是对where或者group by子句处理后的结果进行操作,所以窗口函数原则上只能写在select子句中

1.4. 窗口函数说明

  • over():是窗口函数的核心,需要开窗的全部都需要使用这个函数
  • partition by:用来对表分组,后续跟上分组的字段,会给该字段中每个不同的值开一个窗(可以添加多个字段)
  • order by:是对分组后的结果进行排序,默认是按照升序(asc)排列,可以进行多个字段降序
  • rows between ?? and ???:用来对窗口中已经排序完成后的上下数据进行筛选

2. 聚合窗口函数

聚合窗口函数就是在通过over开窗的情况下,使用sum、max等聚合函数对开窗之后的结果进行聚合汇总;跟普通的聚合函数不同,普通聚合函数只会生成一条结果数据,聚合窗口函数会在每一行中都新增一列,用了存储这聚合结果,这个窗口有多少条数据,那就会有多少条一模一样的结果数据。

原始数据如下表所示(只列举出了部分数据):

user_id cat_id score
618 488 0.0320793066566
7325 5648 0.1794971885383
7325 5653 0.0195317941826
9549 472 0.2734451185567
9549 316 0.0102923143349
10187 560 0.09287464288
16093 299 0.8203353556701
19109 450 0.0129389094682
29043 5580 0.0780796261011
29043 360 0.0445708866509
30207 301 0.1283038644617
37469 318 0.0672950094793
42121 364 0.0285768131938
45021 476 0.0290241345941
48161 308 1.3997567045733
48161 530 0.0280696255429
53008 510 0.0292976912317
53568 302 0.6823961209357
58397 300 0.0356667545944

SQL编写和查询结果展示:

select
    user_id
    , third_cat_id as cat_id
    , score
    , max(score) over(partition by user_id) max_score_of_user
    , min(score) over(partition by user_id) min_score_of_user
    , count(score) over(partition by user_id) cnt_score_of_user
    , avg(score) over(partition by user_id) avg_score_of_user
    , sum(score) over(partition by user_id) sum_score_of_user
from yishou_recommendation_system.user_third_category_score_15day
user_id cat_id score max_score_of_user min_score_of_user cnt_score_of_user avg_score_of_user sum_score_of_user
1697 325 0.24999999928 0.24999999928 0.24999999928 1 0.24999999928 0.24999999928
13638 401 0.1924408114735 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 318 0.8433564379566 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 295 0.8473076857954 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 299 0.3413452337674 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 556 0.4636178949445 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 472 0.0769230769231 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 314 0.1531243564897 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 363 0.0153345765378 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 431 0.0928746430137 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 303 0.1512020592657 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 316 0.1252590167457 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 490 0.0153345765378 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 301 0.4550003444724 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 488 0.4961183604825 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 300 0.2698680479978 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 361 0.0769230769231 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 493 0.0153345765378 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 521 0.1332110723867 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601
13638 561 0.0769230769231 2.3144764447547 0.0153345765378 29 0.30592698646857247 8.871882607588601

3. 排序窗口函数

3.1. 不同排序窗口函数的区别

  • rank():如果有并列名次的行,会占用下一名次的位置,排序结果是 1,1,3
  • dense_rank():如果有并列名次的行,不占用下一名次的位置,排序结果是 1,1,2
  • row_number():不考虑并列名次的情况,排序结果是 1,2,3

3.2. SQL编写和查询结果展示

select
    user_id
    , third_cat_id as cat_id
    , score
    , rank() over(partition by user_id order by score) score_rank_of_user
    , dense_rank() over(partition by user_id order by score) score_dense_rank_of_user
    , row_number() over(partition by user_id order by score) score_row_number_of_user
from yishou_recommendation_system.user_third_category_score_15day
user_id cat_id score score_rank_of_user score_dense_rank_of_user score_row_number_of_user
1697 325 0.24999999928 1 1 1
13638 493 0.0153345765378 1 1 1
13638 363 0.0153345765378 1 1 2
13638 5657 0.0153345765378 1 1 3
13638 490 0.0153345765378 1 1 4
13638 325 0.0153345765378 1 1 5
13638 340 0.0379776680745 6 2 6
13638 472 0.0769230769231 7 3 7
13638 361 0.0769230769231 7 3 8
13638 561 0.0769230769231 7 3 9
13638 562 0.0769230769231 7 3 10
13638 510 0.0928746430137 11 4 11
13638 431 0.0928746430137 11 4 12
13638 316 0.1252590167457 13 5 13
13638 521 0.1332110723867 14 6 14
13638 394 0.1401607366123 15 7 15
13638 303 0.1512020592657 16 8 16
13638 314 0.1531243564897 17 9 17
13638 401 0.1924408114735 18 10 18
13638 300 0.2698680479978 19 11 19
13638 299 0.3413452337674 20 12 20
13638 304 0.4085346499663 21 13 21
13638 308 0.4296759077854 22 14 22
13638 301 0.4550003444724 23 15 23
13638 556 0.4636178949445 24 16 24
13638 488 0.4961183604825 25 17 25
13638 294 0.4990914022087 26 18 26
13638 318 0.8433564379566 27 19 27
13638 295 0.8473076857954 28 20 28
13638 302 2.3144764447547 29 21 29

4. 排序截取数据函数

4.1. 排序截取数据函数的区别

  • lag(col,n,default_val):获取往前第n行数据,col是列名,n是往上的行数,当第n行为null的时候取default_val
  • lead(col,n, default_val):往后第n行数据,col是列名,n是往下的行数,当第n行为null的时候取default_val
  • ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
  • cume_dist(),计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。

4.2. 原始数据

user_id record_date
11343325 2022-09-04 06:17:24
11343325 2022-09-04 06:17:24
11343326 2022-09-04 06:17:25
11343326 2022-09-04 06:17:25
11343327 2022-09-04 06:17:26
11343327 2022-09-04 06:17:26
11343326 2022-09-04 06:17:27
11343328 2022-09-04 06:17:27
11343328 2022-09-04 06:17:27
11343329 2022-09-04 06:17:28
11343329 2022-09-04 06:17:28
4654923 2022-09-04 06:17:30
11343330 2022-09-04 06:17:32
11343330 2022-09-04 06:17:32
11343331 2022-09-04 06:17:35
11343331 2022-09-04 06:17:35
11343332 2022-09-04 06:17:37
11343332 2022-09-04 06:17:37
11343333 2022-09-04 06:17:37
11343333 2022-09-04 06:17:37
11343334 2022-09-04 06:17:40
11343334 2022-09-04 06:17:40

4.3. lag和lead函数的使用和结果

select
    user_id
    , record_date
    , lag(record_date, 1, '1970-01-01 00:00:00') over(partition by user_id order by record_date) as start_date
    , lead(record_date, 1, '9999-99-99 00:00:00') over(partition by user_id order by record_date) as end_date
from user_table

使用user_id分组,并使用record_date进行升序排序;lag函数会获取之前一行的record_date数据作为start_date字段的值,并回使用 '1970-01-01 00:00:00' 作为默认值;lead函数会获取之后一行的record_date数据作为end_date字段的值,并回使用 '9999-99-99 00:00:00' 作为默认值;结果如下所示:

user_id record_date start_date end_date
1000045 2022-09-03 11:41:50 1970-01-01 00:00:00 9999-99-99 00:00:00
1000061 2022-09-03 15:22:21 1970-01-01 00:00:00 9999-99-99 00:00:00
100013 2022-09-03 08:14:36 1970-01-01 00:00:00 2022-09-03 14:47:43
100013 2022-09-03 14:47:43 2022-09-03 08:14:36 2022-09-03 14:48:25
100013 2022-09-03 14:48:25 2022-09-03 14:47:43 2022-09-03 14:49:11
100013 2022-09-03 14:49:11 2022-09-03 14:48:25 2022-09-03 15:08:41
100013 2022-09-03 15:08:41 2022-09-03 14:49:11 2022-09-03 15:09:27
100013 2022-09-03 15:09:27 2022-09-03 15:08:41 9999-99-99 00:00:00

具体赋值如下图所示:

 

5. 排序限制分区函数

5.1. 各排序限制分区函数说明

  • rows必须跟在Order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量;
  • OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化;
  • current row:当前行;
  • n preceding:往前n行数据;
  • n following:往后n行数据;
  • unbounded preceding:该窗口数据的起点;
  • unbounded following:该窗口数据的终点;

5.2. SQL编写和结果展示

select
    user_id
    , third_cat_id as cat_id
    , score
    
    -- 以user_id分组,按照score进行升序排序
    , row_number() over(partition by user_id order by score) as row_number_id
    
    -- 以user_id分组,按照score进行升序排序,求该窗口中第一行到当前行的和
    , sum(score) over(partition by user_id order by score rows between unbounded preceding and current row) sum_1
    
    -- 以user_id分组,按照score进行升序排序,求该窗口中之前一行到当前行的和
    , sum(score) over(partition by user_id order by score rows between 1 preceding and current row) sum_2
    
    -- 以user_id分组,按照score进行升序排序,求该窗口中当前行到之后一行的和
    , sum(score) over(partition by user_id order by score rows between current row and 1 following) sum_3
    
    -- 以user_id分组,按照score进行升序排序,求该窗口中当前行到最后一行的和
    , sum(score) over(partition by user_id order by score rows between current row and unbounded following) sum_3
    
from yishou_recommendation_system.user_third_category_score_15day
user_id cat_id score row_number_id sum_1 sum_2 sum_3 sum_3
1697 325 0.24999999928 1 0.24999999928 0.24999999928 0.24999999928 0.24999999928
13638 490 0.0153345765378 1 0.0153345765378 0.0153345765378 0.0306691530756 8.8718826075886
13638 493 0.0153345765378 2 0.0306691530756 0.0306691530756 0.0306691530756 8.8565480310508
13638 363 0.0153345765378 3 0.0460037296134 0.0306691530756 0.0306691530756 8.841213454513
13638 325 0.0153345765378 4 0.0613383061512 0.0306691530756 0.0306691530756 8.8258788779752
13638 5657 0.0153345765378 5 0.076672882689 0.0306691530756 0.053312244612300004 8.8105443014374
13638 340 0.0379776680745 6 0.1146505507635 0.053312244612300004 0.1149007449976 8.7952097248996
13638 561 0.0769230769231 7 0.19157362768660002 0.1149007449976 0.1538461538462 8.7572320568251
13638 472 0.0769230769231 8 0.26849670460970004 0.1538461538462 0.1538461538462 8.680308979902
13638 562 0.0769230769231 9 0.34541978153280006 0.1538461538462 0.1538461538462 8.6033859029789
13638 361 0.0769230769231 10 0.4223428584559001 0.1538461538462 0.1697977199368 8.5264628260558
13638 431 0.0928746430137 11 0.5152175014696001 0.1697977199368 0.1857492860274 8.449539749132699
13638 510 0.0928746430137 12 0.6080921444833001 0.1857492860274 0.21813365975940002 8.356665106119
13638 316 0.1252590167457 13 0.733351161229 0.21813365975940002 0.2584700891324 8.2637904631053
13638 521 0.1332110723867 14 0.8665622336157 0.2584700891324 0.273371808999 8.1385314463596
13638 394 0.1401607366123 15 1.006722970228 0.273371808999 0.29136279587799996 8.0053203739729
13638 303 0.1512020592657 16 1.1579250294937 0.29136279587799996 0.3043264157554 7.8651596373606
13638 314 0.1531243564897 17 1.3110493859834 0.3043264157554 0.3455651679632 7.713957578094901
13638 401 0.1924408114735 18 1.5034901974569 0.3455651679632 0.4623088594713 7.5608332216052005
13638 300 0.2698680479978 19 1.7733582454546999 0.4623088594713 0.6112132817652001 7.368392410131699
13638 299 0.3413452337674 20 2.1147034792221 0.6112132817652001 0.7498798837337 7.098524362133899
13638 304 0.4085346499663 21 2.5232381291883996 0.7498798837337 0.8382105577517001 6.7571791283665
13638 308 0.4296759077854 22 2.9529140369738 0.8382105577517001 0.8846762522578 6.3486444784002
13638 301 0.4550003444724 23 3.4079143814462 0.8846762522578 0.9186182394169 5.918968570614799
13638 556 0.4636178949445 24 3.8715322763907 0.9186182394169 0.959736255427 5.4639682261424
13638 488 0.4961183604825 25 4.3676506368732 0.959736255427 0.9952097626912 5.0003503311979
13638 294 0.4990914022087 26 4.8667420390819 0.9952097626912 1.3424478401653 4.5042319707154
13638 318 0.8433564379566 27 5.7100984770385 1.3424478401653 1.690664123752 4.0051405685067
13638 295 0.8473076857954 28 6.5574061628338995 1.690664123752 3.1617841305501 3.1617841305501
13638 302 2.3144764447547 29 8.8718826075886 3.1617841305501 2.3144764447547 2.3144764447547
16530 314 0.0833333330933 1 0.0833333330933 0.0833333330933 0.1666666661866 1.6825354114854
16530 318 0.0833333330933 2 0.1666666661866 0.1666666661866 0.23571003949270003 1.5992020783921002
16530 493 0.1523767063994 3 0.319043372586 0.23571003949270003 0.3047534127988 1.5158687452988
16530 303 0.1523767063994 4 0.47142007898540006 0.3047534127988 0.3381259921594 1.3634920388994
16530 325 0.18574928576 5 0.6571693647454 0.3381259921594 0.37149857152 1.2111153325
16530 489 0.18574928576 6 0.8429186505054 0.37149857152 1.02536604674 1.02536604674
16530 324 0.83961676098 7 1.6825354114854 1.02536604674 0.83961676098 0.83961676098