An aggregate function collapses many rows into a single row.
A window function, by contrast, computes a value over a set of related rows for every input row, so each row is kept and annotated rather than collapsed.
Aggregate window functions: an aggregate function followed by OVER. The OVER clause typically takes a PARTITION BY sub-clause; adding an ORDER BY turns the aggregate into a running (cumulative) aggregate instead of a whole-partition one.
Ranking window functions: a ranking function followed by OVER. The OVER clause must contain an ORDER BY sub-clause, optionally preceded by PARTITION BY, i.e. over(partition by ... order by ...); a PARTITION BY alone is not allowed.
```scala
// import
import org.apache.spark.sql.SparkSession

// case class describing one score record
case class Score(name: String, clazz: Int, score: Int)

// 1. create the SparkSession
val spark = SparkSession.builder().master("local[*]").appName("sql").getOrCreate()
// 2. get the SparkContext from the SparkSession
val sc = spark.sparkContext
// 3. import the implicit conversions (needed for .toDF)
import spark.implicits._
// 4. create an RDD with the sample data and convert it to a DataFrame
val scoreDF = sc.makeRDD(Array(
  Score("a1", 1, 80),
  Score("a2", 1, 78),
  Score("a3", 1, 95),
  Score("a4", 2, 74),
  Score("a5", 2, 92),
  Score("a6", 3, 99),
  Score("a7", 3, 99),
  Score("a8", 3, 45),
  Score("a9", 3, 55),
  Score("a10", 3, 78),
  Score("a11", 3, 100))).toDF("name", "clazz", "score")
// 5. register a temporary view
scoreDF.createOrReplaceTempView("scores")
// 6. display the data
scoreDF.show()
```
```scala
// whole-table count: every row is annotated with the total row count
spark.sql("select * , count(name) over() name_count from scores").show()
// grouped by clazz: every row is annotated with its partition's row count
spark.sql("select * , count(name) over(partition by clazz) name_count from scores").show()
```
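To make the "keep every row" behavior concrete, here is a minimal plain-Scala sketch (no Spark required, sample data reduced to three hypothetical rows) of what count(name) over(partition by clazz) computes: each input row survives and is annotated with the size of its clazz partition, unlike a GROUP BY aggregate, which would return one row per clazz.

```scala
// a tiny stand-in for the scores table
case class Row(name: String, clazz: Int, score: Int)
val rows = List(Row("a1", 1, 80), Row("a2", 1, 78), Row("a4", 2, 74))

// size of each clazz partition
val counts: Map[Int, Int] = rows.groupBy(_.clazz).map { case (k, v) => k -> v.size }

// one output row per input row, annotated with the partition count
val annotated = rows.map(r => (r.name, r.clazz, r.score, counts(r.clazz)))
// annotated == List(("a1",1,80,2), ("a2",1,78,2), ("a4",2,74,1))
```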
```scala
// row_number: consecutive ranking over the whole table
spark.sql("select * , row_number() over(order by score) rank from scores").show()
// partition first, then number the rows within each partition
spark.sql("select * , row_number() over(partition by clazz order by score) rank from scores").show()
// rank: ties share a rank and the next rank is skipped ("jump" ranking)
spark.sql("select * , rank() over(order by score) rank from scores").show()
// partition first, then jump-rank within each partition
spark.sql("select * , rank() over(partition by clazz order by score) rank from scores").show()
// dense_rank: ties share a rank with no gaps ("continuous" ranking)
spark.sql("select * , dense_rank() over( order by score) rank from scores").show()
// partition first, then dense-rank within each partition
spark.sql("select * , dense_rank() over(partition by clazz order by score) rank from scores").show()
// ntile(6): split the ordered rows into 6 roughly equal buckets
spark.sql("select * , ntile(6) over( order by score) rank from scores").show()
```
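The difference between the three ranking functions is easiest to see on a small ordered list with a tie. The following plain-Scala sketch (no Spark needed; the five-score list is a made-up example) reproduces their semantics over a single ordered partition:

```scala
// one ordered partition, ascending, with a tie at 78
val scores = List(45, 55, 78, 78, 99).sorted

// row_number: consecutive positions 1..n; ties are broken arbitrarily
val rowNumber = scores.zipWithIndex.map { case (s, i) => (s, i + 1) }

// rank: tied values share the rank of their first occurrence,
// so the rank after a tie is skipped ("jump" ranking)
val rank = scores.map(s => (s, scores.indexOf(s) + 1))

// dense_rank: tied values share a rank and no ranks are skipped
// ("continuous" ranking)
val distinct = scores.distinct
val denseRank = scores.map(s => (s, distinct.indexOf(s) + 1))
```

Here rank produces 1, 2, 3, 3, 5 (rank 4 is skipped after the tie) while dense_rank produces 1, 2, 3, 3, 4, which is exactly the "jump" versus "continuous" distinction the queries above demonstrate.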
Reprinted from: http://jokzi.baihongyu.com/