pyspark.sql.functions.hll_union_agg
pyspark.sql.functions.hll_union_agg(col: ColumnOrName, allowDifferentLgConfigK: Union[bool, pyspark.sql.column.Column, None] = None) → pyspark.sql.column.Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch, generated by merging previously created Datasketches HllSketch instances via a Datasketches Union instance. Throws an exception if sketches have different lgConfigK values and allowDifferentLgConfigK is unset or set to false.

New in version 3.5.0.
Parameters
    col : Column or str
    allowDifferentLgConfigK : bool or Column, optional
        Allow sketches with different lgConfigK values to be merged (defaults to false).
Returns
    Column
        The binary representation of the merged HllSketch.
Examples
>>> from pyspark.sql.functions import (
...     col, hll_sketch_agg, hll_sketch_estimate, hll_union_agg, lit
... )
>>> df1 = spark.createDataFrame([1,2,2,3], "INT")
>>> df1 = df1.agg(hll_sketch_agg("value").alias("sketch"))
>>> df2 = spark.createDataFrame([4,5,5,6], "INT")
>>> df2 = df2.agg(hll_sketch_agg("value").alias("sketch"))
>>> df3 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch")
... ).alias("distinct_cnt"))
>>> df3.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df4 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg("sketch", lit(False))
... ).alias("distinct_cnt"))
>>> df4.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+
>>> df5 = df1.union(df2).agg(hll_sketch_estimate(
...     hll_union_agg(col("sketch"), lit(False))
... ).alias("distinct_cnt"))
>>> df5.drop("sketch").show()
+------------+
|distinct_cnt|
+------------+
|           6|
+------------+