{ "cells": [ { "cell_type": "markdown", "id": "12d609f1-5216-4a08-b205-1d8071e30ea6", "metadata": {}, "source": [ "# Aggregation\n", "\n", "所谓聚合计算就是将数据按照某种方式聚合成组, 在每一组内部进行某种计算, 然后再根据组进行汇总. 例如: 把订单数据按照日期汇总, 计算每天的总销售额. 简单来说就是 SQL 中的 GROUP BY.\n", "\n", "在 Spark 中聚合计算的语法非常灵活以及强大." ] }, { "cell_type": "code", "execution_count": 1, "id": "f7460206-7d3e-4874-828b-5d66c77b4de8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Current directory: /home/jovyan/docs/source/02-Aggregation\n", "Current Python version: 3.10.5\n", "Current Python interpreter: /opt/conda/bin/python\n", "Current site-packages: ['/opt/conda/lib/python3.10/site-packages']\n" ] } ], "source": [ "import os\n", "import sys\n", "import site\n", "\n", "cwd = os.getcwd()\n", "print(f\"Current directory: {cwd}\")\n", "print(f\"Current Python version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}\")\n", "print(f\"Current Python interpreter: {sys.executable}\")\n", "print(f\"Current site-packages: {site.getsitepackages()}\")\n", "\n", "sys.path.append(os.path.join(cwd, \"site-packages\"))" ] }, { "cell_type": "code", "execution_count": 38, "id": "9b8aa5fd-2d40-4cf9-85d8-c4df81ac4a0f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
SparkSession - in-memory
\n", " \n", "SparkContext
\n", "\n", " \n", "\n", "v3.3.0
local[*]
pyspark-shell