{ "cells": [ { "cell_type": "markdown", "id": "864ea14f-932d-4e4d-b44e-f96b55796d79", "metadata": {}, "source": [ "# Quick Start" ] }, { "cell_type": "code", "execution_count": 18, "id": "e18f42a6-0f5e-456a-adf1-8e0be6125980", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Current directory: /home/jovyan/docs/source/01-Quick-Start\n", "Current Python version: 3.10.5\n", "Current Python interpreter: /opt/conda/bin/python\n", "Current site-packages: ['/opt/conda/lib/python3.10/site-packages']\n" ] } ], "source": [ "import os\n", "import sys\n", "import site\n", "\n", "cwd = os.getcwd()\n", "print(f\"Current directory: {cwd}\")\n", "print(f\"Current Python version: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}\")\n", "print(f\"Current Python interpreter: {sys.executable}\")\n", "print(f\"Current site-packages: {site.getsitepackages()}\")\n", "\n", "sys.path.append(os.path.join(cwd, \"site-packages\"))" ] }, { "cell_type": "markdown", "id": "8d8532af-13bc-4861-91bd-13b3a6d21e31", "metadata": {}, "source": [ "## Spark Session\n", "\n", "Spark 集群通常是跑在服务器上的, 而 Jupyter Notebook 则是提供了一个可以和 Spark 集群交互的界面. [Livy](https://livy.apache.org/) 是一个能提供 Spark API 的软件. 你的 Jupyter Notebook 每次运行 Spark API, 实际上则是向 Livy 发送了一个命令, 并把返回的响应展现在 Notebook 中. 无论你是使用 Jupyter Notebook 还是 spark submit, 你都要创建一个 Session 跟 Spark 集群建立连接, 这个连接就叫做 Spark Session. 当然连接长期没有动作就会失效.\n", "\n", "下面是一个创建 Spark Session 的极简例子. 其实还有很多参数可以用, 详情请参考 [pyspark.sql.SparkSession](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html).\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "6b03e3be-3ae4-4521-b4a7-c27dbb3363d0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
"\n", "            <div>\n", "                <p><b>SparkSession - in-memory</b></p>\n", "                \n", "        <div>\n", "            <p><b>SparkContext</b></p>\n", "\n", "            <dl>\n", "              <dt>Version</dt>\n", "                <dd><code>v3.3.0</code></dd>\n", "              <dt>Master</dt>\n", "                <dd><code>local[*]</code></dd>\n", "              <dt>AppName</dt>\n", "                <dd><code>pyspark-shell