[SPARK-48381][K8S][DOCS] Update `YuniKorn` docs with v1.5.1

### What changes were proposed in this pull request?

This PR aims to update `YuniKorn` docs with v1.5.1 for Apache Spark 4.0.0.

### Why are the changes needed?

Apache YuniKorn v1.5.1 was released on 2024-05-16 with 18 bug fixes.
- https://yunikorn.apache.org/release-announce/1.5.1
  - Locking fixes to avoid existing and potential deadlocks (YUNIKORN-2521, YUNIKORN-2544)
  - Deadlock detection (YUNIKORN-2539)

I installed YuniKorn v1.5.1 on K8s 1.29 and tested manually.

**K8s v1.29**
```
$ kubectl version
Client Version: v1.30.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2
```

**YuniKorn v1.5.1**
```
$ helm list -n yunikorn
NAME      NAMESPACE  REVISION  UPDATED                               STATUS    CHART           APP VERSION
yunikorn  yunikorn   1         2024-05-21 10:15:40.207847 -0700 PDT  deployed  yunikorn-1.5.1
```

```
$ build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dspark.kubernetes.test.deployMode=docker-desktop "kubernetes-integration-tests/testOnly *.YuniKornSuite" -Dtest.exclude.tags=minikube,local,decom,r -Dtest.default.exclude.tags=
...
[info] YuniKornSuite:
[info] - SPARK-42190: Run SparkPi with local[*] (6 seconds, 787 milliseconds)
[info] - Run SparkPi with no resources (9 seconds, 869 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 815 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 765 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 783 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (9 seconds, 836 milliseconds)
[info] - Run SparkPi with an argument. (9 seconds, 790 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 880 milliseconds)
[info] - All pods have the same service account by default (9 seconds, 802 milliseconds)
[info] - Run extraJVMOptions check on driver (6 seconds, 138 milliseconds)
[info] - SPARK-42474: Run extraJVMOptions JVM GC option check - G1GC (5 seconds, 906 milliseconds)
[info] - SPARK-42474: Run extraJVMOptions JVM GC option check - Other GC (5 seconds, 898 milliseconds)
[info] - SPARK-42769: All executor pods have SPARK_DRIVER_POD_IP env variable (8 seconds, 783 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (10 seconds, 423 milliseconds)
[info] - Run SparkPi with env and mount secrets. (12 seconds, 95 milliseconds)
[info] - Run PySpark on simple pi.py example (9 seconds, 881 milliseconds)
[info] - Run PySpark to test a pyfiles example (10 seconds, 951 milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 815 milliseconds)
[info] - Run in client mode. (4 seconds, 279 milliseconds)
[info] - Start pod creation from template (8 seconds, 856 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (8 seconds, 912 milliseconds)
[info] - A driver-only Spark job with a tmpfs-backed localDir volume (6 seconds, 269 milliseconds)
[info] - A driver-only Spark job with a tmpfs-backed emptyDir data volume (6 seconds, 401 milliseconds)
[info] - A driver-only Spark job with a disk-backed emptyDir volume (5 seconds, 787 milliseconds)
[info] - A driver-only Spark job with an OnDemand PVC volume (5 seconds, 926 milliseconds)
[info] - A Spark job with tmpfs-backed localDir volumes (8 seconds, 525 milliseconds)
[info] - A Spark job with two executors with OnDemand PVC volumes (10 seconds, 206 milliseconds)
[info] - PVs with local hostpath storage on statefulsets !!! CANCELED !!! (2 milliseconds)
[info] - PVs with local hostpath and storageClass on statefulsets !!! IGNORED !!!
[info] - PVs with local storage !!! CANCELED !!! (0 milliseconds)
[info] Run completed in 6 minutes, 37 seconds.
[info] Total number of tests run: 27
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 27, failed 0, canceled 2, ignored 1, pending 0
[info] All tests passed.
[success] Total time: 414 s (06:54), completed May 21, 2024, 10:28:59 AM
```

```
$ kubectl describe pod -l spark-role=driver -n spark-8a4e575d62cf4634b025601b3d0d4409
...
Events:
  Type    Reason             Age   From      Message
  ----    ------             ----  ----      -------
  Normal  Scheduling         2s    yunikorn  spark-8a4e575d62cf4634b025601b3d0d4409/spark-test-app-f04cff44b1dd48b4b2aac7b0e1a74b12-driver is queued and waiting for allocation
  Normal  Scheduled          2s    yunikorn  Successfully assigned spark-8a4e575d62cf4634b025601b3d0d4409/spark-test-app-f04cff44b1dd48b4b2aac7b0e1a74b12-driver to node docker-desktop
  Normal  PodBindSuccessful  2s    yunikorn  Pod spark-8a4e575d62cf4634b025601b3d0d4409/spark-test-app-f04cff44b1dd48b4b2aac7b0e1a74b12-driver is successfully bound to node docker-desktop
  Normal  Pulled             1s    kubelet   Container image "docker.io/kubespark/spark:dev" already present on machine
  Normal  Created            1s    kubelet   Created container spark-kubernetes-driver
  Normal  Started            1s    kubelet   Started container spark-kubernetes-driver
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46690 from dongjoon-hyun/SPARK-48381.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
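For reference, installing YuniKorn v1.5.1 with Helm (as done for the manual test above) generally follows the upstream YuniKorn documentation; the chart repository URL, namespace, and flags below are an illustrative sketch taken from those docs, not from this PR:

```
# Add the Apache YuniKorn chart repository and install the 1.5.1 release
# into its own namespace (illustrative; see https://yunikorn.apache.org/docs/ for details).
helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --create-namespace --version 1.5.1
```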
README.md

Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

https://spark.apache.org/


Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)
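If you need optional modules, Maven build profiles can be enabled on the same command. As an illustrative sketch (the Kubernetes profile exists in the Spark build; pick the profiles your deployment actually needs, per "Building Spark"):

./build/mvn -Pkubernetes -DskipTests clean package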

More detailed documentation is available from the project site, at “Building Spark”.

For general development tips, including info on developing Spark using an IDE, see “Useful Developer Tools”.

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1,000,000,000:

scala> spark.range(1000 * 1000 * 1000).count()
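As a slightly richer sketch using nothing beyond the standard Dataset API, you can chain a filter before counting; this should return 500,000,000:

scala> spark.range(1000 * 1000 * 1000).filter("id % 2 == 0").count()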

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1,000,000,000:

>>> spark.range(1000 * 1000 * 1000).count()
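Similarly, a minimal DataFrame sketch in the Python shell (standard DataFrame API only); this should print a single-row table whose value is 4950:

>>> spark.range(100).selectExpr("sum(id)").show()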

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.
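Parameters are passed straight through to the example. For instance, SparkPi accepts the number of partitions to use for the estimate (100 below is just an illustrative value):

./bin/run-example SparkPi 100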

You can set the MASTER environment variable when running examples to submit examples to a cluster. This can be a spark:// URL, "yarn" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run tests for a module, or individual tests.
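As a hedged illustration of scoping a test run (the module and suite names below are examples, not a complete list; consult the developer docs for the invocation you need), SBT can run a single suite and the Python test runner can target one module:

./build/sbt "core/testOnly *SparkContextSuite"
./python/run-tests --modules pyspark-core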

There is also a Kubernetes integration test suite; see resource-managers/kubernetes/integration-tests/README.md.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at “Specifying the Hadoop Version and Enabling YARN” for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
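As an illustrative sketch (the Hadoop version shown is only an example), the target Hadoop version can be passed to the Maven build alongside the YARN profile:

./build/mvn -Pyarn -Dhadoop.version=3.3.6 -DskipTests clean package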

Configuration

Please refer to the Configuration Guide in the online documentation for an overview on how to configure Spark.
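For orientation, configuration can also be overridden per invocation with --conf; the properties below are standard Spark configs, but the values are purely illustrative:

./bin/spark-shell --master "local[4]" --conf spark.sql.shuffle.partitions=64 --conf spark.driver.memory=2g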

Contributing

Please review the Contribution to Spark guide for information on how to get started contributing to the project.
