Unified governance for all data, analytics and AI assets
![Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3](https://www.databricks.com/sites/default/files/image3-1.png)
Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep div…
Monitoring series:
- Monitoring Apache Spark with Prometheus
- Monitoring multiple federated clusters with Prometheus - the secure way
- Application monitoring with Prometheus and Pipeline
- Building a cloud cost management system on top of Prometheus
- Monitoring Spark with Prometheus, reloaded

At Banzai Cloud we provision and monitor l…
Window functions: Something like this should do the trick:

```scala
import org.apache.spark.sql.functions.{row_number, max, broadcast}
import org.apache.spark.sql.expressions.Window

val df = sc.parallelize(Seq(
  (0, "cat26", 30.9), (0, "cat13", 22.1), (0, "cat95", 19.6), (0, "cat105", 1.3),
  (1, "cat67", 28.5), (1, "cat4", 26.8), (1, "cat13", 12.6), (1, "cat23", 5.3),
  (2, "cat56", 39.6), (2, "cat40", 29.7), (2, "cat187", 27.9) // remaining rows truncated in the snippet
)).toDF("Hour", "Category", "TotalValue") // column names assumed from the data shape

// Rank rows within each Hour by descending TotalValue, keep only the top row per group.
val w = Window.partitionBy($"Hour").orderBy($"TotalValue".desc)

val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
```
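If only the single top row per group is needed, an aggregation over a struct can avoid the window entirely; a sketch reusing the `df` above (with the column names as assumed there):

```scala
import org.apache.spark.sql.functions.{max, struct, col}

// max over a struct compares fields left to right, so putting TotalValue
// first yields the row with the highest TotalValue in each Hour group.
val dfTop2 = df
  .groupBy(col("Hour"))
  .agg(max(struct(col("TotalValue"), col("Category"))).as("top"))
  .select(col("Hour"), col("top.TotalValue"), col("top.Category"))
```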
What is this article? It is the day-21 entry in the Distributed computing (Apache Hadoop, Spark, Kafka, …) Advent Calendar 2017. What does it cover? A summary of the improvements to the Java code generated by the Catalyst optimizer that land in Apache Spark 2.3, scheduled for release in early 2018. So what is the bottom line? Up to Spark 2.2, using complex expressions or a large number of columns in a DataFrame or Dataset query would often throw an exception at runtime and, with bad luck, halt the job. Those exceptions stem from two 64KB limits in the Java class-file specification, defined more than 20 years ago; Spark 2.3 greatly reduces the exceptions these limits cause.
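A minimal sketch of the kind of wide query that used to trip those limits (the column count and names here are arbitrary illustrations, not taken from the article):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("wide-query").getOrCreate()
import spark.implicits._

val base = Seq((1, 2), (3, 4)).toDF("a", "b")

// Derive hundreds of columns in a single plan. On Spark <= 2.2 the
// whole-stage-codegen output for a plan like this could exceed the JVM's
// 64KB-per-method bytecode limit and fail at runtime; Spark 2.3's
// generated-code improvements split the work into smaller methods.
val wide = (1 to 500).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", col("a") * i + col("b"))
}
wide.select((1 to 500).map(i => col(s"c$i")): _*).show(2)
```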
Spark Streaming has supported Kafka since its inception, but much has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually, since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself. So a new Spark Streaming integration comes to the playground, wi…
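A minimal sketch of the 0.10 direct-stream integration, assuming a local broker and a hypothetical `events` topic:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-010-example").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Consumer configuration for the new (0.10) Consumer API.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",       // assumption: local broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",                 // hypothetical group id
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Direct stream: executors consume the assigned partitions themselves,
// with no receiver in between.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Array("events"), kafkaParams)
)

stream.map(record => (record.key, record.value)).print()
ssc.start()
ssc.awaitTermination()
```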
Deep dive into stateful stream processing in Structured Streaming, by Tathagata Das. Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for developers to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there ar…
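A minimal sketch of the explicit route with `mapGroupsWithState`, assuming a hypothetical `Event` schema fed from a test socket source:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(user: String, action: String)
case class UserCount(user: String, count: Long)

val spark = SparkSession.builder.master("local[*]").appName("stateful").getOrCreate()
import spark.implicits._

// Hypothetical input: one whitespace-separated "user action" pair per line.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .as[String]
  .map { line => val Array(user, action) = line.split("\\s+", 2); Event(user, action) }

// Explicit state: a running event count per user, carried across micro-batches.
def updateCount(user: String, batch: Iterator[Event], state: GroupState[Long]): UserCount = {
  val newCount = state.getOption.getOrElse(0L) + batch.size
  state.update(newCount)
  UserCount(user, newCount)
}

val counts = events
  .groupByKey(_.user)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateCount)

counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```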
Easy, scalable, fault-tolerant stream processing with Structured Streaming, with Tathagata Das. Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a…
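A minimal sketch of that batch-like style: the streaming word count below differs from its batch counterpart only in the `readStream`/`writeStream` calls (the socket source and console sink are assumptions for local testing, e.g. against `nc -lk 9999`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("streaming-wordcount").getOrCreate()
import spark.implicits._

// Unbounded input table: each new line on the socket becomes a new row.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Exactly the query you would write on a static DataFrame.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```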
Nowadays, Spark is surely one of the most prevalent technologies in the fields of data science and big data. Luckily, even though it is developed in Scala and runs in the Java Virtual Machine (JVM), it comes with Python bindings, also known as PySpark, whose API was heavily influenced by Pandas. With respect to functionality, modern PySpark has about the same capabilities as Pandas when it comes to…