Testing Your Apache Spark Apps
... or How to Farm Reputation on Stack Overflow
STL Big Data - Innovation, Data Engineering, Analytics Group
May 5, 2021
About Me
Kit Menke is the newest organizer of the STL Big
Data IDEA meetup and the Practice Director for
Data Engineering at 1904labs.
We’re hiring!
https://1904labs.com/your-careers/
● Testing Theory
● Testing is Hard
● Why Test?
● An Example Spark App Testing Setup
● Stack Overflow
Agenda
Testing Theory
Types of Software Tests
● Unit: Verifies the smallest testable parts of an application.
○ Purpose: Verifies methods and/or the smallest testable unit.
● Integration: Verify the interactions and connectivity between the modules of the application.
○ Purpose: Ensure different components work together.
● System: Validates the complete and fully integrated software product.
○ Purpose: Evaluate the end-to-end system.
Regression Tests
Testing Pyramid
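[Figure: pyramid with Unit Tests at the base, Integration Tests in the middle, and System Tests at the top. Isolation: isolated at the base, fully integrated at the top. Speed: tests run faster at the base, slower at the top.]
Most of your tests should be unit tests!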
Volume
● Utilize expected production throughputs to establish the impact on transaction times of
estimated volumes of transactions and users.
Rendezvous tests
● Test the application’s performance while subjected to concurrency issues under production
load and volume.
Stress Tests
● Subject the application to unrealistically high volumes of users accessing the system at the
same time in order to determine a system breaking point.
Soak Tests
● An extended period of testing at predicted business volumes in order to determine if system
performance degrades during a period of continuous usage.
Performance Tests
Basic checks
● How much data are you getting into your data pipeline?
● How much data is coming out of your pipeline?
● Does the schema look right?
Detailed checks
● Are the data types correct?
● Valid values?
○ Distinct values?
○ Ranges?
○ Correct distribution of values? Ex: a lot of null values
Data Validation
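The basic and detailed checks above can be sketched as small functions. This is only an illustration, assuming plain Scala collections stand in for pipeline output; the `Reading` type, the `DataValidation` object, and the temperature range are hypothetical names and values, not part of the talk.

```scala
// Hypothetical row type standing in for one record of pipeline output.
final case class Reading(sensorId: String, temperature: Option[Double])

object DataValidation {
  // Basic check: how many rows were lost between input and output?
  def rowsDropped(inputCount: Long, outputCount: Long): Long =
    inputCount - outputCount

  // Detailed check: fraction of rows with a missing value (0.0 for empty input).
  def nullFraction(rows: Seq[Reading]): Double =
    if (rows.isEmpty) 0.0
    else rows.count(_.temperature.isEmpty).toDouble / rows.size

  // Detailed check: rows whose value falls outside the expected range.
  def outOfRange(rows: Seq[Reading], min: Double, max: Double): Seq[Reading] =
    rows.filter(_.temperature.exists(t => t < min || t > max))

  // Detailed check: the distinct keys that actually appear.
  def distinctSensors(rows: Seq[Reading]): Set[String] =
    rows.map(_.sensorId).toSet
}

object DataValidationExample {
  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Reading("a", Some(20.0)),
      Reading("b", None),
      Reading("a", Some(500.0)),
      Reading("c", Some(-5.0))
    )
    println(s"rows dropped:   ${DataValidation.rowsDropped(10L, 8L)}")
    println(s"null fraction:  ${DataValidation.nullFraction(rows)}")
    println(s"out of range:   ${DataValidation.outOfRange(rows, -40.0, 60.0)}")
    println(s"distinct keys:  ${DataValidation.distinctSensors(rows)}")
  }
}
```

In a real pipeline the same questions would be asked of a DataFrame rather than a `Seq`; keeping the thresholds as parameters makes each check easy to exercise with a handful of rows.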
Testing is Hard
● General
○ People often disagree on what each type of test is.
○ Unreasonable metrics like code coverage.
○ Focus on manual testing instead of automated CI pipelines
● Unit testing
○ Tests that only check for the absence of errors, not functionality.
○ Testing the wrong thing - symptom of this is mocking everything
● Integration and system tests
○ Can be brittle - prone to breaking and requiring constant updates
○ Can be difficult to debug - where is the issue?
● Performance tests
○ Tests aren’t repeatable - the size and shape of your data matter!
○ Results should be comparable over time
Things That Go Wrong
Discussion: why test?
Why: Error Signal Collapse
[Figure labels: Static Analyses, Unit Tests, Integration Tests, System Tests, Performance Tests, Other Tests, Mutation Testing]
1. Prevent bugs from getting into production
2. Allow developers to make changes more confidently/quickly
Why test?
● Align with your team on what tests should look like
● Testing Spark Apps requires your full attention
○ Often many dependencies on other data stores (ex: HDFS, Hive, HBase, databases)
○ Test your logic, not Spark or the dependencies
○ Use pull requests (PR) to review the lack of tests or bad tests
● Start small
○ Bottom (of the pyramid) up - unit tests first
○ Focus on tests that provide the most value
● First priority: run unit tests and build app in a CI pipeline, automatically on PR
● Bug in production? Reproduce it in a unit test first.
Advice for Testing Spark Apps
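One way to read "test your logic, not Spark" and "reproduce it in a unit test first" is to keep transformations as plain functions that can be checked without a cluster. The sketch below assumes a hypothetical `normalizeName` rule and a hypothetical production bug (a null name); neither comes from the talk.

```scala
object NameCleaning {
  // Pure logic: takes a value, returns a value. No SparkSession, no HDFS, no Hive.
  def normalizeName(raw: String): String =
    Option(raw).map(_.trim.toLowerCase).getOrElse("")
}

object NameCleaningCheck {
  def main(args: Array[String]): Unit = {
    // Ordinary case.
    assert(NameCleaning.normalizeName("  Alice ") == "alice")
    // Hypothetical production bug: a null name reached the job.
    // Pin the failure down here before touching the Spark code.
    assert(NameCleaning.normalizeName(null) == "")
    println("all checks passed")
  }
}
```

Because the function knows nothing about Spark, it runs in milliseconds in a CI pipeline, and the Spark job only has to apply it to a column.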
Spark App Testing
● Project Management
○ Maven (for those of us coming from Java it is the familiar tool)
○ Alternatives: sbt
● Spark
○ Version 3.0.1; choose the same version as your cluster
● Unit testing
○ Scalatest, Scalamock
○ Alternatives: JUnit, TestNG
● CI Pipeline
○ Jenkins
○ Alternatives: GitHub Actions, AWS CodeBuild
Spark App Testing Stack (MVP)
● Integration testing
○ Scalatest + Testcontainers
○ Alternatives: Scalamock
● System testing
○ Scripts
○ Alternatives: Java projects
● Performance testing
○ Re-use system test scripts… just with a lot more data
○ Some way to save results (can just be logs!)
● Helpers
○ Spark-testing-base - Base classes to set up/tear down a local Spark context
○ Testcontainers - use Docker containers inside Scalatest
Spark App Testing Stack (expanded)
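To show what "set up/tear down a local Spark context" involves, here is a hand-rolled Scalatest suite. It is a sketch only and does not use spark-testing-base's own base classes; it assumes Scalatest 3.1 or later (for `AnyFunSuite`) and Spark SQL on the test classpath. The suite name and the word-count example are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

class LocalSparkSuite extends AnyFunSuite with BeforeAndAfterAll {
  // One local SparkSession shared by every test in this suite.
  private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    super.beforeAll()
    spark = SparkSession.builder()
      .master("local[2]")      // run inside the test JVM, no cluster needed
      .appName("unit-tests")
      .getOrCreate()
  }

  override def afterAll(): Unit = {
    try {
      if (spark != null) spark.stop()
    } finally {
      super.afterAll()
    }
  }

  test("counts rows per word") {
    // implicits must be imported from a stable identifier, hence the local val.
    val session = spark
    import session.implicits._

    val df = Seq("a", "b", "a").toDF("word")
    val counts = df.groupBy("word").count().collect()
      .map(row => row.getString(0) -> row.getLong(1))
      .toMap

    assert(counts == Map("a" -> 2L, "b" -> 1L))
  }
}
```

Starting the session once per suite rather than once per test keeps the suite fast, which matters if most of your tests are meant to be unit tests.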
Demo Example Project
And now for something completely
different...
Stack Overflow is a question and answer site for professional and enthusiast programmers.
It's built and run by you as part of the Stack Exchange network of Q&A sites. With your help,
we're working together to build a library of detailed answers to every question about
programming.
https://stackoverflow.com/tour
● Gain reputation by asking and answering questions
● Stack Overflow spawned a network of other Q&A sites
○ Ex: Server Fault, Super User, Ask Ubuntu, Math, English, Arqade
Stack Overflow
How do you find the answers to your questions? Usually by typing things into Google and
ending up at Stack Overflow...
● How do you ask a good question?
○ Write a title that summarizes the specific problem
○ Introduce the problem before you post any code
○ Help others reproduce the problem
○ Respond to feedback
Stack Overflow
Unclear
Asking for an opinion
Too large
● Spark (data?) questions are HARD to ask
○ What is your input?
○ What code do you have?
○ What is the expected output?
○ What the heck are you trying to do?
Data Questions
● Use your local development environment to help answer questions!
● Once you’ve found a question you think you can answer, create a unit test
Cultivate your Spark Skills (and reputation!)
Tip #1: Use parallelize to create test data
Tip #2: Use printSchema and show to check
your work
Iterate quickly by re-running your test!
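The two tips can be combined in a few lines. This sketch assumes a hypothetical question ("how do I total the score per name?") and made-up rows; `AnswerSketch` and `scoreTotals` are illustrative names, and Spark SQL is assumed to be on the classpath.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object AnswerSketch {
  // Hypothetical question: "how do I total the score per name?"
  def scoreTotals(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Tip #1: parallelize a handful of rows instead of loading real data.
    val input = spark.sparkContext
      .parallelize(Seq(("alice", 3), ("bob", 5), ("alice", 4)))
      .toDF("name", "score")

    // Tip #2: inspect the schema and the rows before transforming them.
    input.printSchema()
    input.show()

    input.groupBy("name").sum("score")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("answer-sketch")
      .getOrCreate()
    try scoreTotals(spark).show()
    finally spark.stop()
  }
}
```

Keeping the input inline means the whole example can be pasted into an answer, and anyone reading it can re-run it locally.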
○ Get at core functionality
○ Small, self-contained units
○ Easy for someone else to understand
○ Helps others!
Good questions are like unit tests
Get out there, write more tests, and
give back to your community.
Example Spark project with unit tests https://github.com/kitmenke/spark-hello-world
Scalatest https://www.scalatest.org/
spark-testing-base https://github.com/holdenk/spark-testing-base
Testcontainers https://www.testcontainers.org/
Test Pyramid https://martinfowler.com/articles/practical-test-pyramid.html
Monitoring https://www.ibm.com/garage/method/practices/manage/golden-signals
STL IDEA Meetup Talk Ideas
https://docs.google.com/document/d/1x19Kh7OATI1zbCzrvomIf7ffG5OqbBGwt8MFYLahjgE/edit
Links
