This document discusses handling large amounts of genomic data using probabilistic data structures like Bloom filters. Bloom filters allow storing and querying large amounts of genomic sequence data in a memory-efficient way. They can be used to assemble short DNA sequences, reduce graph complexity, and trim errors from assemblies. The approach works well for pre-filtering large metagenomic datasets, enabling assembly of 200GB datasets using a single machine.
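The membership-query idea behind this approach can be sketched with a minimal Bloom filter storing the k-mers of a DNA read. This is a toy illustration, not the khmer/probabilistic implementation the talk describes; the class name and parameters are hypothetical:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: num_hashes hash functions over a fixed-size bit table."""
    def __init__(self, size=1000, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive independent hash positions by salting one strong hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._positions(item))

# Store every k-mer (length-4 substring) of a short DNA read.
read = "ACGGTTACCA"
k = 4
bf = BloomFilter()
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])

print("ACGG" in bf)   # True: this k-mer was stored
```

The memory cost is fixed by the bit-table size regardless of how many k-mers are inserted, which is what makes the approach attractive for very large sequence datasets.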
The document discusses clustering and numpy arrays in Python. It shows how to create arrays using numpy, perform operations like summing and finding min/max values, and access elements and slices. It also introduces Cython and demonstrates compiling a simple "Hello World" Cython program and using Cython to optimize a Python prime number generation function for improved performance.
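The numpy operations mentioned can be sketched briefly (illustrative values, not taken from the slides):

```python
import numpy as np

# Create an array and run basic aggregations.
a = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(a.sum())   # 31
print(a.min())   # 1
print(a.max())   # 9

# Indexing and slicing work like Python lists, but slices return views.
print(a[0])      # 3
print(a[2:5])    # [4 1 5]

# Vectorized operations apply element-wise without explicit loops --
# this is the kind of work Cython can also accelerate in plain Python code.
print(a * 2)     # [ 6  2  8  2 10 18  4 12]
```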
Network Science slides for computing with Python, prepared for the graduate course Digital Technologies in Education at the Department of Mathematics, University of Patras, during the winter semester of 2014-15.
These slides will be updated continuously until the end of the semester (late December 2014). The date of the latest update appears on the first page of the slides.
This document contains information about programming in R, including practical examples. It discusses accessing and subsetting data, using regular expressions for text search, creating functions, and using loops. Examples are provided to demonstrate creating vectors, accessing subsets of vectors, using regular expressions to find patterns in text, creating functions to convert between units or estimate values, and using for loops to repeat operations over multiple elements. The document suggests R is useful for working with big data in biology and other fields due to its ability to automate tasks, integrate with other tools, and handle large datasets through programming.
The document discusses the deque collection in Python. Some key points:
- Deque allows fast appends and pops from either side of the list, with O(1) time complexity, unlike regular lists which are slow (O(n)) for pop(0) and insert(0,v).
- Deque provides methods like append, appendleft, popleft, pop for adding/removing elements from either side of the list.
- It can be initialized with a maximum length to act as a sliding window, discarding old elements as new ones are added.
- Methods like rotate shift the deque by a given number of positions, and extend adds multiple elements at once. A deque is useful when fast insertion and removal are needed at both ends of a sequence.
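The points above can be demonstrated in a few lines (illustrative values):

```python
from collections import deque

d = deque([2, 3, 4])
d.append(5)        # add on the right -> deque([2, 3, 4, 5])
d.appendleft(1)    # add on the left  -> deque([1, 2, 3, 4, 5])
d.pop()            # remove from the right, returns 5
d.popleft()        # remove from the left, returns 1
print(d)           # deque([2, 3, 4])

# rotate(n) shifts elements n steps to the right.
d = deque([1, 2, 3, 4, 5])
d.rotate(2)
print(d)           # deque([4, 5, 1, 2, 3])

# A maxlen turns the deque into a sliding window: old items fall off the left.
window = deque(maxlen=3)
for x in range(5):
    window.append(x)
print(window)      # deque([2, 3, 4], maxlen=3)
```

All of the end operations shown are O(1), unlike `list.pop(0)` and `list.insert(0, v)`, which are O(n).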
PLOTCON NYC: Behind Every Great Plot There's a Great Deal of Wrangling (Plotly)
If you are struggling to make a plot, tear yourself away from stackoverflow for a moment and ... take a hard look at your data. Is it really in the most favorable form for the task at hand? Time and time again I have found that my visualization struggles are really a symptom of unfinished data wrangling. R has long had excellent facilities for data aggregation or "split-apply-combine": split an object into pieces, compute on each piece, and glue the result back together again. Recent developments, especially in the purrr package, have made "split-apply-combine" even easier and more general. But this requires a certain comfort level with lists, especially with lists that are columns inside a data frame. This is unfamiliar to most of us. I give an overview of this set of problems and match them up with solutions based on grouped, nested, and split data frames.
This presentation covers Python's most important data structures: lists, dictionaries, sets, and tuples. Exception handling and random number generation using the simple Python module "random" are also covered, and simple Python programs are included at the end of the presentation.
The document provides information on arrays and hashes in Ruby. It discusses that arrays are ordered lists that can contain objects, and hashes are collections of key-value pairs. It then provides examples of creating, accessing, and modifying arrays and hashes. It also discusses various methods for iterating over arrays and hashes, such as each, collect, and each_pair.
Here are the steps to solve this problem:
1. Convert both lists of numbers to sets:
set1 = {11, 2, 3, 4, 15, 6, 7, 8, 9, 10}
set2 = {15, 2, 3, 4, 6} (the duplicate 15 in the list collapses when converted to a set)
2. Find the intersection of the two sets:
intersection = set1.intersection(set2)
3. The number of elements in the intersection is the number of similar elements:
similarity = len(intersection)
4. Print the result:
print(similarity)
The similarity between the two sets is 5, since they both contain the elements {2, 3, 4, 6, 15}.
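The steps above, assembled into a runnable script (note that converting the second list to a set collapses the duplicate 15, so the intersection has five elements):

```python
list1 = [11, 2, 3, 4, 15, 6, 7, 8, 9, 10]
list2 = [15, 2, 3, 4, 15, 6]

# Step 1: convert both lists to sets (duplicates collapse).
set1 = set(list1)
set2 = set(list2)          # {2, 3, 4, 6, 15}

# Step 2: find the intersection of the two sets.
intersection = set1.intersection(set2)

# Step 3: the number of shared elements is the similarity.
similarity = len(intersection)

# Step 4: print the result.
print(sorted(intersection))  # [2, 3, 4, 6, 15]
print(similarity)            # 5
```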
Presented in Bangalore Open Java User Group on 21st Jan 2017
Awareness of design smells - design comes before code, and care at the design level can solve a lot of problems.
Indicators of common design problems - helps developers or software engineers understand mistakes made while designing and apply design principles for creating high-quality designs. This presentation provides insights gained from performing refactoring in real-world projects to improve refactoring and reduce the time and costs of managing software projects. The talk also presents insightful anecdotes and case studies drawn from the trenches of real-world projects. By attending this talk, you will know pragmatic techniques for refactoring design smells to manage technical debt and to create and maintain high-quality software in practice. All the examples in this talk are in Java.
Haskell is a pure functional programming language that is statically typed and implements immutable data structures. It uses recursion extensively in place of loops. Some key features include:
- Functional programming with immutable values and functions as first-class citizens
- Static typing with type inference
- Recursion instead of loops for iteration
- List comprehensions and pattern matching
It has compilers like GHC and interpreters like Hugs. Code examples demonstrate list handling, recursion, and functional concepts like currying, partial application, and let/in bindings. Haskell has advantages for learning functional programming, but also disadvantages such as initial complexity and performance that lags behind some other languages.
This document provides an overview of phylogenetic analysis tools and techniques available in R. It discusses how to get sequence data from GenBank, align sequences, perform phylogenetic inference using various methods like neighbor joining and maximum likelihood, visualize and analyze trees, model trait evolution, reconstruct ancestral states, simulate trees, and access phylogenetic data from online repositories. Examples are given for many of the tasks using popular R packages like ape, phangorn, picante, and phytools.
This document provides a lesson on Arrays and Hashes in Ruby. It discusses what Arrays and Hashes are, how to create them, and common operations like iteration, copying, and converting between Arrays and Hashes. It also includes exercises for learners to practice these concepts without using built-in methods like map, select, and inject. At the end, it prompts attendees to introduce themselves.
This document introduces Faisal Abid and provides a brief summary of his background and work experience. It mentions that he is a software engineer and entrepreneur who works on the tablet team at Kobo. It also lists some things he has worked on and blogs about, including the history of JavaScript and CoffeeScript.
This document provides an overview of features in the Guava library for Java, including:
- Using Optional to avoid null values and NullPointerExceptions.
- Using Preconditions to validate arguments and throw exceptions.
- Using Throwables to handle exceptions in a cleaner way.
- Functional idioms for working with collections and functions.
- Multisets and multimaps for representing collections that allow duplicates.
- Ranges for representing ranges of values like integers or characters.
- Hashing utilities for generating hashes like MD5.
Odessapy2013 - Graph databases and Python (Max Klymyshyn)
Page 10: "Я из Одессы, я просто бухаю." Translation: "I'm from Odessa, I just drink." Meaning: he drinks a lot of vodka ^_^ (@tuc @hackernews)
This is a local meme - used when someone asks a question and you would look stupid if you don't have an answer.
The document outlines homework expectations and activities for terms 3 and 4 of 2015 for the Puketapapa Team. Students are expected to complete 1-2 homework activities per fortnight, with some shaded activities taking longer to complete. Students should submit their highest quality work via several methods. In addition to homework activities, students are expected to read daily, participate in Mathletics, and use SpellingCity. The document provides several homework activity options centered around financial literacy.
Little Red Riding Hood brings cake and wine to her sick grandmother, who lives in a house in the woods. On her way, she meets a wolf who tricks her into picking flowers so he can reach the grandmother's house first. He eats the grandmother and waits for Little Red Riding Hood in her bed. When she arrives, he eats her too. A hunter later finds the wolf and cuts open his belly to save the two, finding them inside. They fill the wolf with stones so he dies, and all three live happily ever after.
Sarcoidosis is an autoimmune disease where the immune system overreacts and attacks the body's tissues, causing inflammation and formation of granulomas. It most commonly affects the lungs and lymph nodes in the chest, but can also impact the skin, eyes, liver, heart, nervous system, and musculoskeletal system. Symptoms vary depending on the affected organs but may include cough, skin lesions, eye irritation, joint pain, and fatigue. Diagnosis involves ruling out other potential diseases through tests like chest x-rays, blood tests, and biopsies of affected tissues showing non-caseating granulomas. Treatment typically involves corticosteroids to reduce inflammation, though some cases resolve without treatment.
Netiquette refers to etiquette on the Internet and provides guidelines for polite and appropriate online communication. The document lists several dos and don'ts of netiquette, including being polite, avoiding all capital letters, using emoticons to convey tone, keeping messages brief, identifying yourself, and being patient with newcomers. It also advises avoiding spamming, flaming, and using emoticons wisely in online interactions. Adhering to netiquette helps ensure respectful and productive communication.
Sarah Halstead presented on perspectives of poverty and class to Mid-State Technical College. Over two hours, she aimed to:
1) Examine how cultural values and experiences influence thinking and decision-making.
2) Recognize opportunities related to social class, privilege, and family circumstances.
3) Engage in building positive relationships across class lines.
4) Understand differences in language and communication among socio-economic groups.
5) Identify a goal for applying the information learned.
The document provides an overview of internet trends in Brazil and Latin America from November 2009. It finds that Brazil has the largest internet population in Latin America with over 30 million users, and users spend significant time online at over 26 hours per month on average. The internet audience in Brazil skews young, with 65% under 35 years old. Heavy internet users account for more page views than in other markets. Social networking, particularly Orkut, is very popular among Brazilian internet users.
This document provides a summary and analysis of benchmark data for online advertising performance between Q3 2008 and Q2 2009. It finds that while larger ad sizes tend to perform better for standard banners, size is not a strong predictor of performance for rich media ads, which are better optimized by adding features like video. Overall, the two most common ad sizes are 300x250 and 728x90, comprising 70% of impressions. The document also provides metrics on user engagement for rich media, such as average dwell rates and times.
Catalyst Group conducted a qualitative study to see whether people who’d never used digital books preferred the more popular Kindle or Sony’s eReader. The results are intriguing.
Online video advertising has grown tremendously in recent years, outpacing rich media advertising by 60%. Video ads engage users more by doubling dwell time and rates. They also boost ROI, with video ads generating twice the ROI of non-video rich media ads. The document provides best practices for different online video ad formats to maximize engagement, including recommendations around initiation methods, sound, call to actions, length, and placement.
The document provides information about an upcoming webquest activity for students about the 2011 Rugby World Cup. The Rugby World Cup will take place from September 9th to October 23rd in New Zealand, featuring 20 teams divided into 4 pools playing matches at different venues. As part of a class activity, students will work in pairs to design a piece of commemorative memorabilia for the event that conveys factual information about the Rugby World Cup in an easy-to-create format that can be swapped between students and appeals to different age groups. Students will research commemorative memorabilia, design their own, create copies, and their work will be evaluated based on a provided rubric focusing on layout, graphics, facts included, and spelling.
How to download Microsoft Security Essentials? (jessecadelina)
The document provides step-by-step instructions for downloading and installing the free Microsoft Security Essentials antivirus software. It directs the user to go to the Microsoft Security Essentials website, click download, select their language and operating system, save and run the file, and complete a brief validation process to finish installation.
This document discusses enabling data-intensive biology through superior software and algorithms. It proposes a distributed graph database server that would allow querying across multiple public and private data sets. This would help address the growing data challenge in biology by providing a way to explore, query and mine large datasets in an open and collaborative manner. The goal is to incentivize data sharing and enable new types of data-driven investigations.
This document notes the adoption of Laurence in March 2013 but provides no other details about Laurence, the adoption process, or those involved. It is a very brief document, just three words and a date, recording only when the adoption occurred.
The significance of higher-order ... procedures is that they enable us to represent procedural abstractions explicitly as elements in our programming language, so that they can be handled just like other computational elements.
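The quoted point (from SICP) can be illustrated in Python, where procedures are ordinary values that can be passed around, returned, and composed; the helper names below are hypothetical:

```python
def compose(f, g):
    """Return a new function that applies g, then f."""
    return lambda x: f(g(x))

def double(x):
    return 2 * x

def increment(x):
    return x + 1

# A procedure built from other procedures, handled like any other value.
double_then_increment = compose(increment, double)
print(double_then_increment(5))       # 11

# Higher-order built-ins treat procedures as data too.
print(list(map(double, [1, 2, 3])))   # [2, 4, 6]
```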
Python quickstart for programmers: Python Kung Fu (climatewarrior)
The document provides an overview of key Python concepts including data types, operators, control flow statements, functions, objects and classes. It discusses lists in depth, covering creation, iteration, searching and common list methods. It also briefly touches on modules, exceptions, inheritance and other advanced topics.
The document discusses the benefits of declarative programming using Scala. It provides examples of implementing algorithms and data structures declaratively in Scala. It also discusses the history and future of Scala, as well as how Scala encourages thinking about programs as transformations rather than changes to memory.
Kirby Urner discusses using Python to teach mathematics through programming and storytelling. Some key ideas include using Python to demonstrate mathematical concepts like functions, objects, algorithms, and data structures. Examples shown include generating sequences, animating polyhedral numbers, and building mathematical objects in Python. The document concludes that programming in Python can help build a stronger understanding of mathematics compared to specialized learning languages.
The document discusses key topics in software engineering including software products, product attributes, the importance of product characteristics, the software engineering process, engineering process models, software process models, and the advantages and problems of different process models. It introduces these topics and provides some brief explanations about each one.
Good practices for PrestaShop code security and optimization (PrestaShop)
The document discusses various optimizations that can be made to improve the performance and security of a PrestaShop installation. It covers optimizations to server infrastructure, database queries, PHP code, and front-end performance. Key recommendations include using caching, minimizing database queries and regular expressions, compressing responses, and securing against common attacks like SQL injection. Measurements are suggested to identify bottlenecks before optimizing.
This document discusses new features in CakePHP version 2.2 related to view blocks, JSON and XML views, improved hashing performance, date/time utilities, and scoped logging. It provides examples of using view blocks to keep HTML DRY, creating JSON views, benchmarking hash performance improvements over sets, using the new CakeTime and CakeNumber utilities, and attaching loggers with scopes to filter log messages.
The document provides an overview of SQL and PHP for working with databases. It discusses SQL concepts like creating and modifying tables, inserting and selecting data. It then covers connecting to databases from PHP, executing SQL queries from PHP, and processing HTML forms to insert data into databases using PHP. Key topics include SQL syntax for common operations, the basic PHP code for connecting to MySQL, running queries, and retrieving result rows, and using the $_POST array to access form data submitted to a PHP processing page.
The document discusses 10 important C programming interview questions. It provides detailed solutions to questions such as swapping two variables without a temporary variable, solving the 8 queens problem, printing a matrix helically, reversing words in a sentence in-place, generating permutations, and calculating the factorial of a number recursively. For each question, it explains the algorithm and provides sample C code to implement the solution.
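Two of the listed problems, sketched in Python for brevity (the talk itself uses C; in C the swap is typically done with arithmetic or XOR tricks rather than tuple unpacking):

```python
# Swap two variables without an explicit temporary: Pythonic tuple unpacking.
a, b = 10, 20
a, b = b, a
print(a, b)            # 20 10

# The C arithmetic idiom, transliterated.
x, y = 10, 20
x = x + y              # x = 30
y = x - y              # y = 10 (the original x)
x = x - y              # x = 20 (the original y)
print(x, y)            # 20 10

def factorial(n):
    """n! computed recursively; base case is 0! == 1."""
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(5))    # 120
```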
Object Orientation vs. Functional Programming in Python (Python Ireland)
The document discusses object orientation and functional programming approaches in Python. It covers various object-oriented programming concepts like the template method pattern, abstract base classes, mixins, and composition. It also covers functional programming concepts like callbacks, higher-order functions, decorators, and partial function application. It concludes that Python supports both paradigms well and that depending on the situation, one approach may be more appropriate, but the tools can also complement each other.
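Two of the functional concepts named, decorators and partial application, can be sketched together (the function names here are hypothetical examples, not taken from the talk):

```python
from functools import partial, wraps

def logged(func):
    """A decorator: a higher-order function that wraps another function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"{func.__name__} -> {result}")
        return result
    return wrapper

@logged
def power(base, exponent):
    return base ** exponent

# Partial application fixes one argument, producing a new callable.
square = partial(power, exponent=2)
print(square(7))   # logs "power -> 49", then prints 49
```

Both tools complement object-oriented code: a decorator can replace a small template-method hierarchy, and `partial` can stand in for a one-method configuration object.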
In SQL, joins are a fundamental concept, and many database engines have serious problems when it comes to joining very many tables.
PostgreSQL is a pretty cool database - the question is just: how many joins can it take?
It is known that Oracle does not accept insanely long queries, and MySQL is known to core dump at around 2000 tables.
This talk shows how to join 1 million tables with PostgreSQL.
Following a game show format made popular by Joshua Bloch and Neal Gafter's Java Puzzlers, this presentation intends to both entertain and inform. Snippets of Python code whose behaviour is not entirely obvious are shown; the audience is then asked to pick, from a number of options, what the behaviour of the program is. The correct and sometimes non-intuitive answer is then given, along with a brief explanation of the idea the puzzle exposes. Only a modest working knowledge of the Python language is required to understand the puzzles, but they may also entertain the more experienced Python programmer.
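In the spirit of such puzzles, here is a classic of the genre (not necessarily one from the talk): what do the first two calls print?

```python
def append_to(item, acc=[]):
    acc.append(item)
    return acc

print(append_to(1))   # [1]
print(append_to(2))   # [1, 2] -- not [2]!

# The default list is created once, at function definition time,
# and shared across all calls. The usual fix uses a None sentinel:
def append_to_fixed(item, acc=None):
    if acc is None:
        acc = []
    acc.append(item)
    return acc

print(append_to_fixed(1))   # [1]
print(append_to_fixed(2))   # [2]
```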
The document contains code examples demonstrating various Scala programming concepts such as functions, pattern matching, traits, actors and more. It also includes links to online resources for learning Scala.
Introduction to source{d} Engine and source{d} Lookout (source{d})
Join us for a presentation and demo of source{d} Engine and source{d} Lookout. Combining code retrieval, language-agnostic parsing, and git management tools with familiar APIs, source{d} Engine simplifies code analysis. source{d} Lookout is a service for assisted code review that enables running custom code analyzers on GitHub pull requests.
This document provides an agenda and overview for a class on using R to work with data. The class covers topics like calculating, joining, and grouping data in R; using R to build databases in Google Sheets; and introducing R Markdown for automating reporting. Specific sessions will demonstrate generating fake data from GitHub, data transformations with dplyr, different types of joins, uploading/downloading from Google Sheets, and creating dashboards in DataStudio.
The document provides an overview of the Scala programming language. It discusses how Scala removes some features from Java like break/continue and static, unifies functional programming and object-oriented programming, and treats functions as first-class objects. Key aspects of Scala covered include treating all operators as methods, higher-order functions, pattern matching with case classes, and functional operations on collections like List.
This document summarizes a lecture about practical programming in Haskell. It discusses reading files, counting words, and improving performance. It shows how to:
1) Read a file into a bytestring for efficient processing, count the words, and print the result.
2) Use the Data.Text module for Unicode support when processing text files, reading the file as bytes and decoding to text before counting words.
3) Achieve performance comparable or better than C implementations when choosing efficient data representations like bytestrings and text.
The document discusses concurrency models and patterns in programming languages. It describes how features like first-class functions can make some patterns effectively invisible in a language. Common patterns like threading and actors are discussed, along with implementations using Communicating Sequential Processes and the actor model in different languages. The goal is to illustrate how these concepts recur across languages.
This document outlines a 12-step program for biology to adapt to the era of data-intensive science. It summarizes the author's background and research interests. It then discusses the rapid growth of biological data from techniques like DNA sequencing. It introduces the concept of digital normalization as a way to efficiently process large transcriptome datasets. Finally, it outlines some proposed steps for the field, including investing in computational training, a focus on biological questions, and moving to continuous data updating models.
The document discusses the challenges and opportunities that will arise from the exponential growth of biological data in the coming years. It outlines four key areas: 1) Research approaches will need to effectively analyze infinite amounts of data. 2) Software and decentralized infrastructure will be needed to process the data. 3) Open science and reproducible research practices are important for data-driven biology. 4) Training the next generation of biologists in data analysis skills will be a major challenge. The document advocates for open source tools, reproducible research methods, and expanded training programs to help biology take advantage of the coming data deluge.
- Biology is generating vast amounts of data from techniques like DNA sequencing that is growing exponentially. However, biology is unprepared to effectively analyze and utilize this "big data".
- There are few researchers trained in both data analysis and biology. Data is often not shared between researchers, and the current publishing system discourages open sharing of knowledge. Most computational research also lacks reproducibility.
- The presenter advocates for open science, data sharing, and improved training for the next generation of researchers in both biology and data analysis skills to help address these challenges and better leverage the growing stores of biological data.
This document discusses the challenges and opportunities biology faces with increasing data generation. It outlines four key points:
1) Research approaches for analyzing infinite genomic data streams, such as digital normalization which compresses data while retaining information.
2) The need for usable software and decentralized infrastructure to perform real-time, streaming data analysis.
3) The importance of open science and reproducibility given most researchers cannot replicate their own computational analyses.
4) The lack of data analysis training in biology and efforts at UC Davis to address this through workshops and community building.
Shotgun metagenomics involves collecting environmental samples, extracting DNA from the samples, sequencing the DNA using shotgun sequencing, and then analyzing the sequence data computationally. Key steps include assembling reads into longer contigs to aid analysis and annotation. While assembly works well for some datasets, challenges include repeats, low coverage of low-abundance species, and strain variation. High coverage, often 10x or more per genome, is critical for robust assembly. The amount of sequencing needed can be substantial, such as terabases of data to deeply sample microbial communities.
This document discusses complex metagenome assembly and career thoughts in bioinformatics. It begins with the speaker's research background and then discusses two main topics: 1) challenges with metagenome assembly due to low coverage regions and strain variation in sequencing data, and approaches using assembly graphs, and 2) the need for more "bioinformaticians in the middle" who are comfortable with both biology and computational analysis to integrate large-scale data into their research. The speaker provides advice for embracing computation and seeking formal training opportunities to develop skills at this intersection of disciplines.
This document summarizes a talk titled "A Wager for 2016: How Software Will Beat Hardware in Biological Data Analysis". The talk discusses how software approaches can outpace hardware for analyzing large biological datasets. It notes that current variant calling approaches have limitations due to being I/O intensive and requiring multiple passes over data. The talk introduces approaches using lossy compression and streaming algorithms that can perform analysis more efficiently using less memory and in a single pass. This could enable analyzing a human genome on a desktop computer by 2016 as wagered. The talk argues that with better algorithmic tools, biological data analysis need not require large computers and can scale with the information content of data rather than just data size.
The document discusses strategies for working with large biological datasets as sequencing costs decrease and data volumes increase exponentially. It summarizes three key uses for abundant sequencing data: hypothesis falsification, model comparison, and hypothesis generation. The author's lab aims to develop open tools for moving quickly from raw data to hypotheses and identify challenges preventing collaborators from doing their science. Summarizing a discussion on soil microbial communities, it notes the immense diversity and challenges of culture-dependent approaches, necessitating single-cell sequencing and metagenomics.
The document discusses how to interpret a person's genome sequence. It explains that while we can identify genetic variations and inherited conditions, a genome sequence alone does not reveal much because DNA is complex code that is difficult to decipher. Environmental factors and how genes interact also influence traits. The document outlines the process of genome sequencing, mapping reads to a reference, variant calling, and challenges in interpretation due to incomplete knowledge and versioning issues.
C. Titus Brown provides a summary of his career path and activities leading up to receiving tenure as an Assistant Professor. He details his publication record, grant funding, teaching responsibilities, and other service activities over 7 years. Brown emphasizes the importance of pursuing open science practices like preprints, blogging, and open source software throughout his career. He also notes the significant role of luck and timing in his success. Brown concludes by reflecting on lessons learned and looking ahead to his new role as an Associate Professor.
This document discusses new directions for the khmer bioinformatics platform, including developing semi-streaming algorithms for sequence analysis using k-mers. Digital normalization is presented as an initial approach that compresses sequencing data, though it discards information. Later work introduced a two-pass semi-streaming framework using saturation detection to enable error correction and variant calling using minimal memory. Current work includes developing a pair-HMM-based graph aligner and applying it to tasks like variant calling. The khmer platform provides implementations of these streaming algorithms to enable analysis of large genomic and metagenomic datasets.
This document discusses analyzing large sequencing datasets and summarizing metagenomic communities. It describes benchmarking different assembly methods on a mock community dataset. Digital normalization and partitioning treatments were found to save computational time without altering assembly results. Approximately 90% of genomes were recovered, with few misassemblies. Deeper sequencing is needed to fully reconstruct communities, with petabase-pair-scale sampling required. Computational resources must scale to analyze the large volumes of data that will be generated from deeper metagenomic surveys.
This document discusses concepts and tools for exploring large sequencing datasets. It begins by providing background on building tools to analyze large sequencing datasets and enabling scientists to quickly generate hypotheses from data. The document then outlines goals of enabling hypothesis-driven biology through better hypothesis generation and refinement, and making sequence analysis less valuable and putting the author out of a job. It presents a narrative arc that discusses reconstructing community genomes from shotgun metagenomics, underlying enabling approaches and tools, and a plan for positive global influence through technology and training.
This document discusses ways to incentivize scientists to share their data through self-interest. It describes two existing models where data sharing is successful: oceanographic research consortia that require data sharing, and biomedical research projects that organize data generation and sharing through a common platform. The document proposes a distributed graph database and computing platform that would allow researchers to query diverse public and private datasets, providing immediate returns for data sharing. By making others' data useful to analyze and mine, researchers would be competitively disadvantaged not to share their own data. The goal is to enable open sharing by addressing current problems and remaining agile for future needs.
This document summarizes Dr. C. Titus Brown's work improving the chicken genome and transcriptome. It discusses the current state of the chicken genome assembly (galGal2-4) and limitations, including issues with microchromosomes. It then compares the Moleculo and PacBio long-read sequencing technologies, showing that while Moleculo has higher throughput and lower error, it does not resolve missing genes. The document evaluates how different gene models (Ensembl vs GIMME) can affect pathway prediction from RNA-seq data, finding more complete pathways with GIMME. It stresses that reference genomes and gene models should be regularly updated to generate more accurate hypotheses from mRNA-seq data.
This document summarizes a study that benchmarked different metagenomic assembly approaches using a mock microbial community. The study found that while assembly generally improves functional annotation over analyzing unassembled reads, current assembly methods still have room for improvement, especially regarding misassemblies. The document also describes efforts to establish standardized assembly protocols and benchmarks in order to evaluate progress and better understand the challenges. Computational requirements for assembly remain high but are decreasing as methods improve.
This document discusses scalable computational approaches for exploring microbial diversity using metagenomic sequencing data. It describes a "digital normalization" algorithm that uses a streaming computational approach to lossy compression of sequencing data in a memory- and time-efficient way. This allows assembly and analysis of very large soil metagenomic datasets totaling over 1.8 terabases. Comparison of Iowa prairie and corn field samples showed 51% nucleotide overlap, suggesting similar genomic content between these environments.
This document discusses opportunities and challenges presented by next-generation DNA sequencing technologies. It begins by introducing the speaker, C. Titus Brown, and their commitment to open science. It then describes the dramatic decreases in cost and increases in scale of DNA sequencing. While this enables sequencing entire genomes and environmental samples, it presents challenges for analysis due to lack of reference genomes and limited computational tools. The document outlines goals for shotgun sequencing analysis and challenges for non-model organisms. It concludes by emphasizing the need for training in data analysis to take advantage of the vast amounts of sequencing data being generated.
This document discusses digital normalization, a technique for eliminating redundant sequencing data while retaining genomic information. It can be applied as a streaming algorithm to process sequencing data in a memory- and time-efficient manner. Digital normalization allows assembly graphs to scale with the underlying genome size rather than the total amount of data by discarding redundant reads and errors during sequencing. It enables assembly of various types of genomic data using limited computational resources.
5. Hat tip to Narayan Desai / ANL
We don’t have enough resources or people to analyze data.
6. Data generation vs. data analysis
It now costs about $10,000 to generate a 200 GB sequencing data set (DNA) in about a week. (Think: resequencing human; sequencing expressed genes; sequencing metagenomes, etc.) …x1000 sequencers.
Many useful analyses do not scale linearly in RAM or CPU with the amount of data.
7. The challenge?
Massive (and increasing) data generation capacity, operating at a boutique level, with algorithms that are wholly incapable of scaling to the data volume.
Note: cloud computing isn’t a solution to a sustained scaling problem!! (See: Moore’s Law slide)
8. Life’s too short to tackle the easy problems – come to academia!
(Chart labels: “Easy stuff like Google Search”; “Awesomeness”.)
9. A brief intro to shotgun assembly
It was the best of times, it was the wor
   , it was the worst of times, it was the
   isdom, it was the age of foolishness
   mes, it was the age of wisdom, it was th
=> It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for 2 bn fragments.
Not subdivisible; not easy to distribute; memory intensive.
10. Define a hash function (word => num)
def hash(word):
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value
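As a quick sanity check, the hash above can be exercised directly. MAX_K is never shown on the slides; the value below is an arbitrary assumption for this sketch:

```python
MAX_K = 32  # assumed cap on word length; any bound works for this sketch

def hash(word):
    # positional base-128 hash over character codes, as on the slide
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

print(hash('a'))           # ord('a') = 97
print(hash('ab'))          # 97 + 98*128 = 12641
print(hash('ab') % 1001)   # index into a table of size 1001 -> 629
```

Taking the value modulo a table size turns the hash into a slot index, which is exactly how the Bloom filter on the next slide uses it.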
11.
class BloomFilter(object):
    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [(size, [0] * size)
                       for size in tablesizes]
        self.k = k

    def add(self, word):    # insert; ignore collisions
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all(ht[val % size]
                   for (size, ht) in self.tables)
14. Storing words in a Bloom filter
>>> x = BloomFilter([1001, 1003, 1005])
>>> 'oogaboog' in x
False
>>> x.add('oogaboog')
>>> 'oogaboog' in x
True
>>> x = BloomFilter([2])    # …false positives
>>> x.add('a')
>>> 'a' in x
True
>>> 'b' in x
False
>>> 'c' in x
True
16. Storing text in a Bloom filter
class BloomFilter(object):
    …
    def insert_text(self, text):
        for i in range(len(text)-self.k+1):
            self.add(text[i:i+self.k])
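Assembled from the last few slides, a self-contained version of the sketch can be run directly. MAX_K and DEFAULT_K do not appear on the slides; the values here are assumptions:

```python
MAX_K = 32      # assumed cap on word length
DEFAULT_K = 8   # assumed k-mer size

def hash(word):
    # positional base-128 hash over character codes
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

class BloomFilter(object):
    def __init__(self, tablesizes, k=DEFAULT_K):
        # one bit table per size; more tables => fewer false positives
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):            # insert; ignore collisions
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):   # all tables must agree
        val = hash(word)
        return all(ht[val % size] for (size, ht) in self.tables)

    def insert_text(self, text):    # store every overlapping k-mer
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])

x = BloomFilter([1001, 1003, 1005])
x.insert_text('the quick brown fox')
print('he quick' in x)   # True: this 8-mer was stored
print('zzzzzzzz' in x)   # False, unless every table collides (a false positive)
```

Membership of any stored k-mer is guaranteed (no false negatives); absent k-mers can only err on the side of a false "in".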
17.
def next_words(bf, word):    # try all 1-ch extensions
    prefix = word[1:]
    for ch in bf.allchars:
        word = prefix + ch
        if word in bf:
            yield ch

# descend into all successive 1-ch extensions
def retrieve_all_sentences(bf, start):
    word = start[-bf.k:]
    n = -1
    for n, ch in enumerate(next_words(bf, word)):
        ss = retrieve_all_sentences(bf, start + ch)
        for sentence in ss:
            yield sentence
    if n < 0:
        yield start
19. Storing and retrieving text
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('foo bar bazbif zap!')
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> print retrieve_first_sentence(x, 'foo bar ')
foo bar bazbif zap!
>>> print retrieve_first_sentence(x, 'the quic')
the quick brown fox jumped over the lazy dog
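The demos call retrieve_first_sentence, which the slides never show. A plausible minimal sketch in the same style: greedily follow the first surviving 1-character extension, with a length cap to guard against cycles from repeats or false positives. Everything here (allchars, MAX_K, DEFAULT_K, the function body itself) is an assumption, not the slides' actual code:

```python
import string

MAX_K = 32
DEFAULT_K = 8

def hash(word):
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

class BloomFilter(object):
    # assumed alphabet for graph traversal
    allchars = string.ascii_lowercase + string.ascii_uppercase + ' .,!'

    def __init__(self, tablesizes, k=DEFAULT_K):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all(ht[val % size] for (size, ht) in self.tables)

    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])

def next_words(bf, word):          # all 1-char extensions present in bf
    prefix = word[1:]
    for ch in bf.allchars:
        if prefix + ch in bf:
            yield ch

def retrieve_first_sentence(bf, start, maxlen=1000):
    # greedy walk: always take the first matching extension;
    # maxlen guards against infinite loops on repetitive input
    sentence = start
    while len(sentence) < maxlen:
        extensions = list(next_words(bf, sentence[-bf.k:]))
        if not extensions:
            break
        sentence += extensions[0]
    return sentence

x = BloomFilter([1001, 1003, 1005, 1007])
x.insert_text('the quick brown fox jumped over the lazy dog')
print(retrieve_first_sentence(x, 'the quic'))
```

Because the structure is probabilistic, the walk can occasionally wander into a false-positive branch and return a garbled tail, exactly the failure mode slide 22 demonstrates.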
20. Sequence assembly
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('the quick brown fox jumped ')
>>> x.insert_text('jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic')
the quick brown fox jumped over the lazy dog
(This is known as the de Bruijn graph approach to assembly; c.f. Velvet, ABySS, SOAPdenovo)
21. Repetitive strings are the devil
>>> x = BloomFilter([1001, 1003, 1005, 1007])
>>> x.insert_text('nanana, batman!')
>>> x.insert_text('my chemical romance: nanana')
>>> retrieve_first_sentence(x, 'my chemical')
'my chemical romance: nanana, batman!'
22. Note, it’s a probabilistic data structure
Retrieval errors:
>>> x = BloomFilter([1001, 1003])    # small Bloom filter…
>>> x.insert_text('the quick brown fox jumped over the lazy dog')
>>> retrieve_first_sentence(x, 'the quic')
'the quick brY'
23. Assembling DNA sequence
Can’t directly assemble with the Bloom filter approach (false connections, and also lacking many convenient graph properties).
But we can use the data structure to grok graph properties and eliminate/break up data:
- Eliminate small graphs (no false negatives!)
- Disconnected partitions (parts -> map reduce)
- Local graph complexity reduction & error/artifact trimming
…and then feed into other programs.
This is a data-reducing prefilter.
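As one concrete illustration of the error-trimming idea (a simplified sketch, not the actual implementation behind these slides): trim a read at its first k-mer missing from a filter built on trusted sequence. Because Bloom filters have no false negatives, a missing k-mer is guaranteed to be untrusted; false positives only make trimming slightly too lenient. The helper trim_at_first_unseen and all constants are hypothetical:

```python
MAX_K = 32

def hash(word):
    assert len(word) <= MAX_K
    value = 0
    for n, ch in enumerate(word):
        value += ord(ch) * 128**n
    return value

class BloomFilter(object):
    def __init__(self, tablesizes, k=8):
        self.tables = [(size, [0] * size) for size in tablesizes]
        self.k = k

    def add(self, word):
        val = hash(word)
        for size, ht in self.tables:
            ht[val % size] = 1

    def __contains__(self, word):
        val = hash(word)
        return all(ht[val % size] for (size, ht) in self.tables)

    def insert_text(self, text):
        for i in range(len(text) - self.k + 1):
            self.add(text[i:i + self.k])

def trim_at_first_unseen(bf, read):
    # truncate just before the first k-mer absent from bf; an absent
    # k-mer cannot be a false negative, so the cut is always safe
    for i in range(len(read) - bf.k + 1):
        if read[i:i + bf.k] not in bf:
            return read[:i + bf.k - 1]
    return read

bf = BloomFilter([1001, 1003, 1005, 1007])
bf.insert_text('GATTACAGATTACAGATTACA')
print(trim_at_first_unseen(bf, 'GATTACAGATTACA'))   # unchanged: every k-mer was stored
```

A read containing a sequencing error (e.g. 'GATTACAGAXTACA') would very likely be cut back to the prefix whose k-mers all appear in the trusted set.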
24. Right, but does it work??
Can assemble ~200 GB of metagenome DNA on a single 4xlarge EC2 node (68 GB of RAM) in 1 week ($500).
…compare with: not at all, on a 512 GB RAM machine.
Error/repeat trimming on a tricky worm genome: reduced from 170 GB resident / 60 hrs to 54 GB resident / 13 hrs.
25. How good is this graph representation?
- V. low false positive rates at ~2 bytes/k-mer; nearly exact human genome graph in ~5 GB.
- Estimate we eventually need to store/traverse 50 billion k-mers (soil metagenome).
- Good failure mode: it’s all connected, Jim! (No loss of connections => good prefilter)
- Did I mention it’s constant memory? And independent of word size?
- …only works for de Bruijn graphs.
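Back-of-envelope, the two numbers on this slide are consistent with each other: 50 billion k-mers at roughly 2 bytes each lands on the order of 100 GB, i.e. large but within reach of a big-memory machine:

```python
kmers = 50e9          # projected k-mers to store/traverse (soil metagenome)
bytes_per_kmer = 2    # approximate cost at very low false positive rates
print(kmers * bytes_per_kmer / 1e9)  # -> 100.0 (GB)
```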
26. Thoughts for the future
- Unless your algorithm scales sub-linearly as you distribute it across multiple nodes (hah!), or your problem size has an upper bound, cloud computing isn’t a long-term solution in bioinformatics.
- Synopsis data structures & algorithms (which incl. probabilistic data structures) are a neat approach to parsing problem structure.
- Scalable in-memory local graph exploration enables many other tricks, including near-optimal multinode graph distribution.