:)
Tiny Google Projects
tiny   :projects
Tesseract OCR

1985: HP         2006: Google
2006: TIFF       2011: *
2009: Text       2010: layout
2007: 6          2011: 33

Arabic, English, Bulgarian, Catalan, Czech, Chinese (Simplified and
Traditional), Danish (standard and Fraktur script), German, Greek,
Finnish, French, Hebrew, Croatian, Hungarian, Indonesian, Italian,
Japanese, Korean, Latvian, Lithuanian, Dutch, Norwegian, Polish,
Portuguese, Romanian, Russian, Slovak (standard and Fraktur script),
Slovenian, Spanish, Serbian, Swedish, Tagalog, Thai, Turkish,
Ukrainian and Vietnamese
Officially supported: [platform logos not preserved]
Probably runs on: [platform logos not preserved]
Image processing
Google Refine
Runs on: [logos not preserved]
Runs in: [logos not preserved]
Major features:

  • Import from anywhere
  • Faceting
  • Clustering
  • Split/create custom columns
  • GREL transformations
  • Export/etc
google protocol buffers

message Person {
  required int32 id = 1;
  required string name = 2;
  optional string email = 3;
}

>

Person person;
person.set_id(123);
person.set_name("Bob");
person.set_email("bob@example.com");

fstream out("person.pb", ios::out ...
person.SerializeToOstream(&out);
out.close();
          512   bytes / tweet
  340,000,000   tweets / day (2012)
7,253,333,333   bytes / hour
    2,014,814   bytes / second
        1.921   Mbytes / second
       15.371   Mbits / second

            8   Tbytes / day (2011)

Google: ~377M searches/day
    MapReduce
snappy
http://code.google.com/p/snappy/

  • Fast
  • Stable
  • Robust
  • Free and BSD
[Chart: Size (less is better). Compression ratio (%), axis 0 to 80, for lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2. Bar values not preserved.]
[Chart: Data types. Compression ratio, axis 0 to 6, snappy vs zlib for plain text, html, jpeg. Bar values not preserved.]
Size

from 20% to 100% bigger than zlib's fastest mode

                :(

     ...not for amazon glacier
[Chart: Speed. Compression (MB/s) (more is better), axis 0 to 250, for lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2. Bar values not preserved.]
[Chart: Speed. Decompression (MB/s) (more is better), axis 0 to 500, for lzjb 2010, lzo 2.04 1x, fastlz 0.1 -1, fastlz 0.1 -2, lzf 3.6 vf, lzf 3.6 uf, lzrw1, lzrw1-a, lzrw2, lzrw3, lzrw3-a, snappy 1.0, quicklz 1.5.0 -1, quicklz 1.5.0 -2. Bar values not preserved.]
On 1 core of 64-bit Core i7 processor:

  • Compression:        250MB/s

  • Decompression: 500MB/s

                   :P
Portable, but primarily optimized for 64-bit x86-compatible processors
Used:

  • BigTable
  • MapReduce
  • Google RPC
  • Hadoop
Bindings:
@TarasRoshko

HTTP headers here:
http://code.google.com/p/snappy/source/browse/trunk/framing_format.txt
QA?

Ostap Andrusiv
Software Engineer
Eleks software
@p1f

Editor's Notes

  1. http://www.cloudera.com/blog/2011/09/snappy-and-hadoop/
  2. In-memory test (compression and decompression) with ENWIK8 using 1 core of Intel Xeon X5355 @ 2.66GHz (64-bit compilation under gcc 4.1.1 (Linux) -O3 -fomit-frame-pointer -fstrict-aliasing -fforce-addr -ffast-math --param inline-unit-growth=999 -DNDEBUG)
  3. Compression ratio:

                   snappy     zlib
     plain text    1.5-1.7    2.7
     html          2-4        3-7
     jpeg          1          1

  4. http://aws.amazon.com/glacier/
  5. http://pastebin.com/SFaNzRuf
  6. http://encode.ru/threads/1255-Google-released-Snappy-compression-decompression-library