My blog has been moved to ariya.ofilabs.com.
Showing posts with label optimization.

Tuesday, July 26, 2011

tablets and web performance

Benchmarks, and the results of running them, are attractive because they eliminate the need to digest an arbitrarily complex piece of machinery, reducing it to a single meaningful and powerful number. Comparison thereby becomes easier, as it is now a matter of playing the who-has-the-biggest-gun game.

In the area of web performance, every benchmark is a bit like durian: you either love it or hate it. No matter how useful (or useless) a benchmark is, there are always folks who defend it and others who despise it. The signal-to-noise ratio drops dramatically once a discussion over some benchmark results gets started.

I still reckon that in the years to come, what makes a great experience while browsing the web will depend on the performance of (surprise!) DOM access. Common JavaScript frameworks (jQuery, Prototype, Ext JS, MooTools, YUI, Dojo, and many others) still form the basis of a lot of rich web sites and interactive web applications out there, at least for the foreseeable future.

While the SunSpider and V8 benchmarks are geared towards pure JavaScript performance and Kraken is better suited for future heavyweight applications, Dromaeo is a solid candidate for DOM performance analysis. In particular, its set of DOM tests is very valuable because it presents a nice sample of the behavior of framework-based sites. In this context, butter-smooth DOM modification has a bigger impact than blazing-fast trigonometric computation, at least for the gajillions of web pages out there.

Since more and more people are accessing the web through mobile platforms these days, I decided to test several popular tablets out there and summarize the result in one graph below (updated):

For the detailed comparisons, check out the complete Dromaeo numbers of all the tablets (left to right: Galaxy Tab, iPad 2, PlayBook, TouchPad). If you find that the above results differ from what you measure yourself, shout out; I want to be careful not to propagate any discrepancies or misleading results. As usual, take the above beautiful collection of colored bars with a pinch of salt.

The Samsung Galaxy Tab 10.1 is powered by Android 3.1 (Honeycomb) build HMJ37, the iPad 2 uses iOS 4.3.3, the RIM PlayBook's firmware is 1.0.7.2670, and the HP TouchPad runs webOS 3.0. The choice of devices represents a variety of fresh ARM-based tablet operating systems on the market as of this writing.

With Qt coming closer and closer to becoming a good companion of the green robot, I wonder how QtWebKit would compare against those numbers. I think we will find out the answer in a couple of months, maybe even sooner.

Wednesday, July 06, 2011

fluid animation with accelerated composition

Those who work on web-based applications on mobile platforms often recall the advice, "Use translate3d to make sure it's hardware accelerated". This advice seems magical at first, yet I seldom find anyone who explains (or wants to explain) the actual machinery behind such a practical tip.

For a while, Safari (and Mobile Safari) was the only WebKit-based browser which supported hardware-accelerated CSS animation. Google Chrome caught up, and QtWebKit-powered browsers (like the one in the Nokia N9) finally supported it as well. Such a situation often gave the wrong impression that Apple kept the hardware-acceleration code to themselves.

The above two are basically the reasons for this blog post.

In case you missed it (before we dive in further), please read what I wrote before about the different WebKit ports (to get the idea of the implementation + back-end approach) and the tiled backing store (decoupling web page complexity from a smooth UX). The GraphicsContext abstraction will be especially useful in this topic, because animation is tightly related to efficient graphics.

Imagine you have to move an image (of a unicorn, for example) from one position to another. The pseudo-code for doing it would be:

  for pos = startPosition to endPosition
    draw unicorn at pos

To ensure smooth 60 fps, your inner loop has only 16 ms to draw that unicorn image. Usually this is a piece of cake, because all the CPU does is send the pixels of the unicorn image to the GPU once (in the form of a texture) and then just refer to that texture inside the animation loop. No heavy work is needed on either the CPU or the GPU side.

If, however, what you draw is very complicated, e.g. formatted text consisting of different font typefaces and sizes, things get hairy. The "draw" part can take more than 16 ms and the animation is not butter-smooth anymore. Because the text does not really change during the animation, only its position does, the usual trick is to cache it, i.e. draw it onto a buffer and then just move the buffer around as needed. Again, the CPU only needs to push the buffer to the GPU once:

    prepare a temporary buffer
    draw the text onto the buffer
    for pos = startPosition to endPosition
       set a new transformation matrix for the buffer
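
In Qt terms, a minimal sketch of this caching trick (names are mine, not taken from any particular project) could look like this: lay out and draw the expensive content into a QPixmap once, then only move the cached pixmap around during the animation:

    #include <QFont>
    #include <QPainter>
    #include <QPaintEvent>
    #include <QPixmap>
    #include <QString>
    #include <QWidget>

    class ScrollingText : public QWidget
    {
    public:
        ScrollingText(const QString &text)
            : m_cache(400, 100), m_offset(0)
        {
            // expensive part, done once: render the text into a buffer
            m_cache.fill(Qt::white);
            QPainter p(&m_cache);
            p.setFont(QFont("Times", 24));
            p.drawText(m_cache.rect(), Qt::AlignCenter, text);
        }

        void step() { ++m_offset; update(); }   // called by an animation timer

    protected:
        void paintEvent(QPaintEvent *)
        {
            // cheap part, every frame: just blit the cached buffer at a new position
            QPainter p(this);
            p.drawPixmap(m_offset, 20, m_cache);
        }

    private:
        QPixmap m_cache;
        int m_offset;
    };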

As you can imagine, that's exactly what happens when WebKit performs CSS animation. Instead of drawing your div (or whatever you animate) multiple times in different positions, it prepares a layer and redirects the drawing there. After that, animation is a simple matter of manipulating the layer, e.g. moving it around. The WebKit term for this (useful if you comb the source code) is accelerated compositing.

Side note: Mozilla has the same concept, available since Firefox 4, called Layer.

If you understand immediate vs. retained mode rendering, non-composited vs. composited is just like that. The idea is to treat the render tree more like a scene graph, a stack of layers with different depth values.

Because composition reduces the computation burden (the GPU can handle a varying transformation matrix efficiently), the animation is smoother. This is not so noticeable if you have a modern machine. In the following video demo (http://youtu.be/KujWTTRkPkM), I had to use my slightly old Windows laptop to demonstrate the difference in frames per second:

The excellent falling leaves animation is something you have seen before, back when WebKit support for CSS animation was announced.

Accelerated composition does not magically make every WebKit port capable of fluid animation. Analogous to my previous rounded-corner example, composition requires support from the underlying platform. In the Mac OS X port of WebKit, composition is mapped onto Core Animation (part of the QuartzCore framework), the official API for animated user interfaces. The same goes for iOS WebKit. On Chromium, it is hooked into the sandboxed GPU process.

With QtWebKit, composition is achieved via the Graphics View framework (read Noam's explanation for details). The previous video you saw was created with QtWebKit, running without and with composition, i.e. QGraphicsWebView with a different AcceleratedCompositingEnabled run-time setting. If you want to check out the code and try it yourself, head to the usual X2 repository and look under webkit/composition. Use the spacebar (or a mouse click) to switch between composited and non-composited mode. If there is no significant frame rate improvement, increase NUMBER_OF_LEAVES in leaves.js and rebuild. When compositing is active, press D to draw a thin yellow border around each layer. Since it's all about Graphics View, this debugging feature is easy to implement: I just inject a custom BorderEffect, based on QGraphicsEffect (which I prototyped back when I was with Nokia).
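
Here is a rough sketch of both pieces, with hypothetical names that do not come from the X2 code: flipping the run-time setting on a QGraphicsWebView, and a border-drawing effect along the lines of that BorderEffect:

    #include <QGraphicsEffect>
    #include <QGraphicsWebView>
    #include <QPainter>
    #include <QWebPage>
    #include <QWebSettings>

    // switch accelerated compositing on for the view (QtWebKit with Qt 4.7+)
    void enableCompositing(QGraphicsWebView *view)
    {
        view->page()->settings()->setAttribute(
            QWebSettings::AcceleratedCompositingEnabled, true);
    }

    // debugging aid: paint the item as usual, then outline it in yellow
    class BorderEffect : public QGraphicsEffect
    {
    protected:
        void draw(QPainter *painter)
        {
            drawSource(painter);                      // the layer's normal content
            painter->setPen(QPen(Qt::yellow, 1));
            painter->setBrush(Qt::NoBrush);
            painter->drawRect(boundingRect().adjusted(0.5, 0.5, -0.5, -0.5));
        }
    };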

Thus, there is no hidden secret with respect to Safari's hardware-accelerated CSS support. In fact, Safari is no different from other Mac apps: if you compile WebKit yourself and build an application with it, you definitely get the animation with hardware acceleration support.

As a bonus, since Mac and iOS WebKit delegate the animation to Core Animation (CA), you can use various CA tweaks to debug it. CA_COLOR_OPAQUE=1 will emphasize each layer with a red color overlay (as in the demo). While this applies to any CA-based app (not limited to WebKit or Safari), it's still very useful nevertheless. Chromium's similar feature is the --show-composited-layer-border command line option.

How does WebKit determine what to composite? Since the goal is to take full advantage of the GPU, there are a few particular operations which are suitable for such composition, among others transparency (opacity < 1.0) and transformation matrices. Ideally we would just use composition for the entire web page. However, composition implies a higher memory allocation and a quite capable graphics processor. On mobile platforms, these two translate into an additional critical factor: power consumption. Thus, one just needs to draw a line somewhere and stick with it. That's why currently (on iOS) translate3d and scale3d use composition while their 2-D counterparts do not. Addendum: on desktop WebKit, every transformed element is accelerated, regardless of whether it's 2-D or 3-D.

If you made it this far, here are a few final twists.

First of all, just like the tiled backing store approach I explained before, accelerated composition does not force you to use the graphics processor for everything. For efficiency, your layer (backing store) might be mapped to GPU textures. However, you are not obliged to prepare the layer, i.e. draw onto it, using the GPU. For example, you can use a software rasterizer to draw into a buffer which is then mapped to an OpenGL texture.

In fact, a further variation of this is not to use the GPU at all. This may come as a surprise, but Android 2.2 (Froyo) added composition support (see the commit), albeit doing everything in software (via its Skia graphics engine). The advantage is of course not as great as using OpenGL ES entirely, but the improvement is really obvious. If you have two Android phones (of the same hardware specification), one still running the outdated 2.1 (Eclair) and the other with Froyo, just open the falling leaves demo and watch the frame rate difference.

With the non-GPU, composition-based CSS animation in Froyo, translate3d and other similar tricks do not speed up anything significantly. In fact, they may haunt you with bugs. For example, placing form elements in a div could wreck the accuracy of touch events, mainly because the hit-test procedures forget to take into account that the composited layer has moved. Things which seemed to work just fine on Eclair may start behaving weirdly under Froyo and Gingerbread. If that happens to you, check your CSS properties.

Fortunately (or unfortunately, depending on your point of view), Android's madness with accelerated composition is getting better with Honeycomb and further upcoming releases. Meanwhile, just take it for granted that your magical translate3d spell has no effect on the green robots.

Last but not least, I'm pretty excited about the lightweight scene graph direction in the upcoming Qt 5. If anything, this will be a better match for QtWebKit's accelerated composition than the current Graphics View solution. This would totally destroy the myth (or misconception) that only native apps can take advantage of OpenGL (ES). Thus, if you decide to use web technologies via QtWebKit (possibly through the hybrid approach), your investment will be future-proof!

Thursday, June 30, 2011

quaternion multiplication: two years later

Some time back I wrote (fun fact: it's Google's first hit for faster quaternion multiplication) about my favorite commit, made to Qt exactly two years ago:

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).

Ages ago, during my Ph.D. research, when I worked with a certain hardware platform (hint: it's not a general-purpose CPU), minimizing the required number of hardware multipliers with very little impact on computation speed made a huge difference. With today's advanced processor architectures, armed with vectorized instructions and really smart optimizing compilers, there is often no need to use the factorized version of the multiplication.

Side note: if you want a reason to like quaternions, see this simple rotation quiz, which can be solved quite easily once you know quaternions.

I tried to apply the same trick to PhiloGL, an excellent WebGL framework from Nicolas. Recently, to my delight, he added quaternion support to the accompanying math library in PhiloGL. I thought this was a nice chance to try the old trick, as I expected that reducing the number of multiplications from 16 to just 9 could give some slight performance advantage.

It turns out that this is not the case, at least based on the benchmark tests running on modern browsers with very capable JavaScript engines. You can try the test yourself at jsperf.com/quaternion-multiplication. I have no idea whether this is due to JSPerf (very unlikely) or simply because the longer construct of the factorized version does not really speed up anything. If anything, it seems that the number of executed instructions matters more than whether addition is much faster than multiplication. And of course, since we're talking about modern CPUs, the difference becomes even more subtle.

With the help of Nicolas, I tried various other tricks to help the JavaScript engine, mainly around different ways to prepare the persistent temporary variables: using normal properties, using Array, using Float32Array (at the cost of precision). Nothing led to any significant improvement.

Of course, if you have other tricks up your sleeve, I welcome you to try them with the benchmark. Meanwhile, let's hope that someday some JavaScript engine will run the factorized version faster. It's just a much cooler way to multiply quaternions!

Monday, June 27, 2011

progressive rendering via tiled backing store

Imagine you have to create a CAD-grade application, e.g. drawing the entire wireframe of a space shuttle or showing the intricacies of a 9-layer printed circuit board; basically something that involves heavy work to display the result on the screen. On top of that, the application is still expected to perform smoothly when the user wants to pan/scroll around and zoom in/out.

The usual trick to achieve this is to employ a backing store, i.e. an off-screen buffer that serves as the target for the drawing operations. The user interface then takes the backing store and displays it to the user. Now panning is a matter of translation and zooming is just scaling. The backing store can be updated asynchronously, thus decoupling the user interaction from the complexity of the rendering.

Moving to a higher ninja level, the backing store can be tiled. Instead of one giant snapshot of the rendering output, it is broken down into small tiles, say 128x128 pixels. The nice thing is that each tile can be mapped to a texture in the GPU, e.g. via glTexImage2D. Drawing each textured tile is also a (tasty) piece of cake: GL_QUADS with glBindTexture.
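
As a rough sketch (plain desktop OpenGL, immediate mode, assuming a current context with GL_TEXTURE_2D enabled), uploading one tile and drawing it as a textured quad boils down to a handful of calls:

    #include <GL/gl.h>

    const int TileSize = 128;

    // upload one TileSize x TileSize RGBA tile and return its texture id
    GLuint uploadTile(const unsigned char *pixels)
    {
        GLuint textureId;
        glGenTextures(1, &textureId);
        glBindTexture(GL_TEXTURE_2D, textureId);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, TileSize, TileSize, 0,
                     GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        return textureId;
    }

    // draw the tile so that its top-left corner lands at (x, y)
    void drawTile(GLuint textureId, float x, float y)
    {
        glBindTexture(GL_TEXTURE_2D, textureId);
        glBegin(GL_QUADS);
        glTexCoord2f(0, 0); glVertex2f(x, y);
        glTexCoord2f(1, 0); glVertex2f(x + TileSize, y);
        glTexCoord2f(1, 1); glVertex2f(x + TileSize, y + TileSize);
        glTexCoord2f(0, 1); glVertex2f(x, y + TileSize);
        glEnd();
    }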

Another common use-case for tiling is online maps. You probably use it every day without realizing it, in Google Maps, OpenStreetMap, or other similar services. In this case, the reason to use tiles is mainly to ease the network aspect. Instead of sending one huge image representing the area seen by the user in the viewport, lots of small images are transported and stitched together by the client code in the web browser.

Here is an illustration of the concept. The border of each tile is emphasized. The faded area is what you don't see (outside the viewport). Of course, every time you pan and zoom, fresh tiles are fetched so as to cover the viewport as much as possible.

When I started to use the first-generation iPhone years ago, I realized that the browser (or rather, its WebKit implementation) uses a very similar trick. Instead of drawing the web page straight to the screen, it uses a tiled backing store. Zooming (via pinching) becomes really cheap: it's just a matter of scaling up and down. Flicking is the same story: translating textures does not bother any mobile GPU that much.

Every iOS user knows that if you manage to beat the browser and flick fast enough, it tries to catch up and fill the screen as fast as possible, but every now and then you'll see some sort of checkerboard pattern. That is actually the placeholder for tiles which are not ready yet.

Since all the geeks out there likely understand the technique better with a piece of code, I won't waste more paragraphs and will present you this week's X2 example: a full-featured implementation of a tiled backing store in under 500 lines of Qt and C++. You can get the code from the usual X2 git repository; look under graphics/backingstore. When you compile and launch it, use mouse dragging to pan around and the mouse wheel to zoom in/out. For the impatient, see the following 50-second screencast (or watch it directly on YouTube):

For this particular trick, what you render actually does not matter much (it could be anything). To simplify the code, I do not use WebKit and instead just focus on SVG rendering, in particular of that famous Tiger head. The code should be pretty self-explanatory, especially the TextureBuffer class, but here are some random notes for your pleasure.

At the beginning, every tile is invalid (=0). Every time the program needs to draw a tile, it first checks whether the tile is valid. If it is not, it substitutes the checkerboard pattern (also called the default texture) and triggers an asynchronous update process. During the update, the program looks for the most important tile which needs to be updated (usually the one closest to the viewport center). What is a tile update? It's the actual rendering of the SVG, clipped exactly to the rectangular bounding box represented by the tile, into a texture.
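
As a rough illustration (hypothetical names, not the actual X2 code), rendering one tile amounts to offsetting the painter so that only the tile's region of the scaled SVG lands in the target image:

    #include <QImage>
    #include <QPainter>
    #include <QPointF>
    #include <QRectF>
    #include <QSvgRenderer>

    static const int TileSize = 128;

    // render the part of the SVG covered by the tile at (col, row), at the
    // given zoom factor, into a TileSize x TileSize image (the future texture)
    QImage renderTile(QSvgRenderer *renderer, int col, int row, qreal zoom)
    {
        QImage tile(TileSize, TileSize, QImage::Format_ARGB32_Premultiplied);
        tile.fill(0);   // start from a fully transparent tile

        QPainter painter(&tile);
        // shift the painter so that this tile's area maps to the image origin
        painter.translate(-col * TileSize, -row * TileSize);
        // draw the whole document scaled by the zoom factor; anything outside
        // the tile is clipped away automatically by the image bounds
        QSizeF scaled = QSizeF(renderer->defaultSize()) * zoom;
        renderer->render(&painter, QRectF(QPointF(0, 0), scaled));
        painter.end();

        return tile;
    }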

To show the mix-and-match, I actually use Qt's built-in software rasterizer to draw the SVG. That demonstrates that, even though each tile is essentially an OpenGL texture, you are not forced to use OpenGL to prepare the tile itself. This is essentially mixing rasterization done by the CPU with texture management handled by the GPU.

As I mentioned before, panning is a matter of adjusting the translation offset. Zooming is trickier: it involves scaling the textures up (or down) appropriately. At the same time, it also triggers an asynchronous refresh. The refresh function does nothing but reset all the tiles to invalid again, which in turn updates them one by one. This gives the following effect (illustrated in the screenshot below): if you suddenly zoom in, you see pixelated rendering (left); after a certain refresh delay, the tile update makes the rendering crisp again (right).

Zooming GLTiger

Because we still need to show the outdated tiles scaled up/down (those pixelated ones), we have to keep them around for a while until the refresh process is completed. This is why there is another texture buffer, called the secondary background buffer. Rest assured, when none of the tiles in the background buffer are needed anymore, the buffer is flushed.

If you really want to follow the update and refresh, try uncommenting the debug compiler define. Besides showing the individual tiles better, that flag also intentionally slows down both update and refresh so your eyes have more time to trace them.

BTW, how would you determine the tile dimension in pixels? Unfortunately this can vary from one piece of hardware to another. Ideally it's not too small, because then you'd enjoy the penalty of logical overdraw. If it's too large, you might not be progressive enough. Trial and error; that can be your enlightenment process.

Being an example, this program makes a lot of simplifications. First of all, usually you want the tile update to take place in a separate thread, probably updating a few tiles at once. With proper thread affinity, this helps improve the overall perceived smoothness. Also, if you know upfront that it does not impact performance too much, using texture filtering (instead of just GL_NEAREST) for the scaling gives a better zooming illusion.

You might also notice that I decided not to use a tile cache approach in the texture buffer. This is again done for simplicity. The continuous pruning of unused textures ensures that we don't endlessly grow the textures and kill the GPU. If you really insist on the absolute minimum amount of overdraw and texture usage, then go for a slightly more complicated cache system.

Since I'm lazy, the example uses OpenGL and quad drawing. If you want to run it on a mobile platform, you have to patch it so that it works with OpenGL ES. In any case, converting it to use vertex and texture arrays is likely a wise initial step. While you are there, hook up the touch events so you can also do the pinch-to-zoom effect.

If you are brave enough, here is another nice finishing touch (as suggested by Nicolas of InfoVis and PhiloGL fame). When you zoom in, the tiles in the center are prioritized. However, when you zoom out, the tiles near the viewport border should get first priority, in order to fill the viewport as fast as possible.

Progressive rendering via a tiled backing store is the easiest way to take advantage of the graphics processor. It's of course just one form, probably the simplest one, of hardware acceleration.

Friday, July 09, 2010

faster quaternion multiplication

Sweet memories, it was fun to derive it.

"Faster" here must however be taken with a grain of salt, as the new code is not always guaranteed to be better pipelined.

And of course, it's trivial to beat this generic C code with architecture-specific hand-rolled assembly.

git show cbc22908
commit cbc229081a9df67a577b4bea61ad6aac52d470cb
Author: Ariya Hidayat 
Date:   Tue Jun 30 11:18:03 2009 +0200

    Faster quaternion multiplications.
    
    Use the known factorization trick to speed-up quaternion multiplication.
    Now we need only 9 floating-point multiplications, instead of 16 (but
    at the cost of extra additions and subtractions).
    
    Callgrind shows that the function now takes 299 instructions instead of
    318 instructions, which is not a big win. However I assume the speed-up
    has a better effect for mobile CPU, where multiplications are more
    expensive.
    
    Reviewed-by: Rhys Weatherley

diff --git a/src/gui/math3d/qquaternion.h b/src/gui/math3d/qquaternion.h
index 55c871d..9a1b590 100644
--- a/src/gui/math3d/qquaternion.h
+++ b/src/gui/math3d/qquaternion.h
@@ -198,24 +198,17 @@ inline QQuaternion &QQuaternion::operator*=(qreal factor)
 
 inline const QQuaternion operator*(const QQuaternion &q1, const QQuaternion& q2)
 {
-    // Algorithm from:
-    // http://www.j3d.org/matrix_faq/matrfaq_latest.html#Q53
-    float x = q1.wp * q2.xp +
-                    q1.xp * q2.wp +
-                    q1.yp * q2.zp -
-                    q1.zp * q2.yp;
-    float y = q1.wp * q2.yp +
-                    q1.yp * q2.wp +
-                    q1.zp * q2.xp -
-                    q1.xp * q2.zp;
-    float z = q1.wp * q2.zp +
-                    q1.zp * q2.wp +
-                    q1.xp * q2.yp -
-                    q1.yp * q2.xp;
-    float w = q1.wp * q2.wp -
-                    q1.xp * q2.xp -
-                    q1.yp * q2.yp -
-                    q1.zp * q2.zp;
+    float ww = (q1.zp + q1.xp) * (q2.xp + q2.yp);
+    float yy = (q1.wp - q1.yp) * (q2.wp + q2.zp);
+    float zz = (q1.wp + q1.yp) * (q2.wp - q2.zp);
+    float xx = ww + yy + zz;
+    float qq = 0.5 * (xx + (q1.zp - q1.xp) * (q2.xp - q2.yp));
+
+    float w = qq - ww + (q1.zp - q1.yp) * (q2.yp - q2.zp);
+    float x = qq - xx + (q1.xp + q1.wp) * (q2.xp + q2.wp);
+    float y = qq - yy + (q1.wp - q1.xp) * (q2.yp + q2.zp);
+    float z = qq - zz + (q1.zp + q1.yp) * (q2.wp - q2.xp);
+
     return QQuaternion(w, x, y, z, 1);
 }

Friday, September 11, 2009

SVG: parsing and content optimization

A few weeks ago, just for a change (between the usual QtWebKit bug-fixing and patch juggling), I took a look at our QtSvg module. According to some internal reports, QtSvg is not fast enough when parsing large and complicated SVGs. Of course, slow is relative: slow compared to what? And arguably, parsing time is not as important as rendering time. But if you stash your user-interface elements in some sort of SVG theme, loading time becomes a factor (caching the pixmaps whenever possible also helps). Of course, a reduced size served from a web server can decrease the bandwidth as well (think of all the SVGs in Wikipedia).

Still, I decided to have a look, just in case there were low-hanging fruit I could grab. And I was right: far from being an SVG expert, with just two days of work I managed to squeeze its performance a bit, which you can already enjoy in the recent 4.6 preview.

The chart above (shorter is better) represents the comparison of the time spent in QSvgRenderer::load(), measured using the CPU tick counter (in millions of ticks), comparing Qt 4.5 and 4.6. I also tested some other files; see the bigger bar charts. In all measurements, the 95% confidence intervals were well below 1%. In-house Theme refers to an internal SVG that unfortunately I can't share. Tiger is the SVG version of the famous head in PostScript (taken from GNU Ghostscript), something I have shown before. Imperial Coat of Arms of France is another complex SVG, from Wikimedia Commons. World Map is the public-domain blank grayscale world map from Wikipedia. There are a bunch of other test files I used; they mostly show the same improvements.

As you can see, Qt 4.6 enjoys a bit of a speed-up (in some cases up to 1.4x) when loading and parsing SVGs.

However, I did not stop there. For the fun of it, I quickly hacked up a Qt-based, command-line SVG minifier, dubbed SVGMin. More about it can be read in the detailed Quick Start, but basically it tries to eliminate redundant garbage which has no effect whatsoever on the final rendering.

What follows is a chart showing the same type of measurement, with the result for the minified SVG added (see also the full comparison chart). The result should speak for itself:

I plan some more improvements to the SVG minifier, for example collapsing single-child groups (<g><path ...></g> makes no sense), grouping nodes with similar attributes (no need to duplicate the same fill colors over 100 circles), removing useless attributes (why have fill-* when fill is none?), and many others. Hold your breath.
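
To give an idea of what such a rule looks like in code, here is a hedged sketch (not the actual SVGMin implementation) of the fill-* cleanup using QDom:

    #include <QDomDocument>
    #include <QString>
    #include <QStringList>

    // when an element has fill="none", any fill-* attribute is useless and
    // can be dropped without changing the rendering
    void dropUselessFillAttributes(QDomElement element)
    {
        if (element.attribute("fill") == "none") {
            QStringList doomed;
            QDomNamedNodeMap attributes = element.attributes();
            for (int i = 0; i < attributes.count(); ++i) {
                QString name = attributes.item(i).nodeName();
                if (name.startsWith("fill-"))
                    doomed.append(name);
            }
            foreach (const QString &name, doomed)
                element.removeAttribute(name);
        }
        // recurse into child elements
        for (QDomElement child = element.firstChildElement();
             !child.isNull(); child = child.nextSiblingElement())
            dropUselessFillAttributes(child);
    }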

Wednesday, May 20, 2009

Chrome Experiments, Flash-killer, Monster Evolution

These days I am juggling balls. Besides fixing QtWebKit-related bugs and writing more examples on using QtWebKit, I am helping the Kinetic guys with Graphics View optimizations, and working on a new Graphics View feature I still can't elaborate on (sorry for the teaser :-). On top of that, I have already started to work on Qt Script, learning the intricacies of JavaScript along the way, stress-testing Kent's Qt Script binding generator, and surely also playing around with the brand-new Qt Script debugger in Qt 4.5. For the latter, even if you are not really into Qt Script, I highly recommend giving it a try; at least watch Kent's screencast for a start.

Unless you have stayed under a rock for the last few weeks, you have already heard about Chrome Experiments. These are some extremely cool demos designed to run in a web browser. Of course, there are already tons of demos out there. Chrome Experiments are however different: they rely on a high-performance browser. The reason: the demos are JavaScript intensive, and thus a blazing-fast JavaScript engine makes a difference. Coupled with the use of HTML5 Canvas, some of the demos simply show many things which were not possible before. Flash is still here to stay (due to its authoring tools, advanced features, library collections, and so on), but I won't be too surprised if this turns out to be a Flash-killer technology. There is O3D, but I reckon it tackles a different market segment.

My favorite Chrome Experiment is Monster Evolution, a fantastic demo written by Dean McNamee. Try launching it in your browser (warning: extremely slow if you don't use a state-of-the-art browser), or just watch the YouTube video. Impressive, isn't it?

When I saw the Monster demo for the first time, I thought it would be cool to be able to run it as a Qt application. Porting the demo from JavaScript to C++/Qt is one way to do it, but to make the challenge even more painful, I decided to run the monster.js code directly (warning: it's minified; better check Dean's open-source 3-D engine behind the demo). There are tons of ways to do this, and I tried almost every possible permutation. In the end, I managed to run Monster Evolution smoothly, at more than 25 fps on a fairly modern machine. What was supposed to be just a quick hack turned into a struggling but delirious adventure, resulting in a three-installment series.

For the YouTube generation, here is the time-lapsed screencast (if you prefer, grab the 3 MB AVI).

In the first part, The QtScript Menace, I used Qt's built-in ECMAScript interpreter to run the demo. Performance was not my main concern (though it gave me the chance to use our work-in-progress Qt Script version that uses JavaScriptCore as the back-end); rather, the trick was to run it with as little code as possible. I ended up with a hackish pure-JavaScript implementation of the canvas object, along with a few lines of glue code, using Qt Script's feature of making a QObject instance available to the script engine. Surprisingly, it works pretty well. It downloads the JavaScript code from the Internet, sets up some stuff and then runs it. For a program comprising 240 lines of C++ and 140 lines of JavaScript, I am pretty happy.
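
The glue itself is tiny; here is a hedged sketch of the technique (the global name and the function are mine, not from the actual program):

    #include <QObject>
    #include <QString>
    #include <QtScript/QScriptEngine>

    // expose a QObject (here: whatever implements the canvas back-end) to the
    // script engine under a global name, then run the downloaded script
    void runAgainstCanvas(QObject *canvasBackend, const QString &script)
    {
        QScriptEngine engine;
        engine.globalObject().setProperty("canvasBackend",
                                          engine.newQObject(canvasBackend));
        engine.evaluate(script);   // e.g. the contents of monster.js
    }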

For the second attempt, Attack of the SquirrelFish, JavaScriptCore was chosen as the engine that powers the demo, used via QtWebKit. Again the same trick was employed, with the glue code now relying on QWebFrame's addToJavaScriptWindowObject and evaluateJavaScript. This required only minimal changes, with improved performance as the result (significant and noticeable), especially when JIT is available.

The saga was closed with the third episode, Revenge of the Cylinders. This time the victim was V8, the JavaScript engine which powers Google Chrome. Again, the code change was minimal, considering that the V8 glue code for this little Qt application needs to be written manually. Of course this requires you to build V8; I have included instructions (which work on Linux, Mac, and Windows) in the accompanying README on how to do that.

Feel free to try all three methods and let us know the frames-per-second speed-up that you get!

Tuesday, March 03, 2009

quattro cinque zero

Today (three-three-nine, no conspiracy about the chosen date, please!) Qt Software releases Qt 4.5.0!

There are a number of interesting things about this particular release: lots of performance improvements, simultaneous release of Qt Creator 1.0 which enables the Qt SDK package, LGPL as the new license option, and plenty of new and improved features.

Personally this is an exciting release for me, since it contains some of my work since I joined Qt Software (née Trolltech). Among other things, I did a bit of optimization for Graphics View, touched the low-level graphics stuff at times where I understood it, and of course, as my main task, integrated a lot of WebKit features into the QtWebKit module.

Download Qt 4.5 (or via torrents) while it is fresh. Especially with the SDK package, there is no excuse not to learn C++ and Qt anymore.

PS: can you spot me in the group picture? :-)

Friday, February 27, 2009

JavaScript speed race: reloaded

After the recent public beta release of Safari 4, it is time to do another round of JavaScript performance testing (see the last one I did). Here I compare the unstable/development releases of different web browsers running the SunSpider benchmark (runs/minute) and the V8 benchmark (raw score). The test machine is a Lenovo T61 laptop armed with an Intel Core 2 Duo 2 GHz and 2 GB RAM, running Windows XP Professional SP2. The results follow (longer is better):

Google Chrome 2.0 is the unstable version from the developer channel, while Chrome 1.0 is the stable one. Opera 10.0 Alpha unfortunately still does not include Carakan, the brand-new fast JavaScript engine from Opera. Firefox 3.1 is tested with TraceMonkey enabled, i.e. with javascript.options.jit.chrome set to true in about:config. Konqueror 4.2 is installed from the KDE Windows project (MSVC 2005 build), the latest stable version, because its latest unstable release still points to 4.1.96. For safety reasons, Internet Explorer 8 runs inside the Xenocode browser sandbox, as the sandboxing overhead was found to be negligible. The laptop's power manager tool is set to give maximum performance, thus forcing the laptop to always run at its maximum speed.

Monday, January 26, 2009

Qt::CheatTransform

Qt::CheatTransform was supposed to be a joke in my Qt Developer Days graphics talk (though some people took it seriously). It still captures the idea of being blazing fast by cheating whenever you can. Basically this is about creating a thumbnail preview of a large image in an optimized way. Downscaling an image is usually done using the QImage::scaled() function. You have two choices: Qt::SmoothTransformation gives the best quality but is very slow, while Qt::FastTransformation is extremely fast at the expense of quality. But what if there were a compromise: not too slow, but with still acceptable quality? Well, apparently it is possible. Check out my latest Qt Labs blog post, which exposes exactly that trick.

In the above screenshot, the result of downscaling a 10-megapixel picture is depicted. There are three images, one each for Qt::FastTransformation, Qt::SmoothTransformation and the "cheat scaling" method. Use the keys 1 to 3 to switch the scaling method and watch the bottom-right image. If you flip back and forth between 2 (Qt::SmoothTransformation) and 3 (the cheat method), can you spot the different pixels immediately? Do you think the cheat-downscaled image is good enough?
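
I don't want to spoil the Qt Labs post, so this is only a hedged sketch of one common compromise along those lines (not necessarily the exact trick used there): do most of the shrinking with the fast mode, and only the final step with the smooth mode:

    #include <QImage>

    // quick-and-dirty thumbnail: fast-scale down to roughly twice the target
    // size first, then smooth-scale the small remainder
    QImage cheapThumbnail(const QImage &source, const QSize &target)
    {
        QImage rough = source.scaled(target * 2, Qt::KeepAspectRatio,
                                     Qt::FastTransformation);
        return rough.scaled(target, Qt::KeepAspectRatio,
                            Qt::SmoothTransformation);
    }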

How about the speed? Check out the following chart (longer is better), which speaks for itself:

Of course your mileage may vary. The above comparison is for downscaling a 10-megapixel image to something like 200x150 pixels. You may gain less for smaller source images, though.

Tuesday, January 20, 2009

secrets, a sign, a reason

Let us start with a comparison (longer is better):

If you look at my latest graphics example, 50% scaling of (A)RGB32 images, you will find a 10x faster way (compared to the QImage::scaled() function) to downscale an image to half its original size, of course with (approximately) the same visual quality as when you use Qt::SmoothTransformation.

For the readers who also listened to my Qt Developer Days graphics talk, you can guess why I make so much fuss just about image half-scaling :-) Bear with me and we will reach that point.

Monday, December 15, 2008

genie in a bottle

Let us start with a screenshot:

For the code, the hack behind it, and a screencast/video, check out my Qt Labs blog entry.

Saturday, December 13, 2008

every day a false start and it burns my heart

This week in Oslo we have visitors, as evidenced by this picture:


(Zoltan, Holger, Simon, another Zoltan, Enrico, Tor Arne, Ariya).

Basically we are doing a QtWebKit hackfest. Besides the usual three musketeers (the Guardians of QtWebKit here in Oslo), we also have two Zoltans and Akos (not in the picture) coming from the University of Szeged, as well as Holger (WebKit open-source developer) and Enrico (Italian über-hacker).

We managed to nail down a lot of stuff, among others a discussion about the development workflow using git, a heavy round of final API review, brainstorming on the DOM API, the ACID3 patches merge and some font handling, and lots of bug fixes and touches such as a fix for an annoying focus problem, cursor flashing, a proper non-SVG build, multiple file chooser support, state save and restore signals, native plugin instance lifetime, file extensions for images, the autogenerated inspector qrc, a fix for Enter not working, a missing plugin icon, and many other things, including non-technical ones.

It is a fun hackfest and we certainly need to do this more often :-)

Tuesday, November 04, 2008

world fastest optical polarization tracking

Once in a while, somebody asks me about my dissertation. Since I am working in the software industry these days, it is a bit awkward to mention my previous physics- and electrical-engineering-related work. Now the job has become easier: I can point those curious folks to the following paper:

High-speed endless optical polarization stabilization using calibrated waveplates
and field-programmable gate array-based digital controller

This 8-page text is basically a condensed version of my dissertation; the 2.5 MB PDF file is available for download (free). It will appear in the upcoming Vol. 16, Issue 23 of Optics Express, one of the leading peer-reviewed journals in optics (impact factor last year: 3.709).

In a nutshell, the paper consists of two parts: optical retarder characterization and the implementation of an FPGA-based controller. The first part deals with modelling and estimating the characteristics of off-the-shelf lithium-niobate polarization transformers; the characterization result is then used to calibrate the respective retarders. It uses a new approach based on quaternion analysis. Learning quaternions was a life-changing experience for me, and it seemed natural to use them in this context. Unfortunately, as a footnote in my dissertation says, "Although, rather surprisingly, quaternion is hardly employed in literature on polarization analysis".

The second part is a bit about the controller itself. It comprises an FPGA, some peripherals, and a few optical components. The implementation employs a lot of tricks that were imaginable at that time so that the controller could run as fast as possible (a control iteration time of 2 us) within the limited resources available on a Xilinx Spartan-3. It was a hard task for the few of us. Finally we pulled it off and made it work reliably in a series of experiments. To date, the 15 krad/s stabilization experiment reported in that paper is still the fastest endless optical polarization tracking ever recorded on this planet.

Even these days, I often ask myself, what was in my mind when I decided to pick up this challenge back then?

Saturday, August 23, 2008

Qt 4.4 and Maemo

After playing for a while with Qt 4.4 on my N800 (it will be the same for the N810), the impression of "slow" (like in Adam's latest post) is something that I predicted would soon come up. So far, my very brief investigation reveals that the cause is both simple and sad: Maemo's X server runs a 16-bit visual (the N8xx's display is high color only, not true color). Thus, if you do something fancy with 32-bit content, like running a QPainter on a Format_ARGB32 or Format_ARGB32_Premultiplied QImage, you basically force a 32-to-16 conversion every time you blit the image.

Take a look at PictureFlow, the infamous clone of Cover Flow. You can just check it out, run qmake and make as usual, and start to have some fun (all inside Scratchbox; after that, transfer it to the N8xx).

Surprisingly, it runs horribly slowly, probably at less than 5 fps. Considering that PictureFlow does its magic by straight pixel manipulation and then just blits the result, this looks really weird. However, it turns out that Qt needs to perform the conversion from the 32-bit image to a 16-bit pixmap for every single frame, hence the terrible slowdown. In short, more CPU power is wasted on the blitting than on the rendering. Check your callgrind output to confirm this.

Good news: this is something that can be fixed (both at the application level and probably inside Qt). Personally this one is my favorite aim for Qt 4.5. Once we start to see ported Qt apps for Maemo, we can't afford a potential performance penalty like this; it would just give a wrong impression of Qt. Don't you agree?

Side note: 16-bit can be useful because (on a non-accelerated graphics system) you can perform faster full-screen updates (fewer transferred bytes). 32-bit is however easier to handle, because each color component lives in a separate byte.
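
Going back to the application-level fix, here is a hedged sketch of the idea (PictureFlow's internals may well differ): keep the off-screen frame in the same 16-bit format as the screen, so the blit needs no conversion.

    #include <QImage>
    #include <QPainter>
    #include <QPaintEvent>
    #include <QWidget>

    // the off-screen frame matches the 16-bit screen format
    // (QImage::Format_RGB16 is available since Qt 4.4)
    class FrameWidget : public QWidget
    {
    public:
        FrameWidget() : frame(640, 480, QImage::Format_RGB16) {}

        void renderFrame()
        {
            QPainter p(&frame);
            p.fillRect(frame.rect(), Qt::black);
            // ... draw the slides/covers into `frame` here ...
            p.end();
            update();
        }

    protected:
        void paintEvent(QPaintEvent *)
        {
            // blitting a 16-bit image to a 16-bit visual avoids the
            // per-frame 32-to-16 conversion described above
            QPainter p(this);
            p.drawImage(0, 0, frame);
        }

    private:
        QImage frame;
    };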

Sunday, June 15, 2008

JavaScript speed race

Just for fun, I tested the JavaScript performance of several web browsers using the well-known SunSpider benchmark. The test machine is a fairly old Fujitsu-Siemens Amilo notebook with an AMD Turion 64 1.8 GHz processor and 1 GB RAM, running Microsoft Windows XP SP2.

SunSpider benchmark results

The result is not surprising. Internet Explorer is notoriously slow but there is hope with IE8. Mozilla developers have done a great job optimizing Firefox. WebKit with SquirrelFish (and surely the upcoming Safari 4) really shines in speed.

Monday, February 12, 2007

Modulus with Mersenne prime

Consider the following simple operation, where k is an integer and p is prime:

  int i = k % p;

Typically this will be assembled to (sorry, x86 only):

  mov  eax, k
  cdq
  idiv p

where the result is available in register EDX.

Such an IDIV instruction has a latency of more than 40 cycles on the Intel Pentium or AMD64 processor families. Hence, for optimization purposes, it is best to avoid integer division.

The above division/modulus operation can be avoided if the prime number p is chosen to be a Mersenne prime, i.e. there is a positive integer s such that p = 2^s - 1. In the 32-bit range, the Mersenne primes are: 3, 7, 31, 127, 8191, 131071, 524287, and 2147483647.

The modulus operation with a Mersenne prime can then be simplified to:

  int i = (k & p) + (k >> s);
  return (i >= p) ? i - p : i;

Update: this works only for k up to (1 << (2 * s)) - 1, so be careful with small Mersenne primes. Otherwise you need to apply the reduction repeatedly.
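
For larger k, that folding step can simply be repeated until the value drops into range; here is a small C-style sketch of the general version (the assembler below implements the single-step variant):

    /* modulus by a Mersenne prime p = (1 << s) - 1, valid for any unsigned k:
       2^s is congruent to 1 (mod p), so folding the high bits onto the low
       bits preserves the value modulo p */
    unsigned int mersenne_mod(unsigned int k, unsigned int s)
    {
        const unsigned int p = (1u << s) - 1;
        while (k > p)
            k = (k & p) + (k >> s);
        return (k == p) ? 0 : k;
    }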

One possible assembler implementation is as follows.

  ; assume edx = k, ebx = p, ecx = s
  ; result is in eax
  mov     eax, edx
  sar     edx, cl        ; k >> s
  and     eax, ebx       ; k & p
  add     eax, edx       ; eax is i = (k & p) + (k >> s)
  mov     edx, eax       ; edx is also i
  sub     edx, ebx       ; i - p
  cmp     eax, ebx       ; only if (i >= p)
  cmovge  eax, edx       ; then eax is (i-p)

Note that with the help of CMOVGE (6 cycles of latency on the Pentium 4), there is no need for real branching (which is expensive). Although the code is longer than the IDIV version, it executes much faster. It is still faster if the range of k is limited so that only a decrement is needed, and even faster of course if p is constant.

The last time I used this was for a hash table (micro-)optimization, because the number of stored items was known and I could live with a table whose size is a Mersenne prime. That's indeed a very special case.

BTW this is off-topic, but thanks to those who mailed/texted me after this post. Nothing happened to me, I'm just fine. Those lines are from Keane's latest single (guess which one?), picked for no particular reason other than that it's a good ballad.

Friday, November 24, 2006

Memory efficient DOM (Part 2)

This is the follow-up to the first part of my writing on a memory-efficient DOM implementation for KOffice. This time my intention is to find out whether real-time compression can help to further reduce the memory consumption.

XML compression is a whole field of research. There are a bunch of compressors available: XMill, XMLZip, XMLPPM, XGrind, XML-Xpress, Exalt, TREECHOP and many others. Since I'm not a scholar in this field (I already have my own graduate research to take care of), I have chosen a direct and brutal approach: just compress the packed node representation and decompress it on the fly. Since iteration is very common, a simple cache buffer for decompressed neighbouring nodes is implemented, so there is no need to decompress every single node over and over again. This method easily fulfills the requirements that I set: no need to load the whole XML (because it intercepts SAX events), allows queries and iteration (due to the QDom-like wrapper), and can conveniently be tested with the compression switched off.
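
Here is a hedged sketch of that brutal approach, using Qt's qCompress/qUncompress (zlib) as a stand-in for the LZF codec discussed below, with a one-entry cache so that iterating neighbouring nodes does not decompress the same block again and again:

    #include <QByteArray>
    #include <QList>

    class CompressedBlockStore
    {
    public:
        CompressedBlockStore() : m_cachedIndex(-1) {}

        // compress a block of packed nodes and remember it; returns its index
        int store(const QByteArray &packedNodes)
        {
            m_blocks.append(qCompress(packedNodes));
            return m_blocks.count() - 1;
        }

        // decompress on the fly, but only on a cache miss
        QByteArray fetch(int index)
        {
            if (index != m_cachedIndex) {
                m_cachedIndex = index;
                m_cachedData = qUncompress(m_blocks.at(index));
            }
            return m_cachedData;
        }

    private:
        QList<QByteArray> m_blocks;
        QByteArray m_cachedData;
        int m_cachedIndex;
    };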

There are many algorithms suitable for fast compression and decompression; byte-aligned Lempel-Ziv is a logical and obvious choice. Again, I have pretty much zero knowledge about data compression. There are Ross Williams's LZRW1, Herman Vogt's LZV, Markus Oberhumer's LZO, Marc Lehmann's LZF and many others. LZRW1 is well known, although probably covered by patents. LZV has the poorest compression ratio, so it goes off the list. LZO's compression is the best, at the price of a somewhat more complicated state machine, but my own implementation (the reference implementation is GPL only) is not so fast and well debugged. LZF, the successor to LZV, is a good contender and is already used (among others) in the suspend2 project. On a fairly modern machine with a large CPU cache and a reasonable branch predictor, liblzf (the reference implementation of LZF) is much slower than LZRW1. So what I did was recreate the compressor. Armed with Valgrind's Cachegrind, my new implementation is finally faster than LZRW1.

What I am really eager to try (but have no time for yet) is the Wilson-Kaplan algorithm. At least the WK4x4 and WKdm variants have been implemented for the Linux compressed caching project. And since the compression/decompression inside KoXmlReader actually involves in-memory data, this fast and low-overhead WK algorithm should be a good candidate. There are assorted patent applications for such cache compression, but I wonder whether the algorithm per se is patented.

Now on to the results, starting with the worst test document: Frequencies.ods. Just to remind you, loading and iterating this document (19 MB of XML) takes 250 MB if we're using QDom but only 40 MB with KoXmlReader. Here is what Valgrind/Massif reported when the compact form of the DOM is compressed and decompressed on the fly: at most, this barely consumes 8 MB. Compared to 250 MB, 40 MB is good enough, but 8 MB is great, isn't it?

Interestingly, during the parsing phase, the heap allocation increases only (almost) linearly.

The full comparison with other documents is shown below. Amazing to see how short the violet bars are.

As for speed, regardless of the lightning-fast packing and unpacking, there is some penalty. The graph below summarizes the timing required for parsing (light bars) and iterating (dark bars). Compared to what was shown in part 1, I only added the result for KoXmlReader with compression. Small documents are omitted because their processing time is short anyway.

Of course, all these graphs must be taken with a grain of salt. But my objective is only to find out whether this all makes sense, not to write a full-blown research paper. So far I'm quite happy with the result.

It can be seen that KoXmlReader is a feasible alternative to QDom (in KOffice), as it does not incur any significant performance penalty, yet its memory footprint in the worst case (Frequencies.ods) is only one sixth (40 MB) of that of QDom (250 MB). We also see the additional benefit of using real-time compression: compared to plain KoXmlReader, although in the worst case it is 1/5 slower (and there is still room for further improvements here), the allocated heap is only one fifth (8 MB vs 40 MB), or even 1/30 if you bravely compare it against QDom (8 MB vs 250 MB).

I guess we should empirically find out how large the average documents are that most people work with. If they are smaller than the worst case I tested, then enabling compression by default is a good thing, because the performance penalty won't be noticeable. If not, even without compression it is already much better than using QDom.

Other than that, my aim now is to fully integrate KoXmlReader in KOffice. Hopefully someday bug #59510 can really be closed.

Wednesday, November 08, 2006

Memory efficient DOM

A known problem with KOffice applications is the inability to cope with very large data (e.g. bug #59510). This is often related to filters, e.g. when you are importing Excel documents (see bug #85372). Fortunately, as I wrote there, this has already been solved, as the filter can now produce large documents without causing the machine to consume every single byte of RAM.

Still, a very large file can force the application to use hundreds of MB of memory. One of the reasons is that we use QDom to hold some DOMs associated with the document. Unfortunately, once the DOM gets very big, QDom takes too much memory space. And once some parts are paged out to swap, performance degrades significantly. So when people claim that opening a 1 MB file (which is actually the compressed size, not the actual size of the XML document) consumes more than 300 MB of memory, it is not exaggerated. After looking at some colorful charts below, you will hardly be astonished.

Once the document is loaded, we do not need the DOMs anymore and they can be destroyed. Thus, although loading takes quite some time, once the loading process is finished, the application no longer holds on to that much memory.

This topic has been discussed a couple of times in the past. Last year, I started to work on a memory-efficient implementation of DOM, designed to be used in KOffice. After some attempts, it looks like it may work after all. And it is quite promising indeed.

So, how can this DOM thing be improved?

First, it's likely that we do not need to modify any elements once they are loaded from the file. The way it works is always like this: get the document, load some XML files from the storage (e.g. content.xml, settings.xml, styles.xml, ...), parse them and construct the application-specific objects from those XML files. For example, for a spreadsheet this means that we create all the sheets, rows, columns, cells, and other objects. When the application saves the document, it creates the DOM from scratch and writes it to the file. Thus, when opening a file, the corresponding DOMs are not modified at all.

Second, many namespace URIs, attribute names and element values are duplicated. That alone is already a very good reason to cache common strings. For example, in an OpenDocument spreadsheet, almost every cell has attributes named table:style-name and office:value-type, so we do not need to duplicate them for every single cell.
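
A minimal sketch of that string-sharing idea, relying on the fact that QString is implicitly shared, so handing back an existing copy costs only a reference count:

    #include <QHash>
    #include <QString>

    class StringCache
    {
    public:
        // return a shared copy: identical strings parsed from the XML all end
        // up pointing to the same underlying data instead of being duplicated
        QString shared(const QString &candidate)
        {
            QHash<QString, QString>::const_iterator it = m_cache.constFind(candidate);
            if (it != m_cache.constEnd())
                return it.value();
            m_cache.insert(candidate, candidate);
            return candidate;
        }

    private:
        QHash<QString, QString> m_cache;
    };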

The replacement for QDom, dubbed KoXmlReader, uses these tricks. The whole DOM tree is stored in a very compact form (where the per-node overhead is minimized and all duplicated strings are stored efficiently), not as instances of nodes, elements, texts, etc. However, there are KoXmlNode, KoXmlElement, and KoXmlDocument classes which correspond to their QDom counterparts. The key is that those DOM nodes are created (from the compact representation) only when we need them, and afterwards they can be destroyed again. In short: they are constructed only on demand. This way, compatibility with the QDom classes is preserved (otherwise we would need to modify and test thousands of lines of code) while the memory requirement is kept to a minimum.
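
To make the on-demand idea concrete, here is a very rough sketch, with hypothetical names that do not match the real KoXmlReader internals: nodes are stored as small packed records pointing into a shared string table, and element wrappers are only materialized when asked for:

    #include <QString>
    #include <QVector>

    // one packed record per node: a few integers instead of a full DOM object
    struct PackedNode {
        int depth;        // pre-order position in the tree, with its depth
        int nameIndex;    // index into the shared string table
        int valueIndex;   // index of the text value, or -1 if there is none
    };

    class CompactDocument
    {
    public:
        QVector<QString> strings;     // every distinct string stored exactly once
        QVector<PackedNode> nodes;    // the whole tree in compact form

        // lightweight wrapper, constructed on demand and cheap to throw away
        class Element
        {
        public:
            Element(const CompactDocument *doc, int index)
                : m_doc(doc), m_index(index) {}
            QString tagName() const
            { return m_doc->strings.at(m_doc->nodes.at(m_index).nameIndex); }
        private:
            const CompactDocument *m_doc;
            int m_index;
        };

        Element elementAt(int index) const { return Element(this, index); }
    };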

(Side note: yes, we considered using a SAX parser directly. But that means changing a lot of code and is prone to mistakes. And it seems nobody is willing to do that. My personal feeling is that the gained performance is not worth the hassle, though I won't be so sad if I'm wrong here.)

How can this help? Say we're dealing with a spreadsheet. When we parse the first sheet we need all the nodes and elements belonging to this sheet, but we don't care about the nodes and elements associated with the other sheets. So we just load what we need (this loading even happens automagically). After we finish with it, we can move on to the next sheet and at the same time purge everything from the first sheet (since we don't bother with it anymore).

I tried to make the API for node, element, text, etc. very much compatible with QDom. There is even a compiler define so that it just wraps all those QDom classes (to ease the transition), which allowed Stefan to change all the QDom references to KoXml in one amazing sweep. Unit tests, with more than 1000 checks, provide (hopefully) almost 100% coverage and should help catch bugs or incompatibilities as early as possible (also, I notice that I have one more example of what I wrote some time ago: the whole test suite is longer than the code itself).

So how does it perform?

Obviously this is not a scientifically rigorous benchmark, but just to give some illustration, here are a few spreadsheet documents that I tested (actually I have tried many, many documents; these are just some of them). The Content column specifies the size of the content.xml of each document. As you can see, once it is uncompressed, the file can be very big. The other columns are self-explanatory, I hope.

Document               File Size   Content    Sheets   Cells
permfall.ods           18 KB       92 KB      4        271
test.ods               18 KB       144 KB     13       1138
Customer_Solution.ods  90 KB       1478 KB    18       11735
Financial_Report.ods   143 KB      2947 KB    24       8913
bug85372.ods           247 KB      4754 KB    81       347193
Frequencies.ods        627 KB      18952 KB   3        185334

So what happens if we load each of these documents, hand it to our XML parser, construct the DOM tree and iterate over every single cell in the document? We won't load it into KSpread, and we won't create any data from it. We just treat it as a plain DOM representation of a spreadsheet.

Using Valgrind's Massif, here is what I got when I processed one of them using the QDom classes (QDomDocument, QDomElement, etc.). As you can witness, the memory consumption skyrocketed to around 250 MB although the compressed ODS file is not even 1 MB. In reality (i.e. within KOffice), the situation is of course much worse, because KOffice needs some more megabytes for its own code and data.

And this is how it looks when I patch QDom so that the strings are cached (get the simple patch if you're curious). It does not seem to help much in this case.

However, using KoXmlReader stuff (KoXmlDocument, KoXmlElement, etc), at most we need only around 40 MB (see the vertical axis).

What follows is the complete comparison chart for the other documents. Note that I omit Frequencies.ods (which you can already enjoy in the graphs above) because the bars for that file would be very, very long compared to the others.

(There are perhaps better ways to instrument memory usage, so please take my lame attempt above with a grain of salt).

So we can see that the family of KoXml classes can be useful. But what are the catches? For sure, we can't change the DOM tree; that much is clear. However, as I wrote in the beginning, this does not matter a lot for KOffice. In other words, you can't just simply substitute your QDom with this KoXmlReader. It is designed and implemented for a very specific use case. It's not even a fully compliant implementation.

Another disadvantage is that iterating over nodes is a bit slower than with QDom. The good news: parsing is faster. This is because the whole process is likely memory bound. True, KoXmlReader has some processing overhead while packing and unpacking the nodes (i.e. constructing them on the fly). But it also definitely touches only a fraction of the RAM compared to QDom, and we all know how slow memory operations are.

A few runs on a P4 1.8 GHz give the following chart. The light color is for parsing, the darker one for iterating. It looks like KoXmlReader does not really lag behind.

This also makes me rather excited; I wonder whether my next attempt (using additional real-time compression, wait for Part II of this writing) will still give acceptable performance in terms of speed.

There are still a few issues which must be solved before this stuff can be fully utilized in the upcoming KOffice 2.0. I'm working on that. But just as an appetizer, here is the memory chart of running KSpread and opening one of the test documents (Customer_Solution.ods). Using QDom, the peak was about 110 MB before it settled down to around 80 MB when the loading completed. I have no idea what caused the deep spike in the middle; this must be investigated (is it really from KSpread? Or possibly Valgrind?).

However, when the KoXmlReader stuff is used instead of QDom, the heap allocation (during parsing and iterating) touches just about 90 MB, as evidenced below:

Please do not assume that 80 MB is what KSpread 2.0 needs to work with such a simple document. What I am using is a rather work-in-progress version, so it has lots of extra stuff. KOffice 2.0 is still very far from its release date and there is still lots of ongoing work on optimizations.

To be continued...