Category Archives: Web

URL parameters in JavaScript

I wanted a piece of code in pure JavaScript ( no framework required ) that could extract the parameters from the query string part of a URL.

I wanted it to be able to extract parameters in the name[key]=value format, the way they are used in PHP applications.

I found a few snippets on other blogs and forum posts, but they didn't work as I expected, so here is my take on it.

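Here's a minimal version ( a sketch, the function names are mine ): it splits the query string on '&' and '=', decodes each piece and treats name[key]=value pairs as single-dimension maps.

  // Parse the query string part of a URL into an object.
  // Supports the PHP style name[key]=value ( one dimension only ).
  function getUrlParams(url) {
      var params = {};
      var qpos = url.indexOf('?');
      if (qpos === -1) return params;
      var query = url.substring(qpos + 1).split('#')[0];   // drop the fragment, if any
      if (!query) return params;
      var pairs = query.split('&');
      for (var i = 0; i < pairs.length; i++) {
          if (!pairs[i]) continue;
          var parts = pairs[i].split('=');
          var name = decodeURIComponent(parts[0]);
          var value = parts.length > 1
              ? decodeURIComponent(parts.slice(1).join('=').replace(/\+/g, ' '))
              : '';
          var m = name.match(/^([^\[]+)\[([^\]]*)\]$/);   // is it name[key] ?
          if (m) {
              if (!params[m[1]] || typeof params[m[1]] !== 'object') params[m[1]] = {};
              if (m[2] === '') {
                  params[m[1]][countKeys(params[m[1]])] = value;   // name[]=value, append
              } else {
                  params[m[1]][m[2]] = value;                      // name[key]=value
              }
          } else {
              params[name] = value;
          }
      }
      return params;
  }

  function countKeys(obj) {
      var n = 0;
      for (var k in obj) if (obj.hasOwnProperty(k)) n++;
      return n;
  }

For example, getUrlParams('page.html?a=1&b[x]=2&b[y]=3') returns { a: '1', b: { x: '2', y: '3' } }.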

This function has one limitation: it doesn't work with multidimensional arrays. It's probably not hard to modify it to handle those, but I only needed it to work with single-dimension arrays.

7 Methods to cache web applications

The best web caching system is the one that allows visitors to use your site or web application without fetching anything from your server ... well almost anything.

By fetching as little as possible your server gets fewer hits, which minimizes the load and the need to acquire new hardware and complicated setups, but it also improves the user's experience a lot, because the web application will load much faster when most files ( scripts, CSS, images ) are already on their disk.

The idea is to set such a high cache expiry time ( via max-age and related parameters ) that browsers won't even look for newer versions for a long time ( like a year or more ).

Here's what I learned recently when trying to optimize a big web application built on JavaScript and PHP:

0) Page analysis

Before you get started, run Page Speed or YSlow against your app/site, then come back here and see how you can solve the caching problems they list.

1) High cache age is good but what do you do when your site changes?

You definitely want to push the changes to your users, right?

Answer: version everything.

You may have noticed the way a lot of sites include scripts and CSS files with a version at the end, like: jquery.js?ver=1232442

Here's how this works: the main page that includes this script is not cached, so the visitor will load it every time, but the browser caches the jquery.js?ver=1232442 URL ( because you said so in your web server config ).

Now if you update jQuery to a new version, all you have to do is change the URL in the main page to jquery.js?ver=1232443 and the browser will know it has to fetch the jquery.js file again, because from its point of view it's a totally different file.

If you can use PHP in the template that outputs the page, you could even do something like:
<script src="jquery.js?ver=<?=filemtime('jquery.js')?>"></script>. By doing this you don't have to worry about updating the main page when you update jquery.js.
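
If you don't want to sprinkle filemtime() calls all over your templates, you can wrap the idea in a tiny helper ( a sketch; the function name and paths are just examples ):

  <?php
  // Append the file's last modification time as a version parameter.
  // $path is relative to the document root; fall back to the bare path
  // if the file can't be found.
  function versioned_url($path) {
      $file = $_SERVER['DOCUMENT_ROOT'] . '/' . ltrim($path, '/');
      if (!is_file($file)) {
          return $path;
      }
      return $path . '?ver=' . filemtime($file);
  }
  ?>
  <script src="<?php echo versioned_url('jquery.js'); ?>"></script>
  <link rel="stylesheet" href="<?php echo versioned_url('css/style.css'); ?>" />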

2) CSS/HTML rewriting.

So you do this versioning thing for JavaScript and maybe CSS files, but what do you do about images? How do you cache them and still make sure your visitors will always see the latest version?

Your images are referenced from the CSS files or the HTML content. You probably already serve your HTML content through a script ( a CMS? ), so you'll have to modify this script to automatically add the versioning string to each image or other static file you want cached.

For CSS do the same: serve it through a script and rewrite the image paths before you output it. Even better, especially if you use multiple CSS files, write a build script that combines them into one file ( it loads faster this way ), does the rewrite, minifies the result, saves it, then compresses it and stores the compressed version so you can serve that when possible. You would have to run this script every time you change something in your CSS code.
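
Here's roughly what such a build script could look like in PHP ( only a sketch: the file list, the url() rewriting and the "minification" are deliberately simplistic, adapt them to your setup ):

  <?php
  // build_css.php -- combine, version, minify and pre-compress the CSS.
  // Run it every time the CSS changes.
  $cssFiles = array('reset.css', 'layout.css', 'theme.css');   // example file list
  $root     = dirname(__FILE__);

  $css = '';
  foreach ($cssFiles as $file) {
      $css .= file_get_contents($root . '/' . $file) . "\n";
  }

  // Add a version string to every url(...) reference so the images can be cached "forever".
  $css = preg_replace_callback('~url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)~', function ($m) use ($root) {
      if (preg_match('~^(data:|https?:|//)~', $m[1])) {
          return $m[0];   // leave inline and external resources alone
      }
      $img = $root . '/' . ltrim($m[1], '/');
      $ver = is_file($img) ? filemtime($img) : time();
      return 'url(' . $m[1] . '?ver=' . $ver . ')';
  }, $css);

  // Very naive minification: strip comments and collapse whitespace.
  $css = preg_replace('~/\*.*?\*/~s', '', $css);
  $css = preg_replace('~\s+~', ' ', $css);

  // Store both the plain and the gzipped version so the server can send either one.
  file_put_contents($root . '/all.css', $css);
  file_put_contents($root . '/all.css.gz', gzencode($css, 9));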

3) HTTP proxies cache differently.

It is believed that most proxies will not cache URLs with query strings in them, like jquery.js?ver=122323

An HTTP proxy can minimize the hits on your server by fetching a file only once and distributing it to more than one user, but if you want to take advantage of that you have to use a different versioning scheme.

One idea is to insert the version before the file extension, like jquery-122323.js, so the URLs no longer look "dynamic".

If you do this and you don't actually want to rename all the files, you can use a few mod_rewrite rules to map anything matching that pattern back to the actual files.
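
If you go with that scheme, a couple of lines like these in your .htaccess ( a sketch, adjust the pattern to your own naming ) will map the versioned names back to the real files:

  RewriteEngine On
  # jquery-122323.js -> jquery.js ; the version part is ignored,
  # it only exists to make the URL unique for caches
  RewriteRule ^(.+)-[0-9]+\.(js|css|png|gif|jpg)$ $1.$2 [L]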

4) HTTPS is a different animal

Yes, by default browsers will not cache content that comes over HTTPS, because it's considered a security issue. Imagine your app generates a PDF or image with sensitive user info and "says" it can be cached for a year, and the user downloads it on a publicly available computer: the next user will get the same file. Of course this could happen over HTTP too, so be careful with what you allow to be cached. The only difference with HTTPS is that the browser will disregard the normal caching instructions if the file is served over HTTPS.

Now you might say "why would you even want to send generic scripts, CSS or images over HTTPS?" ... right ... Well, you do, because if you allow HTTPS access to your app and you don't send everything over HTTPS, the browsers will warn the user that not everything on the page is encrypted. Some users won't care, especially if they know what the warning means or how to check what's not encrypted, but others might freak out about it.

So if you want to send everything over HTTPS and you still want the browser to cache the files, you have to set the header "Cache-Control: public" ... but again, make sure you only set this for static files that are generic for all users.

And if you set Cache-Control, add max-age to it as well; if you only set "public" it might override caching information set elsewhere ( like an Expires header ). So the header should look like: Cache-Control: public, max-age=31536000 ( cache for a year, even over HTTPS and for authenticated (HTTP authentication) users ).
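
If such a file happens to be served through PHP, setting the header is a one-liner ( a sketch, the file names are made up ):

  <?php
  // serve_css.php -- example of serving a generic static file with an explicit cache policy
  header('Cache-Control: public, max-age=31536000');   // a year, shared caches allowed, works over HTTPS too
  header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 31536000) . ' GMT');
  header('Content-Type: text/css');
  readfile('all.css');   // only do this for files that are identical for every user!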

5) Gzip caching

If you're using Apache then it's probably already using mod_deflate to compress static files for browsers that advertise support for compression in the Accept-Encoding header. This is good, as it speeds up page loading a bit, but it means Apache is compressing the same content over and over for each visitor, consuming your CPU time. And even if you do the caching mentioned above, it will still compress for every new visitor. So why not compress the content once, cache it, and serve that to everybody?

To do that you'll have to use mod_gzip. This Apache module will negotiate the Content-Encoding with browsers, and if the browser supports it, it will send the compressed file instead of the uncompressed one. mod_gzip does even more: it will pre-compress the files so you don't have to do it yourself, and it can figure out by itself when you updated the original file and regenerate the compressed version. mod_gzip can really save a lot of CPU time on your server.

6) Caching Dynamic content

This basically means generating static content from your dynamic pages and saving it on disk ( plain and compressed ... see #5 ) so Apache or a script can serve it directly, without having to go to the database or compute the results. WP Super Cache does something like this for WordPress.

Dynamic content is more likely to change often, and unlike images, CSS and JavaScript it's usually not referenced from some other non-cached page, so you can't play the versioning trick and set a high cache max-age for it, which means you can't reduce the hits as much.

But if you serve it through a script that can easily ( cheaply ) determine that the content has not changed, then that script can issue a "304 Not Modified" response and the browser will know it already has the content. This can be a lot faster than actually regenerating the dynamic content and sending it to the client.
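
Here's the general shape of that check in PHP ( a sketch: get_last_modified_timestamp() is a made-up helper, how you determine the last change is up to your app ):

  <?php
  // Pretend this is cheap to compute ( e.g. a single indexed query or a cached value ).
  $lastChanged = get_last_modified_timestamp();   // made-up helper, returns a unix timestamp

  $since = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
      ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
      : false;

  header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastChanged) . ' GMT');

  if ($since !== false && $since >= $lastChanged) {
      header('HTTP/1.1 304 Not Modified');   // the browser's copy is still good
      exit;
  }

  // ...otherwise build the page as usual and send it with a normal 200 response.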

Here's how to do dynamic content caching in PHP

There's also a lot of caching that can be done at the database server level, or before/after talking to the database server ( memcached ), but that's a totally different topic.

What else?

Did I miss anything? If you know other techniques I'd love to read about them, so feel free to hit the comments ... but not too hard, as this blog doesn't do much of the caching discussed here 🙂

BTW: that big web app I mentioned at the beginning of this post is an email marketing service that I just launched in beta. If you run a blog and you're thinking about sending a newsletter, you might want to try it. Beta testers get some nice benefits.

Mod_rewrite quick tip

This may be obvious to some mod_rewrite experts, but I spent a lot of time figuring it out, and I get the feeling I had this problem before and forgot what the solution was, so here it is:

mod_rewrite does NOT match your pattern against the query string, only against the path part of the URL.
To match the query string you must use a RewriteCond directive.

From mod_rewrite documentation:

Note: Query String

The Pattern will not be matched against the query string. Instead, you must use a RewriteCond with the %{QUERY_STRING} variable. You can, however, create URLs in the substitution string, containing a query string part. Simply use a question mark inside the substitution string, to indicate that the following text should be re-injected into the query string. When you want to erase an existing query string, end the substitution string with just a question mark. To combine a new query string with an old one, use the [QSA] flag.

That last  part about QSA was the one that made me rediscover this 🙂
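
To make it concrete, here's a small example ( a sketch with made-up file names; the point is the RewriteCond on %{QUERY_STRING}, the trailing "?" that erases the old query string, and the QSA flag that keeps it ):

  # Redirect page.php?id=123 to /page/123 and drop the old query string
  # ( the trailing "?" in the substitution erases it )
  RewriteCond %{QUERY_STRING} (^|&)id=([0-9]+)
  RewriteRule ^page\.php$ /page/%2? [R=301,L]

  # And map /page/123?foo=bar back to page.php?id=123&foo=bar,
  # keeping the rest of the query string with the QSA flag
  RewriteRule ^page/([0-9]+)$ /page.php?id=$1 [QSA,L]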

Scour: The social search engine

I just started using Scour, a search engine that lets you vote and comment on the results.

Scour queries the top 3 major search engines ( Google, Yahoo and Live ) to provide results, so it's like using your preferred search engine with a social twist. You can vote each result up or down and comment on it, and Scour then uses this data ( votes, comments ) to provide better relevancy.

The problem is that when people search they want results quickly, and once they get them they just leave, so in order to encourage users to contribute, Scour rewards them with points that can be converted into money via Visa gift cards.

The idea is that since the major search engines are making billions from search, the user should get something ( more than just search results ) out of it too.

Once you've signed up for Scour you can start using it for your daily searches just like you did with Google, Yahoo or MSN. They even have a search bar plugin for Internet Explorer and Firefox, and in the FAQs you can find instructions on how to make Firefox use it as the default search engine instead of Google. There is also a toolbar, but apparently it's only for Internet Explorer, or only for Windows ( .exe ).

As you keep searching, voting and commenting you accumulate points. For each search you get 1 point, for each vote 2 points and for each comment 3 points, up to a maximum of 4 points per search, and once you reach 6500 points you get a $25 Visa gift card.

I like Scour both for the idea of better relevancy through votes and comments and for the way it rewards users.

Scour is still young and there are some small problems with it ( try searching for 'var/log', or the fact that it only displays 3 pages of results ), but I'm sure they will be fixed and the search engine will improve over time.

Of course, the whole idea of better relevancy will only work if more users sign up, use it regularly and contribute.

Firefox 3 beta 5 released

Mozilla released the 5th beta of the Firefox 3 browser a few hours ago.

The new beta brings enhancements to the bookmark organizer, operating system integration and, most important to me, the speed of the JavaScript engine that so many sites depend on these days.

There are just 750 improvements since the last beta version, 250 fewer than the number of improvements between beta 3 and beta 4.

I think this will probably be the last beta version before a release candidate, even though the "known issues" list is a bit longer than the previous beta's.

Here are the release notes and here is the download page for those of you that want to give it a try.

XML Sitemaps for Pligg

Update: There is a new version of this module. Click here to get it.

I created a module that generates XML Sitemaps for Pligg ( the well-known open source CMS used for creating sites similar to digg.com ).

The module generates a sitemap index and sitemaps with all the stories in the database dynamically, so nothing is stored on disk and you don't have to set up a cron job to generate them.

The sitemaps are updated automatically when a new story is submitted. Because of the structure of the sitemap index and because it contains "lastmod" info, the search engines should only download the latest entries in the index so you shouldn't worry about the module putting too much load on your system.

There is also a "ping" function that will notify Google, Yahoo and Ask.com every time a new story is submitted, so that they know they have to download the sitemap. The ping function required a little patching of the Pligg source code to add some hooks ( only if you use 0.9.6; 0.9.7 should already have those hooks ). Here is the diff file in case you use Pligg 0.9.6: pligg submit hooks diff

The module was only tested on Pligg 0.9.6 ( I haven't upgraded to 0.9.7 yet ), so if you try this on 0.9.7 let me know how it works; any feedback is appreciated.

Download:

You can download the Xml_Sitemaps module from here: xml_sitemaps-0.1.tar.gz, and in case you want to verify it, here are the md5 sum and the sha256 sum.

The code is released under the same license as Pligg, so feel free to use it, modify it and share it.

Installation:

This is pretty straightforward: install it like any other Pligg module. Put the .tar.gz file in the modules directory, un-archive it, then go into the Pligg admin and activate it. If you use Pligg 0.9.6 and want to be able to ping the search engines, don't forget to apply the submit hooks patch.

Configuration:

After installation you should be able to access the sitemap index at http://yourdomain.com/module.php?module=xml_sitemaps_show_sitemap. If you want the sitemap URL to look friendly ( btw, ask.com will only accept a friendly sitemap ending in .xml ), go into Admin->Configuration->XmlSitemaps and enable "Sitemap Friendly URL". If you do that, you also have to put the following lines in your .htaccess somewhere before the line "##### URL Method 2 ("Clean" URLs) Begin #####":

  RewriteRule ^sitemapindex.xml module.php?module=xml_sitemaps_show_sitemap [L]
  RewriteRule ^sitemap-([a-zA-Z0-9]+).xml module.php?module=xml_sitemaps_show_sitemap&i=$1 [L]

Here is how the index looks on a site with sitemap friendly urls enabled: http://sapa.ro/sitemapindex.xml

There are other configuration options in there: you can set the maximum number of stories to put in a sitemap, and you can choose whether to ping any of the three supported search engines. You can also set your yahoo.com key in there if you want to ping Yahoo.

That's it! Happy Sitemapping! and as always ... let me know how it works in the comments.

No browser supporting socks5 authentication?

If you're trying to use a SOCKS server with Internet Explorer, Firefox, Opera or Safari, everything will work just fine ... except for authentication.

From my point of view this is a big problem. Who in the world would leave such a proxy server unprotected? Yes, of course you can always limit access to a proxy server based on IP address, but in some cases ( see NAT ) this is just not going to work.

Internet Explorer supports only the SOCKS4 protocol, which doesn't even support full password authentication ( only a username, and it defaults to the currently logged-in username ).

Firefox supports SOCKS5 but no authentication mechanism, so its SOCKS5 support is pretty much useless. I think I saw a ticket in Bugzilla about this, but no one has managed to commit a fix yet.

Opera doesn't even support the SOCKS protocol, but I thought I should mention all major browsers 🙂

Safari supports SOCKS5 and even allows you to set a username and password to access the SOCKS server but it does not use them.

I tried Konqueror, but I was unable to specify the SOCKS server; I guess this is because it was not compiled with a SOCKS library. Has anyone had any success with Konqueror and SOCKS?

How to write about Linux for Digg?

I can't say I really know the answer to this question, as none of my articles has reached the front page, and I don't think they ever will, mainly because the Digg audience doesn't care much about the type of content I write. But check out this site: www.venturecake.com.

The site has only 11 articles and 6 of them reached the front page on digg.com. Venturecake.com is a blog about technology, mainly open source, Linux, Unix, Apple, and some others. The last post ( Who copied who? ) was published yesterday and it got over 600 diggs in one day.

The posts that made it to Digg's front page are about common buzzwords like Apple, Web 2.0 ( Web 2.0 is built on Open Source ), Open Source ( yes, this is still a buzzword ), Ubuntu and virtualization ( 15 minutes to using your existing Windows install & apps in Ubuntu, 10 minutes to run every Windows app on your Ubuntu desktop ), but also some unique tips like 10 Linux shell tricks you don’t already know. Really, we swear.

Google set to kill link ads

Google has a way of reporting paid links now. They say buying links is an attempt to game their PageRank algorithm, and they want you to report sites that sell or buy paid links.

They agree links are a good way of advertising and are not against that, but they want those that display text links to put the rel="nofollow" attribute on them. Using the "nofollow" attribute means that Googlebot will not follow the link, and thus will not use it when computing the PageRank of the destination URL.

I think the only reason you would want text links on a site is exactly that: to get higher PageRank and relevance. So by requiring webmasters to use nofollow, they are just killing text link advertising networks like Text Link Ads, which work precisely because they sell link ads that are followed and transfer PageRank.

Google says this violates their guidelines. How can you violate a guideline? You can violate a rule, but if it's just a guideline, that means you shouldn't be penalized for not following it.
And there are other problems with this policy. Links are supposed to mean that the owner of a site thinks some other site is relevant, and that is why he links to it. Paid or not, it can be relevant. PageRank is about relevancy, right?
If I want my site in Google's search ads, I pay Google for it; does that mean my ads are not relevant? Google says it shows contextual ads because they are relevant to the content the user is seeing. It seems to me it's only relevant if you pay Google for the ad.

And here's another problem: how can Google tell that the person reporting such a violation isn't lying? If I wanted to get a competitor out of Google's index or drop them to a lower PageRank, I could just report them for buying text links. A lot of websites have text links pointing to them, paid or not, and it's hard to tell which. Some disclose them, others don't. This may push the ones that do disclose them to stop disclosing. Why add to the risk of being reported?