@@ -2662,3 +2662,380 @@ TIP 9: In versions below 8.0, the planner will ignore your desire to
2662
2662
choose an index scan if your joining column's datatypes do not
2663
2663
match
2664
2664
2665
+ From pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org Tue Dec 27 06:30:14 2005
2666
+ Return-path: <pgsql-hackers-owner+M77964=pgman=candle.pha.pa.us@postgresql.org>
2667
+ Received: from ams.hub.org (ams.hub.org [200.46.204.13])
2668
+ by candle.pha.pa.us (8.11.6/8.11.6) with ESMTP id jBRDUET16504
2669
+ for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 08:30:14 -0500 (EST)
2670
+ Received: from postgresql.org (postgresql.org [200.46.204.71])
2671
+ by ams.hub.org (Postfix) with ESMTP id 3BCBA67A57F
2672
+ for <pgman@candle.pha.pa.us>; Tue, 27 Dec 2005 09:30:14 -0400 (AST)
2673
+ X-Original-To: pgsql-hackers-postgresql.org@localhost.postgresql.org
2674
+ Received: from localhost (av.hub.org [200.46.204.144])
2675
+ by postgresql.org (Postfix) with ESMTP id 1DB3B9DC859
2676
+ for <pgsql-hackers-postgresql.org@localhost.postgresql.org>; Tue, 27 Dec 2005 09:29:47 -0400 (AST)
2677
+ Received: from postgresql.org ([200.46.204.71])
2678
+ by localhost (av.hub.org [200.46.204.144]) (amavisd-new, port 10024)
2679
+ with ESMTP id 89101-04
2680
+ for <pgsql-hackers-postgresql.org@localhost.postgresql.org>;
2681
+ Tue, 27 Dec 2005 09:29:50 -0400 (AST)
2682
+ X-Greylist: from auto-whitelisted by SQLgrey-
2683
+ Received: from svana.org (svana.org [203.20.62.76])
2684
+ by postgresql.org (Postfix) with ESMTP id 319839DC8A5
2685
+ for <pgsql-hackers@postgresql.org>; Tue, 27 Dec 2005 09:29:44 -0400 (AST)
2686
+ Received: from kleptog by svana.org with local (Exim 3.35 #1 (Debian))
2687
+ id 1ErEte-0001lm-00; Wed, 28 Dec 2005 00:29:46 +1100
2688
+ Date: Tue, 27 Dec 2005 14:29:46 +0100
2689
+ From: Martijn van Oosterhout <kleptog@svana.org>
2690
+ To: pgsql-hackers@postgresql.org
2691
+ Subject: [HACKERS] Proposed COLLATE implementation
2692
+ Message-ID: <20051227132941.GA32404@svana.org>
2693
+ Reply-To: Martijn van Oosterhout <kleptog@svana.org>
2694
+ MIME-Version: 1.0
2695
+ Content-Type: multipart/signed; micalg=pgp-sha1;
2696
+ protocol="application/pgp-signature"; boundary="W/nzBZO5zC0uMSeA"
2697
+ Content-Disposition: inline
2698
+ User-Agent: Mutt/1.3.28i
2699
+ X-PGP-Key-ID: Length=1024; ID=0x0DC67BE6
2700
+ X-PGP-Key-Fingerprint: 295F A899 A81A 156D B522 48A7 6394 F08A 0DC6 7BE6
2701
+ X-PGP-Key-URL: <http://svana.org/kleptog/0DC67BE6.pgp.asc>
2702
+ X-Virus-Scanned: by amavisd-new at hub.org
2703
+ X-Spam-Status: No, score=0.08 required=5 tests=[AWL=0.080]
2704
+ X-Spam-Score: 0.08
2705
+ X-Mailing-List: pgsql-hackers
2706
+ List-Archive: <http://archives.postgresql.org/pgsql-hackers>
2707
+ List-Help: <mailto:majordomo@postgresql.org?body=help>
2708
+ List-Id: <pgsql-hackers.postgresql.org>
2709
+ List-Owner: <mailto:pgsql-hackers-owner@postgresql.org>
2710
+ List-Post: <mailto:pgsql-hackers@postgresql.org>
2711
+ List-Subscribe: <mailto:majordomo@postgresql.org?body=sub%20pgsql-hackers>
2712
+ List-Unsubscribe: <mailto:majordomo@postgresql.org?body=unsub%20pgsql-hackers>
2713
+ Precedence: bulk
2714
+ Sender: pgsql-hackers-owner@postgresql.org
2715
+ Status: OR
2716
+
2717
+
2718
+ --W/nzBZO5zC0uMSeA
2719
+ Content-Type: text/plain; charset=us-ascii
2720
+ Content-Disposition: inline
2721
+ Content-Transfer-Encoding: quoted-printable
2722
+
2723
+ Greetings all,
2724
+
2725
+ If you're not interested in COLLATE, operator classes or related
2726
+ things, stop now, this is quite a long email.
2727
+
2728
+ Firstly, status. PostgreSQL doesn't really support collations at all.
2729
+ The order of strings is defined at initdb time by the locale then and
2730
+ cannot be changed later. We allow lists to be sorted in either
2731
+ ascending or decending order but that's about it. Whatever order there
2732
+ is is deduced from b-tree operator classes.
2733
+
2734
+ The purpose of this patch is to raise collations to (reasonably) first
2735
+ class object. The idea is that you can define a collation across any
2736
+ type and that you will then be able to ORDER BY, GROUP BY and INDEX
2737
+ using that collation. A collation defines both order and equality.
2738
+
2739
+ The SQL standard does define COLLATE although they only apply that to
2740
+ character strings. There are no predefined collations in the standard.
2741
+ There are rules about how collations and collation states propegate
2742
+ from the leaves of the parse tree all the way to the root. In its
2743
+ simplest form, columns and constants have defined collations which
2744
+ modify the behaviour of functions using these values. At any point in
2745
+ the parse tree the user can override the collation with an explicit
2746
+ <collate clause>. If there is ambiguity about what collation applies at
2747
+ any point for a function that needs to know, this is a error.
2748
+
2749
+ All this is parse-time analysis.
2750
+
2751
+ Proposed Implementation:
2752
+
2753
+ NODES
2754
+
2755
+ To implement the above, two new node types are created: CollateClause
2756
+ which represents the <collate clause> in SQL syntax, and CollateState
2757
+ which represents the actual state at any node. Currently the only nodes
2758
+ expected to require these are OpExpr, FuncExpr, Var and Const. Although
2759
+ I guess it may apply to any node that can be used in an expression.
2760
+
2761
+ CATALOG CHANGES
2762
+
2763
+ To track collations requires a new table in the catalog, which I have
2764
+ named pg_collations. It contains the following fields:
2765
+
2766
+ Oid oid; -- OID for this collation
2767
+ Name collname; -- Name of the collation (for collate clause)
2768
+ bool collasc; -- Ascending or descending
2769
+ Oid collopclass; -- Implementing Operator Class
2770
+ int4 colltype; -- Currently, 0=3Dsimple, 1=3Duses locale
2771
+ Oid colllocale; -- Locale in pg_locales
2772
+
2773
+ (Should we be identifying the type here? or is it ok to lookup the type
2774
+ via the operator class).
2775
+
2776
+ The first few fields name the collation so it can be referred to by the
2777
+ user. Then the collasc field determines how to use the operator class
2778
+ as given in the collopclass field. If it indicates descending order, it
2779
+ will invert the sense of the operator class. For example, asking for
2780
+ the GT op for a reverse collation will actually return the LT operator
2781
+ for the operator class.
2782
+
2783
+ The purpose of the colltype and colllocale fields are described further
2784
+ down.
2785
+
2786
+ The important thing at this point is that by specifying a collation you
2787
+ are also specifying an operator class. At the moment the ascending and
2788
+ descending collations for each type are hard-coded for initdb. At the
2789
+ moment they have been allocated OIDs starting at 2800, which is the
2790
+ first large available block.
2791
+
2792
+ Each column of a table has a default collation, which defaults to the
2793
+ default collation of the type but can be specified in the table
2794
+ declaration. To store this requires an additional column in
2795
+ pg_attribute (attcollate) which contains the OID of the collation for
2796
+ that column. When it is referenced in a query, this collation is copied
2797
+ to the CollateState node of the Var node, from whence it can affect the
2798
+ query.
2799
+
2800
+ Finally, to allow the parser to complain about ambiguous CollateStates,
2801
+ we need to indicate which functions actually need a senseble collate
2802
+ state to function. To this end a single boolean field has been added to
2803
+ pg_proc (proneedcollate). If this is true, the parser should error out
2804
+ when the collation state is COLLATE_NONE.
2805
+
2806
+ INDEXES
2807
+
2808
+ Another place you will be allowed to use the collate clause is while
2809
+ creating indexes. If you declare an index using a particular collation,
2810
+ it can be used in queries that order by the same collation. Note, that
2811
+ the collate clause indicates the operator class, so you can either
2812
+ specify one or the other, but not both.
2813
+
2814
+ So each column of an index will also have a collation. However,
2815
+ pg_attribute has already got an extra field to store the collation for
2816
+ columns so it makes sense to store the collation here. In the process
2817
+ the pg_index.indclass field becomes redundant as it can be inferred
2818
+ from the pg_attribute rows associated with the index.
2819
+
2820
+ To make this work there also needs to be a notion of compatability
2821
+ between collations. For example, two collations which are the reverse
2822
+ of eachother are compatable in the sense that an index defined with one
2823
+ collation would be usable for the other simply by scanning in reverse.
2824
+
2825
+ FUNCTIONS
2826
+
2827
+ In particular for string comparison but also possibly for user-defined
2828
+ types, a function will need to know what collation it is operating
2829
+ under. For this purpose an extra field (fn_collate) is added to
2830
+ FmgrInfo which is filled in with the collation from the expression tree
2831
+ (if any) or wherever relevent (eg. from the pg_attribute column when
2832
+ doing statistics or creating indexes).
2833
+
2834
+ A PG_GETCOLLATE() macro is added to facilitate user-functions
2835
+ retreiving this data. This function throws an error when no collation
2836
+ has been defined. This shouldn't happen in practice as issues should
2837
+ have been weeded out at parse-time.
2838
+
2839
+ This macro returns the OID of the collation but in many cases it will
2840
+ not be necessary. In particular, functions should NOT invert their
2841
+ result if the collation is inverted. It is considered the
2842
+ responsibility of the caller to invert the result if necessary. The
2843
+ reasons for this are:
2844
+
2845
+ 1. In most cases that matter (order comparison) the issue can be dealt
2846
+ with at parse time by the NEGATOR or COMMUTATOR options.
2847
+ 2. For index scans, we would just do a reverse scan instead (or forward
2848
+ if the index is inverted)
2849
+ 3. Requiring every function to check the collation for inversion is
2850
+ wasteful, since in many cases the case can be dealt with statically.
2851
+
2852
+ DEFAULT COLLATIONS
2853
+
2854
+ At this point I'm inclined to define a few collations to be built in or
2855
+ specially handled:
2856
+
2857
+ COLLATE ASC - default collation for type, ascending (ie, what we do now)
2858
+ COLLATE DESC - default collation for type, inverted
2859
+ COLLATE POSIX - For strings, define a simple bytewise string comparison.
2860
+
2861
+ Indeed, it is expected that by default, all columns involving strings
2862
+ in the system catalog will always use COLLATE POSIX. Additionaly, type
2863
+ "name" will always use that collation, even if the user changes the
2864
+ default (by a method to be specified). This is straightforwardly done
2865
+ at initdb time.
2866
+
2867
+ The purpose of COLLATE DESC is to simplify index declarations. Saying
2868
+
2869
+ CREATE INDEX foo ON bar( a COLLATE ASC, b COLLATE DESC );
2870
+
2871
+ would allow it to be used in a query using ORDER BY a, b DESC, without
2872
+ the user having to lookup the name of the collation.
2873
+
2874
+ When it comes to naming collations, the question arises whether
2875
+ ascending/descending collations need to have different names. Or
2876
+ should there be two collations with the same name with ASC/DESC as a
2877
+ modifier? Do collations have to be unique across different types; for
2878
+ example, can varchar and text both have a collation "ignorecase"?
2879
+
2880
+ Another issue is that a column could be declared with a descending
2881
+ collation by default. Say it was an integer column, then (a < 5) would
2882
+ return FALSE for a =3D 1. While technically correct, I'm thinking of
2883
+ ruling it out on the basis of being too confusing, and only allow
2884
+ descending collations at query time or in index specifications.
2885
+
2886
+ Another strange point at the moment is how to determine the default
2887
+ collation of a type. At the moment it is done by finding the default
2888
+ operator class and looking up the ascending version of that. However,
2889
+ we may want to provide the user a way of specifying it directly,
2890
+ perhaps by:
2891
+
2892
+ ALTER TYPE text SET DEFAULT COLLATION ignorecase;
2893
+
2894
+ PATHKEYS
2895
+
2896
+ Currently during planning, pathkeys are indicated by an operator of the
2897
+ operator class. Here we would simply replace that with the oid of the
2898
+ collation, which can be matched directly with the collation defined by
2899
+ the index.
2900
+
2901
+ USER DEFINED TYPES
2902
+
2903
+ None of this is interesting unless it can be applied to user-defined
2904
+ types also. Fortunatly this is easy, when the user declares a b-tree
2905
+ operator class, we can generate the collations automatically. We may
2906
+ even allow the user to specify the name of the collation. However, if
2907
+ the user wanted to create multiple collations based on the same
2908
+ operator class (by using the PG_GETCOLLATE() macro above, we may want
2909
+ to provide them a way of creating them directly.
2910
+
2911
+ COLLATIONS USING LOCALES
2912
+
2913
+ For strings, collation can be done in many different ways defined by
2914
+ what is referred to as a locale. As indicated above in the definition
2915
+ of pg_collations, there is a colltype field. For most collations this
2916
+ will be 0 (simple collation). However, for strings the intention is to
2917
+ use a type 1 (using locales). In this case the last column refers to
2918
+ the OID of the locale, so you can many collations using the same
2919
+ operator class, but different locale oids. On a system level it changes
2920
+ nothing, but inside the functions implementing it, they should use
2921
+ PG_GETLOCALE(). This will return an opaque pg_locale_t (see below)
2922
+ handle which can then be used to implement the specifics.
2923
+
2924
+ In principle, user-defined types need to be able to use this also,
2925
+ perhaps by using the clause COLLATE USING LOCALE in the operator class.
2926
+ In theory there should a collation for each combination of
2927
+ locale-dependant datatype, locale and order ascending/descending.
2928
+ How/when these are created has not yet be determined.
2929
+
2930
+ MORE TYPES OF COLLATION
2931
+
2932
+ Another collation type I've speculated about but not thought about
2933
+ implementing is a "mapping collation", in which you map the values
2934
+ through a function and then collate that. The obvious example would be
2935
+ a case-insensetive mapping where lower is applied before collation.
2936
+
2937
+ Implementation could be pretty much done by simply substituting the
2938
+ functions into the parse tree. For example, if you defined something
2939
+ like:
2940
+
2941
+ CREATE COLLATION ignorecase ON text USING lower($1) COLLATE defaulttext;
2942
+
2943
+ Then anytime you did a comparison with that collation, you would simply
2944
+ insert those function calls into the parse-tree and then collate with
2945
+ "defaulttext". When declaring an index you would just make it a
2946
+ functional index. The rules for functional indexes should make it work
2947
+ out-of-the-box.
2948
+
2949
+ OTHER TECHNICAL ISSUES
2950
+
2951
+ - Applying a COLLATE clause to an unknown literal causes it to be
2952
+ coerced to the type that collation is based on. But what about if we
2953
+ have something like COLLATE DESC?
2954
+
2955
+ - This requires some changes in the bootstrap procedures given that we
2956
+ need to be able to do lookups on the operator class for each type
2957
+ fairly early on. Some are predefined but it does require moving the
2958
+ opclass setup further up the list. However, if we store a default
2959
+ collation in pg_type, we wouldn't need to do that.
2960
+
2961
+ - Sorting arrays. Should they get their own collations, or should
2962
+ they use the collations of their base types.
2963
+
2964
+ LOCALES
2965
+
2966
+ I've left this to the end because I don't want people distracted by
2967
+ what is essentially a side-issue. While this would be needed to
2968
+ implement COLLATE the way the SQL spec intended, it can actually be
2969
+ implemented and dealt with as a seperate patch. The main reason a basic
2970
+ implementation exists is that it provides a great way of finding places
2971
+ that didn't define a collation, since any comparison involving "text"
2972
+ requires one.
2973
+
2974
+ To deal with locales I created another table in the catalog,
2975
+ pg_locales. This provides an OID which can be referenced from
2976
+ elsewhere, such as the pg_collations table.
2977
+
2978
+ The design is intended to provide some pluggability, so locale
2979
+ information can come from multiple sources. Also, each locale will be
2980
+ referenced by an identifier which is unrelated to any external
2981
+ identifier, so we're not bound by them.
2982
+
2983
+ The columns defined currently are:
2984
+
2985
+ Name locname - Identifier used by postgresql
2986
+ Name locsysname - String identifying the locale for the locale provider
2987
+ int4 locsource - System providing this locale
2988
+ int4 locencoding - Encoding expected by provider
2989
+
2990
+ It is expected that the list of sources for locale data will be short,
2991
+ probably hard-coded into the backend (currenty internal/system/icu).
2992
+ The only locale defined at startup is POSIX, which is implemented
2993
+ internally. The intention is for any other locales to be defined at the
2994
+ end of initdb. The expected syntax is something like:
2995
+
2996
+ CREATE LOCALE hungarian AS 'hu_HU' USING glibc;
2997
+
2998
+ This should use the provider to check the locale exists and has a
2999
+ conpatible encoding. If so it is entered into the table ready for use.
3000
+
3001
+ In the backend, there will be implementations of functions like
3002
+ pg_strcoll_l, pg_localeconv_l, which work like the C system library
3003
+ versions only they take an extra pg_locale_t argument. This is used to
3004
+ dispatch the call to the right place. There will be a function to
3005
+ quickly determine if a locale is C to shortcircuit complexity where it
3006
+ is not needed.
3007
+
3008
+ STATUS
3009
+
3010
+ Implementation so far is available here:
3011
+
3012
+ http://svana.org/kleptog/temp/collate-current.patch.gz
3013
+
3014
+ This patch isn't "clean" and changes a few things that are not strictly
3015
+ necessary. It won't finish initdb right now because it gets an error in
3016
+ ANALYSE (the array issue above).
3017
+
3018
+ Feedback, help, comments: please reply.
3019
+
3020
+ Have a nice day,
3021
+ --=20
3022
+ Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
3023
+ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
3024
+ > tool for doing 5% of the work and then sitting around waiting for someone
3025
+ > else to do the other 95% so you can sue them.
3026
+
3027
+ --W/nzBZO5zC0uMSeA
3028
+ Content-Type: application/pgp-signature
3029
+ Content-Disposition: inline
3030
+
3031
+ -----BEGIN PGP SIGNATURE-----
3032
+ Version: GnuPG v1.0.6 (GNU/Linux)
3033
+ Comment: For info see http://www.gnupg.org
3034
+
3035
+ iD8DBQFDsUHDIB7bNG8LQkwRAnh0AJ0YUNLkVaSY3u0jWBPdlaq+9dujZACfTmis
3036
+ JR1mF60lKx14Ih850p3lpVk=
3037
+ =/Ghs
3038
+ -----END PGP SIGNATURE-----
3039
+
3040
+ --W/nzBZO5zC0uMSeA--
3041
+
0 commit comments