<!-- doc/src/sgml/tablesample-method.sgml -->

<chapter id="tablesample-method">
 <title>Writing A TABLESAMPLE Sampling Method</title>

 <indexterm zone="tablesample-method">
  <primary>tablesample method</primary>
 </indexterm>

 <para>
  The <command>TABLESAMPLE</command> clause implementation in
  <productname>PostgreSQL</> supports custom sampling methods.
  A sampling method controls which sample of the table is returned when the
  <command>TABLESAMPLE</command> clause is used.
 </para>

 <sect1 id="tablesample-method-functions">
  <title>Tablesample Method Functions</title>

  <para>
   A tablesample method must provide the following set of functions:
  </para>

  <para>
<programlisting>
void
tsm_init (TableSampleDesc *desc,
         uint32 seed, ...);
</programlisting>
   Initialize the tablesample scan. The function is called at the beginning
   of each relation scan.
  </para>
  <para>
   Note that the first two parameters are required, but you can specify
   additional parameters, which the <command>TABLESAMPLE</> clause then uses
   to determine the input the user must supply in the query itself.
   This means that if your function specifies an additional
   <type>float4</> parameter named <literal>pct</>, the user has to call the
   tablesample method with an expression that evaluates (or can be coerced)
   to <type>float4</>.
   For example, this definition:
<programlisting>
void
tsm_init (TableSampleDesc *desc,
          uint32 seed, float4 pct);
</programlisting>
   will lead to an SQL call like this:
<programlisting>
... TABLESAMPLE yourmethod(0.5) ...
</programlisting>
  </para>
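  <para>
   As an illustration only, the following standalone sketch shows how an
   initialization function might validate the extra parameter and keep it in
   private state. The names <function>yourmethod_init</> and
   <structname>YourMethodState</> are hypothetical, the one-field
   <structname>TableSampleDesc</> is a stand-in for the real structure, and
   the sketch assumes that <literal>pct</> is a percentage between 0 and 100.
   It returns a status code and uses <function>malloc</> so that it can be
   compiled outside the server; a real method returns
   <type>void</>, reports errors through the server's error mechanism, and
   allocates with <function>palloc</>.
  </para>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for PostgreSQL's TableSampleDesc; only tsmdata is modeled. */
typedef struct TableSampleDesc
{
    void       *tsmdata;        /* private state of the sampling method */
} TableSampleDesc;

/* Hypothetical private state of a method called "yourmethod". */
typedef struct YourMethodState
{
    uint32_t    seed;           /* seed handed to the init function */
    double      fraction;       /* pct converted to the range 0..1 */
} YourMethodState;

/*
 * Sketch of an init function with one extra user-visible parameter.
 * Assumption: pct is a percentage.  Returns 0 on success and -1 on
 * invalid input or allocation failure (a real method returns void and
 * raises an error instead).
 */
int
yourmethod_init(TableSampleDesc *desc, uint32_t seed, float pct)
{
    YourMethodState *state;

    assert(desc != NULL);

    /* Written this way so that NaN is rejected as well. */
    if (!(pct >= 0.0f && pct <= 100.0f))
        return -1;

    state = malloc(sizeof(YourMethodState));    /* palloc() in a backend */
    if (state == NULL)
        return -1;

    state->seed = seed;
    state->fraction = pct / 100.0;
    desc->tsmdata = state;
    return 0;
}
```

  <para>
   Under these assumptions, a call with <literal>pct</> set to 50 stores a
   fraction of 0.5, and an out-of-range value leaves the descriptor untouched.
  </para>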

  <para>
<programlisting>
BlockNumber
tsm_nextblock (TableSampleDesc *desc);
</programlisting>
   Returns the block number of the next page to be scanned.
   <literal>InvalidBlockNumber</> should be returned if the sampling has
   reached the end of the relation.
  </para>

  <para>
<programlisting>
OffsetNumber
tsm_nexttuple (TableSampleDesc *desc, BlockNumber blockno,
               OffsetNumber maxoffset);
</programlisting>
   Returns the offset of the next tuple on the current page.
   <literal>InvalidOffsetNumber</> should be returned if the sampling has
   reached the end of the page.
  </para>

  <para>
<programlisting>
void
tsm_end (TableSampleDesc *desc);
</programlisting>
   Called when the scan has finished; cleans up any leftover state.
  </para>

  <para>
<programlisting>
void
tsm_reset (TableSampleDesc *desc);
</programlisting>
   Called when the relation needs to be rescanned; resets any tablesample
   method state.
  </para>

  <para>
<programlisting>
void
tsm_cost (PlannerInfo *root, Path *path, RelOptInfo *baserel,
          List *args, BlockNumber *pages, double *tuples);
</programlisting>
   This function is used by the optimizer to choose the best plan, and is also
   used for the output of <command>EXPLAIN</>.
  </para>

  <para>
   There is one more function that a tablesample method can implement in order
   to gain more fine-grained control over sampling. This function is optional:
  </para>

  <para>
<programlisting>
bool
tsm_examinetuple (TableSampleDesc *desc, BlockNumber blockno,
                  HeapTuple tuple, bool visible);
</programlisting>
   This function enables the sampling method to examine the contents of a tuple
   (for example, to collect some internal statistics). Its return value
   determines whether the tuple is returned to the client.
   Note that this function also receives invisible tuples, but it is not
   allowed to return true for such a tuple (if it does,
   <productname>PostgreSQL</> will raise an error).
  </para>

  <para>
  As you can see, most of the tablesample method interfaces take a
  <structname>TableSampleDesc</> as their first parameter. This structure
  holds the state of the current scan and also provides storage for the
  tablesample method's state. It is defined as follows:
<programlisting>
typedef struct TableSampleDesc {
    HeapScanDesc    heapScan;
    TupleDesc       tupDesc;

    void           *tsmdata;
} TableSampleDesc;
</programlisting>
  Here, <structfield>heapScan</> is the descriptor of the physical table scan;
  table size information can be obtained from it. <structfield>tupDesc</>
  is the tuple descriptor of the tuples returned by the scan and passed
  to the <function>tsm_examinetuple()</> interface. <structfield>tsmdata</>
  can be used by the tablesample method itself to store any state information
  it needs during the scan. If the method uses it, the memory should be
  <function>pfree</>d in the <function>tsm_end()</> function.
  </para>
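  <para>
   The lifetime of <structfield>tsmdata</> can be sketched with a reset
   function that rewinds the private state and an end function that releases
   it. In this standalone example <structname>DemoState</>,
   <function>demo_reset</> and <function>demo_end</> are hypothetical, the
   first two fields of the descriptor are reduced to untyped pointers, and
   <function>free</> stands in for <function>pfree</>.
  </para>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in with the same three fields; the first two are left untyped. */
typedef struct TableSampleDesc
{
    void       *heapScan;       /* HeapScanDesc in PostgreSQL */
    void       *tupDesc;        /* TupleDesc in PostgreSQL */
    void       *tsmdata;        /* private state of the method */
} TableSampleDesc;

/* Hypothetical private state of the sampling method. */
typedef struct DemoState
{
    uint32_t    seed;           /* kept across rescans */
    uint32_t    nextblock;      /* scan position, rewound on reset */
} DemoState;

/* Rewind the scan position; keep the state itself for the rescan. */
void
demo_reset(TableSampleDesc *desc)
{
    DemoState  *state;

    assert(desc != NULL && desc->tsmdata != NULL);
    state = desc->tsmdata;
    state->nextblock = 0;
}

/* Release the private state; pfree() would be used in a backend. */
void
demo_end(TableSampleDesc *desc)
{
    assert(desc != NULL);
    free(desc->tsmdata);
    desc->tsmdata = NULL;       /* guard against accidental reuse */
}
```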
 </sect1>

</chapter>