
Cav 12 Num Synth


Synthesizing Number Transformations from Input-Output Examples
Rishabh Singh¹ and Sumit Gulwani²
¹ MIT CSAIL, Cambridge, MA, USA
² Microsoft Research, Redmond, WA, USA
Abstract. Numbers are one of the most widely used data types in programming languages. Number transformations like formatting and rounding present a challenge even for experienced programmers, as they find it difficult to remember the different number format strings supported by different programming languages. These transformations present an even bigger challenge for end-users of spreadsheet systems like Microsoft Excel, where providing such custom format strings is beyond their expertise. In our extensive case study of help forums of many programming languages and Excel, we found that both programmers and end-users struggle with these number transformations, but are able to easily express their intent using input-output examples.
In this paper, we present a framework that can learn such number transformations from very few input-output examples. We first describe an expressive number transformation language that can model these transformations, and then present an inductive synthesis algorithm that can learn all expressions in this language that are consistent with a given set of examples. We also present a ranking scheme of these expressions that enables efficient learning of the desired transformation from very few examples. By combining our inductive synthesis algorithm for number transformations with an inductive synthesis algorithm for syntactic string transformations, we are able to obtain an inductive synthesis algorithm for manipulating data types that have numbers as a constituent sub-type, such as date, unit, and time. We have implemented our algorithms as an Excel add-in and have evaluated it successfully over several benchmarks obtained from the help forums and the Excel product team.
1 Introduction
Numbers represent one of the most widely used data types in programming languages. Number transformations like formatting and rounding present a challenge even for experienced programmers. First, the custom number format strings for formatting numbers are complex and take some time to get accustomed to, and second, different programming languages support different format strings, which makes it difficult for programmers to remember each variant.

Work done during an internship at Microsoft Research.


Number transformations present an even bigger challenge for end-users: the large class of users who do not have a programming background but want to create small, often one-off, applications to support business functions [4]. Spreadsheet systems like Microsoft Excel support a finite set of commonly used number formats and also let users write their own custom formats using a number formatting language similar to that of .Net. This hard-coded set of number formats is often insufficient for the users' needs, and providing custom number formats is typically beyond their expertise. This leads them to solicit help on various online help forums, where experts typically respond with the desired formulas (or scripts) after a few rounds of interaction, which can span a few days.
In an extensive case study of help forums of many programming languages
and Excel, we found that even though both programmers and end-users struggled
while performing these transformations, they were able to easily express their
intent using input-output examples. In fact, in some cases the initial English
description of the task provided by the users on forums was inaccurate and
only after they provided a few input-output examples, the forum experts could
provide the desired code snippet.
In this paper, we present a framework to learn number formatting and rounding transformations from a given set of input-output examples. We first describe a domain-specific language for performing number transformations and an inductive synthesis algorithm to learn the set of all expressions that are consistent with the user-provided examples. The key idea in the algorithm is to use the interval abstract domain [2] to represent a large collection of consistent format expressions symbolically, which also allows for efficient intersection, enumeration, and execution of these expressions. We also present a ranking mechanism to rank these expressions that enables efficient learning of the desired transformation from very few examples.
We then combine the number transformation language with a syntactic string transformation language [6] and present an inductive synthesis algorithm for the combined language. The combined language lets us model transformations on strings that represent data types that have numbers as a constituent sub-type, such as date, unit, time, and currency. The key idea in the algorithm is to succinctly represent an exponential number of consistent expressions in the combined language using a DAG data structure, which is similar to the BDD [1] representation of Boolean formulas. The DAG data structure consists of program expressions on the edges (as opposed to Boolean values on BDD edges). Similar to BDDs, our data structure does not create a quadratic blowup after intersection in practice.
We have implemented our algorithms both as a stand-alone binary and as
an Excel add-in. We have evaluated it successfully on over 50 representative
benchmark problems obtained from help forums and the Excel product team.
This paper makes the following key contributions:
- We develop an expressive number transformation language for performing number formatting and rounding transformations, and an inductive synthesis algorithm for learning expressions in it.
- We combine the number transformation language with a syntactic string transformation language to manipulate richer data types.
- We describe an experimental prototype of our system with an attractive user interface that is ready to be deployed. We present the evaluation of our system over a large number of benchmark examples.
2 Motivating Examples
We motivate our framework with the help of a few examples taken from Excel
help forums.
Example 1 (Date Manipulation). An Excel user stated that, as an unavoidable outcome of data extraction from a software package, she ended up with a series of dates in the input column $v_1$ as shown in the table. She wanted to convert them into a consistent date format as shown in the output column such that both month and day in the date are of two digits.

Input v1     Output
1112011      01/11/2011
12012011     12/01/2011
1252010      01/25/2010
11152011     11/15/2011

It turns out that no Excel date format string matches the strings in input column $v_1$. The user struggled to format the date as desired and posted the problem on a help forum. After a few rounds of interactions (in which the user provided additional examples), the user managed to obtain the following formula for performing the transformation:
TEXT(IF(LEN(A1)=8,DATE(RIGHT(A1,4),MID(A1,3,2),LEFT(A1,2)),
DATE(RIGHT(A1,4),MID(A1,2,2),LEFT(A1,1))),"mm/dd/yyyy")
In our tool, the user has to provide only the first two input-output examples from which the tool learns the desired transformation, and executes the synthesized transformation on the remaining strings in the input column to produce the corresponding outputs (shown in bold font for emphasis).
We now briefly describe some of the challenges involved in learning this transformation. We first require a way to extract different substrings of the input date for extracting the day, month, and year parts of the date, which can be performed using the syntactic string transformation language [6]. We then require a number transformation language that can map 1 to 01, i.e. format a number to two digits. Consider the first input-output example 1112011 -> 01/11/2011. The first two characters in the output string can be obtained by extracting 1 from the input string from any of the five locations where it occurs, and formatting it to 01 using a number format expression. Alternatively, the first 0 in the output string can also be a constant string or can be obtained from the 3rd last character in the input. All these different choices for each substring of the output string lead to an exponential number of choices for the complete transformation. We use an efficient data structure for succinctly representing such an exponential number of consistent expressions in polynomial space.
Example 2 (Duration Manipulation). An Excel user wanted to convert the raw data in the input column to the lower range of the corresponding 30-min interval as shown in the output column.

Input v1       Output
0d 5h 26m      5:00
0d 4h 57m      4:30
0d 4h 27m      4:00
0d 3h 57m      3:30

An expert responded by providing the following macro, which is quite unreadable and error-prone.
FLOOR(TIME(MID(C1,FIND(" ",C1)+1,FIND("h",C1)- FIND(" ",C1)-1)+0,
MID(C1,FIND("h",C1)+2,FIND("m",C1)-FIND("h",C1)-2)+0,0)*24,0.5)/24
Our tool learns the desired transformation using only the first two examples. In this case, we first need to be able to extract the hour and minute components of the duration in input column $v_1$, and then perform a rounding operation on the minute part of the input to round it to the lower 30-min interval.
3 Overview of the Synthesis Approach
In this section, we define the formalism that we use in the paper for developing inductive synthesizers [8].
Domain-specific language: We develop a domain-specific language $L$ that is expressive enough to capture the desired tasks and, at the same time, is concise for enabling efficient learning from examples.
Data structure for representing a set of expressions: The number of expressions that are consistent with a given input-output example can potentially be very large. We, therefore, develop an efficient data structure $D$ that can succinctly represent a large number of expressions in $L$.
Synthesis algorithm: The synthesis algorithm Synthesize consists of the following two procedures:
- GenerateStr: The GenerateStr procedure learns the set of all expressions in the language $L$ (represented using the data structure $D$) that are consistent with a given input-output example $(\sigma_i, s_i)$. An input state $\sigma$ holds values for $m$ string variables $v_1, \ldots, v_m$ (denoting $m$ input columns in a spreadsheet).
- Intersect: The Intersect procedure intersects two sets of expressions to compute the common set of expressions.
The synthesis algorithm Synthesize takes as input a set of input-output examples and generates a set of expressions in $L$ that are consistent with them. It uses the GenerateStr procedure to generate a set of expressions for each individual input-output example and then uses the Intersect procedure to intersect the corresponding sets to compute the common set of expressions.

Synthesize(($\sigma_1$, $s_1$), . . . , ($\sigma_n$, $s_n$))
  $P$ := GenerateStr($\sigma_1$, $s_1$);
  for $i$ = 2 to $n$:
    $P'$ := GenerateStr($\sigma_i$, $s_i$);
    $P$ := Intersect($P$, $P'$);
  return $P$;
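The Synthesize loop can be sketched in a few lines of Python. This is only an illustration of the fold over examples: `generate_str` and `intersect` are passed in as parameters, and the toy instantiation (a "language" of add-a-constant programs stored as an explicit Python set) is a hypothetical stand-in for the data structure $D$, not the representation used in the paper.

```python
def synthesize(examples, generate_str, intersect):
    """Intersect the expression sets learned from each (state, output) example."""
    examples = list(examples)
    if not examples:
        raise ValueError("at least one input-output example is required")
    state, out = examples[0]
    result = generate_str(state, out)
    for state, out in examples[1:]:
        result = intersect(result, generate_str(state, out))
    return result


# Hypothetical toy language: programs of the form "add the constant k".
def toy_generate(state, out):
    # The only program consistent with one example is k = out - state.
    return {out - state}


def toy_intersect(p, q):
    return p & q


print(synthesize([(1, 3), (5, 7)], toy_generate, toy_intersect))
```

With an explicit set the intersection is trivial; the point of the data structures in the following sections is to keep both procedures cheap when the sets are very large.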
Ranking: Since there are typically a large number of consistent expressions for each input-output example, we rank them using the Occam's razor principle, which states that smaller and simpler explanations are usually the correct ones. This enables users to provide only a few input-output examples for quick convergence to the desired transformation.
4 Number Transformations
In this section, we first describe the number transformation language $L_n$ that can perform formatting and rounding transformations on numbers. We then describe an efficient data structure to succinctly represent a large number of expressions in $L_n$, and present an inductive synthesis algorithm to learn all expressions in the language that are consistent with a given set of input-output examples.
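4.1 Number Transformation Language $L_n$

(a) Syntax
Expr.        $e_n$ := Dec($u$, $\eta_1$, $f$) | Exp($u$, $\eta_1$, $f$, $\eta_2$) | Ord($u$) | Word($u$) | $u$
Dec. Fmt.    $f$ := ($\diamond$, $\eta$) | $\bot$
Number       $u$ := $v_i$ | Round($v_i$, $r$)
Round Fmt.   $r$ := ($z$, $\delta$, $m$)
Mode         $m$ := $\uparrow$ | $\downarrow$ | $\leftrightarrow$
Num. Fmt.    $\eta$ := ($\alpha$, $\beta$, $\gamma$)

(b) Semantics
[[Dec($u$, $\eta_1$, $f$)]] = [[(Int([[$u$]])$^R$, $\eta_1$)]]$^R$ $\cdot$ [[$f$]]
[[Exp($u$, $\eta_1$, $f$, $\eta_2$)]] = [[(Int([[$u$]])$^R$, $\eta_1$)]]$^R$ $\cdot$ [[$f$]] $\cdot$ [[(E([[$u$]])$^R$, $\eta_2$)]]$^R$
[[Ord($u$)]] = numToOrd([[$u$]])
[[Word($u$)]] = numToWord([[$u$]])
[[($\diamond$, $\eta$)]] = [[$\diamond$]] $\cdot$ [[(Frac([[$u$]]), $\eta$)]]
[[$\bot$]] = $\epsilon$
[[$v_i$]] = $\sigma(v_i)$
[[Round($v_i$, $r$)]] = RoundNumber($\sigma(v_i)$, $z$, $\delta$, $m$), where $r$ = ($z$, $\delta$, $m$)
[[($d$, $\eta$)]] = FormatDigits($d$, $\alpha$, $\beta$, $\gamma$), where $\eta$ = ($\alpha$, $\beta$, $\gamma$)

Fig. 1. The (a) syntax and (b) semantics of the number transformation language $L_n$. The variable $v_i$ denotes an input number variable, $z$, $\delta$, $\alpha$, $\beta$, and $\gamma$ are integer constants, and $\cdot$ denotes the concatenation operation.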
The syntax of the number transformation language $L_n$ is shown in Figure 1(a). The top-level expression $e_n$ of the language denotes a number formatting expression of one of the following forms:
- Dec($u$, $\eta_1$, $f$): formats the number $u$ in decimal form (e.g. 1.23), where $\eta_1$ denotes the number format for the integer part of $u$ (Int($u$)), and $f$ represents the optional format consisting of the decimal separator and the number format for the fractional part (Frac($u$)).
- Exp($u$, $\eta_1$, $f$, $\eta_2$): formats the number $u$ in exponential form (e.g. 1.23E+2). Compared to the decimal format expression, it has an additional number format $\eta_2$, which denotes the number format of the exponent digits of $u$.
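(a)
RoundNumber($n$, $z$, $\delta$, $m$)
1 $n' := \lfloor (n - z) / \delta \rfloor \times \delta + z$;
2 if ($n = n'$) return $n$;
3 if ($m = \uparrow$) return $n' + \delta$;
4 if ($m = \downarrow$) return $n'$;
5 if ($m = \leftrightarrow \wedge (n - n') \times 2 < \delta$) return $n'$;
6 if ($m = \leftrightarrow \wedge (n - n') \times 2 \geq \delta$) return $n' + \delta$;

(b)
FormatDigits($d$, $\alpha$, $\beta$, $\gamma$)
1 if (len($d$) $\geq \beta$)
2   return significant($d$, $\beta$);
3 else if (len($d$) $\geq \alpha$)
4   { $z$ := 0; $s$ := 0; }
5 else { $s$ := Min($\gamma$, $\alpha$ - len($d$));
6   $z$ := $\alpha$ - len($d$) - $s$; }
7 return concat($d$, $0^z$, ␣$^s$);

Fig. 2. The functions (a) RoundNumber for rounding numbers and (b) FormatDigits for formatting a digit string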
- Ord($u$): formats the number $u$ in ordinal form, e.g. it formats the number 4 to its ordinal form 4th.
- Word($u$): formats the number $u$ in word form, e.g. it formats the number 4 to its word form four.
The number $u$ can either be an input number variable $v_i$ or a number obtained after performing a rounding transformation on an input number. A rounding transformation Round($v_i$, $z$, $\delta$, $m$) performs the rounding of the number present in $v_i$ based on its rounding format $(z, \delta, m)$, where $z$ denotes the zero of the rounding interval, $\delta$ denotes the interval size of the rounding interval, and $m$ denotes one of the rounding modes from the set of modes {upper ($\uparrow$), lower ($\downarrow$), nearest ($\leftrightarrow$)}.
We define a digit string $d$ to be a sequence of digits with trailing whitespaces. A number format $\eta$ of a digit string $d$ is defined by a 3-tuple $(\alpha, \beta, \gamma)$, where $\alpha$ denotes the minimum number of significant digits and trailing whitespaces of $d$ in the output string, $\beta$ denotes the maximum number of significant digits of $d$ in the output string, and $\gamma$ denotes the maximum number of trailing whitespaces in the output string. A number format, thus, maintains the invariant: $\gamma \leq \alpha \leq \beta$.
The semantics of language $L_n$ is shown in Figure 1(b). A digit string $d$ is formatted with a number format $(\alpha, \beta, \gamma)$ using the FormatDigits function shown in Figure 2(b). The FormatDigits function returns the first $\beta$ digits of the digit string $d$ (with appropriate rounding) if the length of $d$ is greater than the maximum number of significant digits $\beta$ to be printed. If the length of $d$ is less than $\beta$ but greater than the minimum number of significant digits $\alpha$ to be printed, it returns the digits themselves. Finally, if the length of $d$ is less than $\alpha$, it appends to the digit string the appropriate number of zeros ($z$) and whitespaces ($s$) as computed in Lines 5 and 6. The semantics of the rounding transformation is to perform the appropriate rounding of the number denoted by $v_i$ using the RoundNumber function shown in Figure 2(a). The function computes a number $n'$ which lies on the number line defined by zero $z$ with unit separation $\delta$, as shown in Figure 3. It returns the value $n'$ or $(n' + \delta)$ based on the rounding mode $m$ and the distance between $n$ and $n'$, as described in Figure 2(a).

Fig. 3. The RoundNumber function rounding off number $n$ to $n'$ or $n' + \delta$

The semantics of a decimal form formatting expression on a number $u$ is to concatenate the reverse of the string obtained by formatting the reverse of the integral part Int($u$) with the string obtained from the decimal format $f$. Since the FormatDigits function adds only trailing zeros and whitespaces to format a digit string, the formatting of the integer part of $u$ is performed on its reverse digit string and the resulting formatted string is reversed again before performing the concatenation. The semantics of decimal format $f$ is to concatenate the decimal separator $\diamond$ with the string obtained by formatting the fractional part Frac($u$). The semantics of the exponential form formatting expression is similar to that of the decimal form formatting expression, and the semantics of the ordinal form and word form formatting expressions is to simply convert the number $u$ into its corresponding ordinal form and word form respectively.
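The decimal-form semantics can be traced with a short Python sketch. It follows FormatDigits from Figure 2(b) and the reverse-format-reverse treatment of the integer part described above. The helper names (`significant`, `format_digits`, `dec`) and the half-up rounding inside `significant` are assumptions made for illustration; in particular, a carry that would overflow the kept digits is not propagated into the integer part.

```python
import math

INF = math.inf


def significant(d, beta):
    """First `beta` digits of d, rounded half up on the first dropped digit.

    Assumption: a carry out of the kept digits is not handled here.
    """
    kept, rest = d[:beta], d[beta:]
    if kept and rest and rest[0] >= "5":
        kept = str(int(kept) + 1).zfill(len(kept))
    return kept


def format_digits(d, alpha, beta, gamma):
    """FormatDigits: truncate to beta digits, or pad up to alpha characters."""
    if len(d) >= beta:
        return significant(d, beta)
    if len(d) >= alpha:
        return d
    s = min(gamma, alpha - len(d))   # trailing whitespaces
    z = alpha - len(d) - s           # trailing zeros
    return d + "0" * z + " " * s


def dec(text, eta1, sep=None, eta2=None):
    """Dec(u, eta1, f) on a plain decimal literal such as '53.5645'."""
    int_part, _, frac_part = text.partition(".")
    # The integer part is formatted on its reversed digit string.
    out = format_digits(int_part[::-1], *eta1)[::-1]
    if sep is not None:
        out += sep + format_digits(frac_part, *eta2)
    return out


for value in ["3264.28", "53.5645", "235"]:
    print(repr(dec(value, (4, INF, 4), ".", (4, INF, 3))))
```

The loop formats three values with the pair of formats $(4, \infty, 4)$ and $(4, \infty, 3)$, which pads the integer part on the left with spaces and gives a number without a fractional part a single 0 after the separator.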
We now present some examples taken from various help forums that can be represented in the number transformation language $L_n$.
Example 3. A Python programmer posted a query on the StackOverflow forum after struggling to print double values from an array of doubles (of different lengths) such that the decimal point for each value is aligned consistently across different columns. He posted an example of the desired formatting as shown in the table. He also wanted to print a single 0 after the decimal if the double value had no decimal part.

Input v1     Output
3264.28      3264.28
53.5645      53.5645
235          235.0
5.23         5.23
345.213      345.213
3857.82      3857.82
536          536.0

The programmer started the post saying "This should be easy". An expert replied that after a thorough investigation, he couldn't find a way to perform this task without some post-processing. The expert provided the following Python snippet that pads spaces to the left and zeros to the right of the decimal, and then removes trailing zeros:

    import re
    import textwrap

    # `a` is the list of doubles to print
    ut0 = re.compile(r'(\d)0+$')
    thelist = textwrap.dedent(
        '\n'.join(ut0.sub(r'\1', "%20f" % x) for x in a)).splitlines()
    print('\n'.join(thelist))

This formatting transformation can be represented in $L_n$ as Dec($v_1$, $\eta_1$, (".", $\eta_2$)), where $\eta_1 \equiv (4, \infty, 4)$ and $\eta_2 \equiv (4, \infty, 3)$.
Example 4. This is an interesting post taken from a help forum where the user initially posted that she wanted to round numbers in an Excel column to the nearest 45 or 95, but the examples later showed that she actually wanted to round them to the upper 45 or 95.

Input v1     Output
11           45
32           45
46           95
1865         1895

Some of the solutions suggested by experts were:
=Min(Roundup(A1/45,0)*45,Roundup(A1/95,0)*95)
=CEILING(A1+5,50)-5
=A1-MOD(A1,100)+IF(MOD(A1,100)>45,95,45)
This rounding transformation can be expressed in our language as:
Dec(Round($v_1$, (45, 50, $\uparrow$)), (0, $\infty$, 0), $\bot$).
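As a quick check of the rounding format $(45, 50, \uparrow)$, the RoundNumber function of Figure 2(a) can be transcribed into Python. The string mode names `"upper"`, `"lower"`, and `"nearest"` are labels chosen for this sketch; the arithmetic follows the figure.

```python
import math


def round_number(n, z, delta, mode):
    """RoundNumber: snap n onto the grid {z + k * delta} according to `mode`."""
    n_prime = math.floor((n - z) / delta) * delta + z  # grid point at or below n
    if n == n_prime:
        return n
    if mode == "upper":
        return n_prime + delta
    if mode == "lower":
        return n_prime
    # mode == "nearest": ties go to the upper grid point
    if (n - n_prime) * 2 < delta:
        return n_prime
    return n_prime + delta


print([round_number(n, 45, 50, "upper") for n in (11, 32, 46, 1865)])
```

The same function with $z = 0$, $\delta = 30$, and the lower mode covers the minute rounding needed in Example 2.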
4.2 Data structure for a set of expressions in $L_n$
Figure 4 describes the syntax and semantics of the data structure for succinctly representing a set of expressions from language $L_n$. The expressions $\tilde{e}_n$ are now associated with a set of numbers $\tilde{u}$ and a set of number formats $\tilde{\eta}$. We represent the set of numbers obtained after performing a rounding transformation in two ways: Round($v_i$, $\tilde{r}$) and Round($v_i$, $n_p$), which we describe in more detail in Section 4.3. The set of number formats $\tilde{\eta}$ is represented using a 3-tuple $(i_1, i_2, i_3)$, where $i_1$, $i_2$ and $i_3$ denote a set of values of $\alpha$, $\beta$ and $\gamma$ respectively using an interval domain. This representation lets us represent $O(n^3)$ number format expressions in $O(1)$ space, where $n$ denotes the length of each interval.
The semantics of evaluating the set of rounding transformations Round($v_i$, $\tilde{r}$) is to return the set of results of performing the rounding transformation on $v_i$ for all rounding formats in the set $\tilde{r}$. The expression Round($v_i$, $(n_1, n_1')$) represents an infinite number of rounding transformations (as there exists an infinite number of rounding formats that conform to the rounding transformation $n_1 \rightarrow n_1'$). For evaluating this expression, we select one conforming rounding format with $z = 0$, $\delta = n_1'$ and an appropriate $m$ as shown in the figure. The evaluation of a set of format strings $\tilde{\eta} = (i_1, i_2, i_3)$ on a digit string $d$ returns a set of values, one for each possible combination of $\alpha \in i_1$, $\beta \in i_2$ and $\gamma \in i_3$. Similarly, we obtain a set of values from the evaluation of expression $\tilde{e}_n$.
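To make the space argument concrete, here is a small Python sketch of the interval triple. A set of number formats is stored as three `(lo, hi)` pairs, so membership and size are computed without listing the formats. The function names and the use of `float("inf")` for an unbounded upper end are choices made for this illustration.

```python
INF = float("inf")


def contains(fmt_set, fmt):
    """Is the format (alpha, beta, gamma) a member of the set (i1, i2, i3)?"""
    return all(lo <= x <= hi for (lo, hi), x in zip(fmt_set, fmt))


def size(fmt_set):
    """Number of formats represented; infinite if an interval is unbounded."""
    total = 1
    for lo, hi in fmt_set:
        if hi < lo:
            return 0          # an empty interval makes the whole set empty
        total *= hi - lo + 1
    return total


fmt_set = ((0, 3), (3, INF), (0, 3))
print(contains(fmt_set, (2, 5, 1)), size(((0, 3), (3, 3), (0, 3))))
```

Three pairs of integers stand for 16 formats in the bounded case and for infinitely many when the interval for $\beta$ is unbounded.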
4.3 Synthesis Algorithm
Procedure GenerateStr$_n$: The algorithm GenDFmt in Figure 5 takes as input two digit sequences $d_1$ and $d_2$, and computes the set of all number formats $\tilde{\eta}$ that are consistent for formatting $d_1$ to $d_2$. The algorithm first converts the digit sequence $d_1$ to its canonical form $d_1'$ by removing trailing zeros and whitespaces from $d_1$. It then compares the lengths $l_1$ of $d_1'$ and $l_2$ of $d_2$. If $l_1$ is greater than $l_2$, then we can be sure that the digits got truncated and can therefore set the interval for $i_2$ (the maximum number of significant digits) to be $[l_2, l_2]$. The intervals for $\alpha$ and $\gamma$ are set to $[0, l_2]$ because of the number format invariant. On the other hand, if $l_1$ is smaller than $l_2$, we can be sure that the least number of significant digits needs to be $l_2$, i.e. we can set the interval $i_1$ to be $[l_2, l_2]$. Also, we can set the interval $i_2$ to $[l_2, \infty]$ because of the number format invariant. For interval $i_3$, we either set it to $[\zeta, \zeta]$ (when $l_2 - \zeta \neq l_1$) or $[\zeta, l_2]$ (when $l_2 - \zeta = l_1$), where $\zeta$ denotes the number of trailing spaces in $d_2$. In the former case, we can be sure about the exact number of trailing whitespaces to be printed.
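The case analysis of GenDFmt can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the text: the test that decides between the two intervals for $i_3$ is taken to be whether all padding consists of whitespace (`l2 - zeta == l1`), and the interval intersection uses the Max/Min rule for intervals given in Figure 6.

```python
INF = float("inf")


def gen_dfmt(d1, d2):
    """Intervals (i1, i2, i3) for alpha, beta, gamma that format d1 as d2."""
    d1c = d1.rstrip("0 ")                    # canonical form of the input digits
    l1, l2 = len(d1c), len(d2)
    zeta = len(d2) - len(d2.rstrip(" "))     # trailing spaces in the output
    if l1 > l2:                              # digits were truncated
        return ((0, l2), (l2, l2), (0, l2))
    if l1 < l2:                              # digits were padded
        # Assumed condition: padding is whitespace only, so gamma may be larger.
        i3 = (zeta, l2) if l2 - zeta == l1 else (zeta, zeta)
        return ((l2, l2), (l2, INF), i3)
    return ((0, l2), (l2, INF), (0, l2))


def intersect(a, b):
    """Component-wise interval intersection: (Max of lows, Min of highs)."""
    return tuple((max(x[0], y[0]), min(x[1], y[1])) for x, y in zip(a, b))


# Fractional parts of 123.4567 -> 123.46 and 123.4 -> 123.40
print(intersect(gen_dfmt("4567", "46"), gen_dfmt("4", "40")))
```

Under these assumptions the two fractional-part examples intersect to $([2, 2], [2, 2], [0, 0])$.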
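(a) Syntax
$\tilde{e}_n$ := Dec($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$) | Exp($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$, $\tilde{\eta}_2$) | Ord($\tilde{u}$) | Word($\tilde{u}$) | $\tilde{u}$
$\tilde{f}$ := ($\diamond$, $\tilde{\eta}$) | $\bot$
$\tilde{u}$ := $v_i$ | Round($v_i$, $\tilde{r}$) | Round($v_i$, $n_p$)
Pair $n_p$ := ($n_1$, $n_1'$)
$\tilde{\eta}$ := ($i_1$, $i_2$, $i_3$)
Interval $i$ := ($l$, $h$)

(b) Semantics
[[Dec($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$)]] = {Dec($u$, $\eta_1$, $f$) | $u \in \tilde{u}$, $\eta_1 \in \tilde{\eta}_1$, $f \in \tilde{f}$}
[[Exp($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$, $\tilde{\eta}_2$)]] = {Exp($u$, $\eta_1$, $f$, $\eta_2$) | $u \in \tilde{u}$, $\eta_1 \in \tilde{\eta}_1$, $f \in \tilde{f}$, $\eta_2 \in \tilde{\eta}_2$}
[[Ord($\tilde{u}$)]] = {Ord($u$) | $u \in \tilde{u}$}
[[Word($\tilde{u}$)]] = {Word($u$) | $u \in \tilde{u}$}
[[($\diamond$, $\tilde{f}$)]] = {($\diamond$, $f$) | $f \in \tilde{f}$}
[[$\bot$]] = $\bot$
[[$v_i$]] = {$v_i$}
[[Round($v_i$, $\tilde{r}$)]] = {Round($v_i$, ($z$, $\delta$, $m$)) | ($z$, $\delta$, $m$) $\in \tilde{r}$}
[[Round($v_i$, $n_p$)]] = {Round($v_i$, (0, $n_1'$, $m$)) | $n_p \equiv$ ($n_1$, $n_1'$), if ($n_1 \leq n_1'$) $m \equiv \uparrow$ else $m \equiv \downarrow$}
[[($d$, ($i_1$, $i_2$, $i_3$))]] = {($d$, $\alpha$, $\beta$, $\gamma$) | $\alpha \in i_1$, $\beta \in i_2$, $\gamma \in i_3$}

Fig. 4. The (a) syntax and (b) semantics of a data structure for succinctly representing a set of expressions from language $L_n$.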
The GenerateStr$_n$ algorithm in Figure 5 learns the set of all expressions in $L_n$ that are consistent with a given input-output example. The algorithm searches over all input variables $v_i$ to find the inputs from which the output number $n'$ can be obtained. It first converts the numbers $\sigma(v_i)$ and $n'$ to their canonical forms $n_c$ and $n_c'$ respectively in Line 3. We define the canonical form of a number to be its decimal value. If the two canonical forms $n_c$ and $n_c'$ are not equal, the algorithm tries to learn a rounding transformation such that $n_c$ can be rounded to $n_c'$. We note that there is not enough information present in one input-output example pair to learn the exact rounding format, as there exists an infinite family of such formats that are consistent. Therefore, we represent such rounding formats symbolically using the input-output example pair $(n_c, n_c')$, which gets concretized by the Intersect method in Figure 6. The algorithm then normalizes the number $\sigma(u)$ with respect to $n'$ using the Normalize method in Line 6 to obtain $n = (n_i, n_f, n_e)$ such that both $n$ and $n'$ are of the same form. For decimal and exponential forms, it learns a set of number formats for each of its constituent digit strings from the pairs $(n_i^R, n_i'^R)$, $(n_f, n_f')$, and $(n_e^R, n_e'^R)$, where $n_i^R$ denotes the reverse of digit string $n_i$. As noted earlier, we need to learn the number format on the reversed digit strings for integer and exponential parts. For ordinal and word type numbers, it simply returns the expressions to compute ordinal and word forms of the corresponding input number respectively.
Procedure Intersect$_n$: The Intersect$_n$ procedure for intersecting two sets of $L_n$ expressions is described as a set of rules in Figure 6. The procedure computes the intersection of sets of expressions by recursively computing the intersection of their corresponding sets of sub-expressions. We describe below the four cases of
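GenDFmt($d_1$: inp digits, $d_2$: out digits)
1 $d_1'$ := RemoveTrailingZerosSpaces($d_1$);
2 $l_1$ := len($d_1'$); $l_2$ := len($d_2$);
3 $\zeta$ := numTrailingSpaces($d_2$);
4 if ($l_1 > l_2$)
5   ($i_1$, $i_2$, $i_3$) := ($[0, l_2]$, $[l_2, l_2]$, $[0, l_2]$);
6 else if ($l_1 < l_2$) {
7   $i_1$ := $[l_2, l_2]$; $i_2$ := $[l_2, \infty]$;
8   if ($l_2 - \zeta = l_1$) $i_3$ := $[\zeta, l_2]$;
9   else $i_3$ := $[\zeta, \zeta]$; }
10 else ($i_1$, $i_2$, $i_3$) := ($[0, l_2]$, $[l_2, \infty]$, $[0, l_2]$);
11 return ($i_1$, $i_2$, $i_3$);

Normalize($n$: inp number, $n'$: out number)
  $n_1$ = $n$ = ($n_i$, $n_f$, $n_e$);
  if (Type($n$) = ExpNum $\wedge$ Type($n'$) $\neq$ ExpNum)
    $n_1$ := $n \times 10^{n_e}$;
  if (Type($n$) $\neq$ ExpNum $\wedge$ Type($n'$) = ExpNum)
    { $n'$ = ($n_i'$, $n_f'$, $n_e'$); $n_1$ := $n / 10^{n_e'}$; }
  return $n_1$;

GenerateStr$_n$($\sigma$: inp state, $n'$: out number)
1 $S_n$ := $\emptyset$;
2 foreach input variable $v_i$:
3   $n_c$ = Canonical($\sigma(v_i)$); $n_c'$ = Canonical($n'$);
4   if ($n_c \neq n_c'$) $u$ := Round($v_i$, ($n_c$, $n_c'$));
5   else $u$ := $v_i$;
6   ($n_i$, $n_f$, $n_e$) := Normalize($\sigma(u)$, $n'$);
7   match $n'$ with
8   DecNum($n_i'$, $n_f'$, $\diamond$) $\rightarrow$
9     $\tilde{\eta}_1$ := GenDFmt($n_i^R$, $n_i'^R$);
10    if ($\diamond = \epsilon$) $S_n$ := $S_n \cup$ Dec($u$, $\tilde{\eta}_1$, $\bot$);
11    else { $\tilde{\eta}_2$ := GenDFmt($n_f$, $n_f'$);
12      $S_n$ := $S_n \cup$ Dec($u$, $\tilde{\eta}_1$, $\diamond$, $\tilde{\eta}_2$); }
13  ExpNum($n_i'$, $n_f'$, $n_e'$, $\diamond$) $\rightarrow$
14    $\tilde{\eta}_1$ := GenDFmt($n_i^R$, $n_i'^R$);
15    $\tilde{\eta}_3$ := GenDFmt($n_e^R$, $n_e'^R$);
16    if ($\diamond = \epsilon$) $S_n$ := $S_n \cup$ Exp($u$, $\tilde{\eta}_1$, $\bot$, $\tilde{\eta}_3$);
17    else { $\tilde{\eta}_2$ := GenDFmt($n_f$, $n_f'$);
18      $S_n$ := $S_n \cup$ Exp($u$, $\tilde{\eta}_1$, $\diamond$, $\tilde{\eta}_2$, $\tilde{\eta}_3$); }
19  OrdNum($n_i'$) $\rightarrow$
20    $S_n$ := $S_n \cup$ Ord($u$);
21  WordNum($n_i'$) $\rightarrow$
22    $S_n$ := $S_n \cup$ Word($u$);
23 return $S_n$;

Fig. 5. The GenerateStr$_n$ procedure for generating the set of all expressions in language $L_n$ that are consistent with the given set of input-output examples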
intersecting rounding transformation expressions. The first case is of intersecting a finite rounding format set $\tilde{r}$ with another finite set $\tilde{r}'$. The other two cases intersect a finite set $\tilde{r}$ with an input-output pair $n_p$, which is performed by selecting a subset of the finite set of rounding formats that are consistent with the pair $n_p$. The final case of intersecting two input-output pairs to obtain a finite set of rounding formats is performed using the IntersectPair algorithm shown in Figure 7.
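IntersectPair(($n_1$, $n_1'$), ($n_2$, $n_2'$))
  $z$ := $n_1'$;
  $\tilde{\delta}$ := Divisors($|n_2' - n_1'|$);
  $S$ := $\emptyset$;
  foreach $\delta \in \tilde{\delta}$:
    if ($\delta \geq$ Max($|n_1 - n_1'|$, $|n_2 - n_2'|$))
      if ($2 \times$ Max($|n_1 - n_1'|$, $|n_2 - n_2'|$) $\leq \delta$)
        $S$ := $S \cup$ ($z$, $\delta$, $\leftrightarrow$);
      if ($n_1 > n_1' \wedge n_2 > n_2'$)
        $S$ := $S \cup$ ($z$, $\delta$, $\downarrow$);
      if ($n_1 < n_1' \wedge n_2 < n_2'$)
        $S$ := $S \cup$ ($z$, $\delta$, $\uparrow$);
  return $S$;

Fig. 7. Intersection of Round expressions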
Consider the example of rounding numbers to nearest 45 or 95 for which we have the following two examples: $32 \rightarrow 45$ and $81 \rightarrow 95$. Our goal is to learn the rounding format $(z, \delta, m)$ that can perform the desired rounding transformation. We represent the infinite family of formats that satisfy the rounding constraint for each example as individual pairs (32, 45) and (81, 95) respectively. When we intersect these pairs, we can assign $z$ to be 45 without loss of generality. We then compute all divisors $\tilde{\delta}$ of $95 - 45 = 50$. With the constraint that $\delta \geq$ (Max($45 - 32$, $95 - 81$) = 14), we finally arrive at the set $\tilde{\delta}$ = {25, 50}. The rounding modes $m$ are appropriately learned as shown in Figure 7. For decimal numbers, we compute the divisors by first scaling them appropriately and then re-scaling them back for learning the rounding formats. In our data structure, we do not store all divisors
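Intersect$_n$(Dec($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$), Dec($\tilde{u}'$, $\tilde{\eta}_1'$, $\tilde{f}'$)) = Dec(Intersect$_n$($\tilde{u}$, $\tilde{u}'$), Intersect$_n$($\tilde{\eta}_1$, $\tilde{\eta}_1'$), Intersect$_n$($\tilde{f}$, $\tilde{f}'$))
Intersect$_n$(Exp($\tilde{u}$, $\tilde{\eta}_1$, $\tilde{f}$, $\tilde{\eta}_2$), Exp($\tilde{u}'$, $\tilde{\eta}_1'$, $\tilde{f}'$, $\tilde{\eta}_2'$)) = Exp(Intersect$_n$($\tilde{u}$, $\tilde{u}'$), Intersect$_n$($\tilde{\eta}_1$, $\tilde{\eta}_1'$), Intersect$_n$($\tilde{f}$, $\tilde{f}'$), Intersect$_n$($\tilde{\eta}_2$, $\tilde{\eta}_2'$))
Intersect$_n$(Ord($\tilde{u}$), Ord($\tilde{u}'$)) = Ord(Intersect$_n$($\tilde{u}$, $\tilde{u}'$))
Intersect$_n$(Word($\tilde{u}$), Word($\tilde{u}'$)) = Word(Intersect$_n$($\tilde{u}$, $\tilde{u}'$))
Intersect$_n$($v_i$, $v_i$) = $v_i$
Intersect$_n$(($\diamond$, $\tilde{\eta}$), ($\diamond'$, $\tilde{\eta}'$)) = (Intersect$_n$($\diamond$, $\diamond'$), Intersect$_n$($\tilde{\eta}$, $\tilde{\eta}'$))
Intersect$_n$(Round($v_i$, $\tilde{r}$), Round($v_i$, $\tilde{r}'$)) = Round($v_i$, Intersect$_n$($\tilde{r}$, $\tilde{r}'$))
Intersect$_n$(Round($v_i$, $\tilde{r}$), Round($v_i$, $n_p$)) = Round($v_i$, Intersect$_n$($\tilde{r}$, $n_p$))
Intersect$_n$(Round($v_i$, $n_p$), Round($v_i$, $\tilde{r}$)) = Round($v_i$, Intersect$_n$($n_p$, $\tilde{r}$))
Intersect$_n$(Round($v_i$, $n_p$), Round($v_i$, $n_p'$)) = Round($v_i$, IntersectPair($n_p$, $n_p'$))
Intersect$_n$(($i_1$, $i_2$, $i_3$), ($i_1'$, $i_2'$, $i_3'$)) = (Intersect$_n$($i_1$, $i_1'$), Intersect$_n$($i_2$, $i_2'$), Intersect$_n$($i_3$, $i_3'$))
Intersect$_n$(($l$, $h$), ($l'$, $h'$)) = (Max($l$, $l'$), Min($h$, $h'$))

Fig. 6. The Intersect$_n$ function for intersecting sets of expressions from language $L_n$. The Intersect$_n$ function returns $\emptyset$ in all other cases not covered above.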
explicitly as this set might become too large for big numbers. We observe that we only need to store the greatest and least divisors amongst them, and then we can intersect two such sets efficiently by computing the gcd of the two corresponding greatest divisors and the max of the two corresponding least divisors.
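The pair intersection of Figure 7 can be sketched in Python for integer inputs. The sketch enumerates the divisors explicitly for readability, so it does not use the compact greatest/least divisor storage described above; the mode labels and the comparison operators ($\geq$ and $\leq$) follow the reconstruction of the figure and should be read as assumptions.

```python
def divisors(k):
    """All positive divisors of k (empty when k is 0)."""
    return [d for d in range(1, k + 1) if k % d == 0]


def intersect_pair(p1, p2):
    """Rounding formats (z, delta, mode) consistent with two (input, output) pairs."""
    (n1, out1), (n2, out2) = p1, p2
    z = out1
    bound = max(abs(n1 - out1), abs(n2 - out2))
    formats = set()
    for delta in divisors(abs(out2 - out1)):
        if delta >= bound:
            if 2 * bound <= delta:
                formats.add((z, delta, "nearest"))
            if n1 > out1 and n2 > out2:
                formats.add((z, delta, "lower"))
            if n1 < out1 and n2 < out2:
                formats.add((z, delta, "upper"))
    return formats


print(sorted(intersect_pair((32, 45), (81, 95))))
```

For the pairs (32, 45) and (81, 95) only the divisors 25 and 50 of 50 pass the bound of 14, which matches the set $\tilde{\delta}$ = {25, 50} derived in the text.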
Ranking: We rank higher the lower value for $\alpha$ in the interval $i_1$ (to prefer fewer trailing zeros and whitespaces), the higher value of $\beta$ in $i_2$ (to minimize unnecessary number truncation), the lower value of $\gamma$ in $i_3$ (to prefer trailing zeros more than trailing whitespaces), and the greatest divisor in the set of divisors $\tilde{\delta}$ of the rounding format (to minimize the length of rounding intervals). We rank expressions consisting of rounding transformations lower than the ones that consist of only number formatting expressions.
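For the number format part, the ranking amounts to reading off one endpoint of each interval. A minimal Python sketch, assuming the `(lo, hi)` pair encoding of intervals and a hypothetical helper name `top_ranked`:

```python
INF = float("inf")


def top_ranked(fmt_set):
    """Pick the preferred (alpha, beta, gamma): lowest alpha, highest beta, lowest gamma."""
    (alpha_lo, _), (_, beta_hi), (gamma_lo, _) = fmt_set
    return (alpha_lo, beta_hi, gamma_lo)


print(top_ranked(((0, 3), (3, INF), (0, 3))))
```

Because each preference concerns a single interval, choosing the top-ranked format takes constant time regardless of how many formats the triple represents.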
Theorem 1 (Correctness of Learning Algorithm for $L_n$).
(a) The procedure GenerateStr$_n$ is sound and complete. The complexity of GenerateStr$_n$ is $O(|s|)$, where $|s|$ denotes the length of the output string.
(b) The procedure Intersect$_n$ is sound and complete.
Example 5. Figure 8 shows a range of number formatting transformations and presents the format strings that are required to be provided in Excel, .Net, Python and C, as well as the format expressions that are synthesized by our algorithm. An N.A. entry denotes that the corresponding formatting task cannot be done in the corresponding language.
Formatting of Doubles

The synthesized format is Dec($u$, $\eta_1$, (".", $\eta_2$)) or Exp($u$, $\eta_1$, (".", $\eta_2$), $\eta_3$).

Input      Output      Excel/C#      Python/C      Synthesized format
String     String      Format String Format String
123.4567   123.46      #.00          .2f           $\eta_1 \equiv ([0, 3], [3, \infty], [0, 3])$
123.4      123.40                                  $\eta_2 \equiv ([2, 2], [2, 2], [0, 0])$
123.4567   123.46      #.##          N.A.          $\eta_1 \equiv ([0, 3], [3, \infty], [0, 3])$
123.4      123.4                                   $\eta_2 \equiv ([0, 1], [2, 2], [0, 1])$
123.4567   123.46      00.00         05.2f         $\eta_1 \equiv ([2, 2], [3, \infty], [0, 0])$
3.4        03.40                                   $\eta_2 \equiv ([2, 2], [2, 2], [0, 0])$
123.4567   123.46      00.##         N.A.          $\eta_1 \equiv ([2, 2], [3, \infty], [0, 0])$
3.4        03.4                                    $\eta_2 \equiv ([0, 1], [2, 2], [0, 1])$
9723.00    9.723E+03   #.### E 00    N.A.          $\eta_1 \equiv ([0, 1], [1, \infty], [0, 1])$
0.823      8.23E-01                                $\eta_2 \equiv ([0, 3], [3, \infty], [0, 3])$
                                                   $\eta_3 \equiv ([2, 2], [2, \infty], [0, 0])$
243        00243       00000         05d           $\eta_1 \equiv ([5, 5], [5, \infty], [0, 0])$
12         00012
1.2        1.2␣        #.??          N.A.          $\eta_1 \equiv ([0, 1], [2, \infty], [0, 1])$
18         18.␣␣                                   $\eta_2 \equiv ([2, 2], [2, \infty], [2, 2])$
1.2        ␣␣1.2␣␣     ???.???       N.A.          $\eta_1 \equiv ([3, 3], [3, \infty], [2, 3])$
18         ␣18.␣␣␣                                 $\eta_2 \equiv ([3, 3], [3, \infty], [3, 3])$
1.2        ␣␣1.20␣     ???.00?       N.A.          $\eta_1 \equiv ([3, 3], [3, \infty], [2, 3])$
18         ␣18.00␣                                 $\eta_2 \equiv ([3, 3], [3, \infty], [1, 1])$

Fig. 8. We compare the custom number format strings required to perform formatting of doubles in Excel/C# and Python/C languages. An N.A. entry in a format string denotes that the corresponding formatting is not possible using format strings only. The last column presents the corresponding $L_n$ expressions (␣ denotes whitespaces).
5 Combining Number Transformations with Syntactic String Transformations
In this section, we present the combination of number transformation language $L_n$ with the syntactic string transformation language $L_s$ [6] to obtain the combined language $L_c$, which can model transformations on strings that contain numbers as substrings. We first present a brief background description of the syntactic string transformation language and then present the combined language $L_c$. We also present an inductive synthesis algorithm for $L_c$ obtained by combining the inductive synthesis algorithms for $L_n$ and $L_s$ respectively.
Syntactic String Transformation Language L_s (Background) Gulwani [6]
introduced an expression language for performing syntactic string
transformations. We reproduce here a small subset of (the rules of) that
language and call it L_s (with e_s being the top-level symbol) as shown in
Figure 9. The formal semantics of L_s can be found in [6]. For completeness,
we briefly describe some key aspects of this language. The top-level
expression e_s is either an atomic expression f or is obtained by
concatenating atomic expressions f_1, ..., f_n using the Concatenate
constructor. Each atomic expression f can either be a constant string
ConstStr(s), an input string variable v_i, or a substring of some input
string v_i. The substring expression SubStr(v_i, p_1, p_2) is defined partly
by two position expressions p_1 and p_2, each of which implicitly refers to
the (subject) string v_i and must evaluate to a position within the string
v_i. (A string with ℓ characters has ℓ + 1 positions, numbered from 0 to ℓ
starting from left.) SubStr(v_i, p_1, p_2) is the substring of string v_i in
between positions p_1 and p_2. A position expression represented by a
non-negative constant k denotes the k-th position in the string. For a
negative constant k, it denotes the (ℓ + 1 + k)-th position in the string,
where ℓ = Length(s). pos(r_1, r_2, c) is another position expression, where
r_1 and r_2 are regular expressions and integer expression c evaluates to a
non-zero integer. pos(r_1, r_2, c) evaluates to a position t in the subject
string s such that r_1 matches some suffix of s[0 : t], and r_2 matches some
prefix of s[t : ℓ], where ℓ = Length(s). Furthermore, if c is positive
(negative), then t is the |c|-th such match starting from the left side
(right side). We use the expression s[t_1 : t_2] to denote the substring of
s between positions t_1 and t_2. We use the notation SubStr2(v_i, r, c) as an
abbreviation to denote the c-th occurrence of regular expression r in v_i,
i.e., SubStr(v_i, pos(ε, r, c), pos(r, ε, c)).

               e_s := Concatenate(f_1, ..., f_n) | f
Atomic expr    f   := ConstStr(s) | v_i | SubStr(v_i, p_1, p_2)
Position       p   := k | pos(r_1, r_2, c)
Integer expr   c   := k | k_1 w + k_2
Regular expr   r   := ε | T | TokenSeq(T_1, ..., T_n)
Fig. 9. The syntax of syntactic string transformation language L_s.
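The position expressions described above can be sketched in a few lines of
Python. This is an illustrative reading of the rules as stated here, not the
implementation from [6]: the function names are ours, and regular expressions
are written as Python regexes (with the empty pattern standing in for the
regular expression that matches the empty string).

```python
import re

# Sketch of constant positions, pos(r1, r2, c) and SubStr, under the rules in
# the text: a string of length l has positions 0..l, and a negative constant k
# denotes position l + 1 + k. All names here are illustrative assumptions.


def resolve_constant(s: str, k: int) -> int:
    """Map a constant position expression k to an index in 0..len(s)."""
    position = k if k >= 0 else len(s) + 1 + k
    if not 0 <= position <= len(s):
        raise ValueError(f"position {k} is outside the string")
    return position


def pos(s: str, r1: str, r2: str, c: int) -> int:
    """|c|-th position t (from the left if c > 0, from the right if c < 0) such
    that r1 matches a suffix of s[0:t] and r2 matches a prefix of s[t:]."""
    matches = [
        t
        for t in range(len(s) + 1)
        if re.search(rf"(?:{r1})\Z", s[:t]) and re.match(r2, s[t:])
    ]
    return matches[c - 1] if c > 0 else matches[c]


def sub_str(s: str, p1: int, p2: int) -> str:
    """Substring of s between two constant positions."""
    return s[resolve_constant(s, p1):resolve_constant(s, p2)]


date = "08/15/2010"
digits = r"[0-9]+"
print(sub_str(date, 0, 2))        # first two characters
print(sub_str(date, -5, -1))      # positions 6 and 10, counted from the right
print(pos(date, "/", digits, 1))  # first position preceded by "/" and followed by digits
print(sub_str(date, pos(date, "/", digits, 1), pos(date, digits, "/", -1)))
```

For the 10-character string above, the constant -5 resolves to position
10 + 1 - 5 = 6 and -1 resolves to position 10, so the second call selects the
year.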
A regular expression r is either ε (which matches the empty string, and
therefore can match at any position of any string), a token T, or a token
sequence TokenSeq(T_1, ..., T_n). The tokens T range over a finite extensible
set and typically correspond to character classes and special characters. For
example, tokens CapitalTok, NumTok, and WordTok match a nonempty sequence of
uppercase alphabetic characters, numeric digits, and alphanumeric characters
respectively.
A Dag-based data structure is used to succinctly represent a set of L_s
expressions. The Dag structure consists of a node corresponding to each
position in the output string s, and a map W that maps an edge between node i
and node j to the set of all L_s expressions that can compute the substring
s[i..j]. This representation enables sharing of common subexpressions amongst
the set of expressions and represents an exponential number of expressions
using polynomial space.
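To make the space claim concrete, the sketch below builds a toy Dag for a
10-character output string in which every edge carries only a constant-string
expression, and counts the concatenation programs it represents. The
dictionary-based encoding of W and the function name count_programs are our
assumptions for illustration only.

```python
# Sketch: a Dag over output positions stores one expression set per edge, yet
# every path from node 0 to node len(s) is a distinct concatenation program.


def count_programs(num_nodes: int, W: dict) -> int:
    """Count start-to-target paths, weighting each edge by its number of expressions."""
    paths = [0] * num_nodes
    paths[0] = 1
    for j in range(1, num_nodes):
        paths[j] = sum(paths[i] * len(W.get((i, j), ())) for i in range(j))
    return paths[-1]


output = "08-15-2010"
n = len(output)

# Toy map W: each edge (i, j) holds a single ConstStr expression for output[i:j].
W = {
    (i, j): {("ConstStr", output[i:j])}
    for i in range(n + 1)
    for j in range(i + 1, n + 1)
}

print(len(W))                    # number of edges stored
print(count_programs(n + 1, W))  # number of programs represented
```

Here 55 edges represent 512 programs (one per way of cutting the output into
consecutive pieces); adding more expressions to an edge multiplies the count
without adding nodes or edges.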
Example 6. An Excel user wanted to modify the delimiter in dates present in a
column from "/" to "-", and gave the following input-output example:
"08/15/2010" → "08-15-2010". An expression in L_s that can perform this
transformation is: Concatenate(f_1, ConstStr("-"), f_2, ConstStr("-"), f_3),
where f_1 ≡ SubStr2(v_1, NumTok, 1), f_2 ≡ SubStr2(v_1, NumTok, 2), and
f_3 ≡ SubStr2(v_1, NumTok, 3). This expression constructs the output string by
concatenating the first, second, and third numbers of the input string with
the constant string "-".
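The expression of Example 6 can be evaluated with a small sketch. We assume
here that NumTok corresponds to the Python regex [0-9]+ and that occurrences
are maximal runs of digits; the helper names sub_str2 and concatenate are
ours.

```python
import re

# Sketch of Example 6. Assumption: NumTok is modelled by the regex below and
# its occurrences are the maximal non-overlapping digit runs found by re.finditer.
NUM_TOK = r"[0-9]+"


def sub_str2(s: str, token_regex: str, c: int) -> str:
    """Return the c-th occurrence of the token in s (1-based; negative c counts from the right)."""
    occurrences = [match.group(0) for match in re.finditer(token_regex, s)]
    return occurrences[c - 1] if c > 0 else occurrences[c]


def concatenate(*parts: str) -> str:
    """Concatenate the strings produced by the atomic expressions."""
    return "".join(parts)


def change_delimiter(v1: str) -> str:
    """Concatenate(f1, ConstStr("-"), f2, ConstStr("-"), f3) from Example 6."""
    f1 = sub_str2(v1, NUM_TOK, 1)
    f2 = sub_str2(v1, NUM_TOK, 2)
    f3 = sub_str2(v1, NUM_TOK, 3)
    return concatenate(f1, "-", f2, "-", f3)


print(change_delimiter("08/15/2010"))
```

The same expression also handles dates with a different number of digits per
field, since it refers to the numbers by occurrence rather than by fixed
positions.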
5.1 The Combination Language L_c
The grammar rules R_c for the combined language L_c are obtained by taking the
union of the rule sets R_n and R_s of the two languages, with the top-level
rule e_s. The modified rules are shown below:

  f := ConstStr(s) | v_i | SubStr(v_i, p_1, p_2) | e_n
  u := g | Round(g, r)
  g := v_i | SubStr(v_i, p_1, p_2)

The combined language consists of an additional expression rule g that
corresponds to either some input column v_i or a substring of some input
column. This expression g is then passed over to the number variable
expression u for performing number transformations on it. This rule enables
the combined language to perform number transformations on substrings of input
strings. The top-level expression of the number language e_n is added to the
atomic expr f of the string language. This enables number transformation
expressions to be present on the Dag edges together with the syntactic string
transformation expressions.
The transformation in Example 1 is represented in L_c as: Concatenate(f_1,
ConstStr("/"), f_2, ConstStr("/"), f_3), where f_1 ≡ Dec(g_1, (2, , 0), ),
g_1 ≡ SubStr(v_1, 1, 7), f_2 ≡ SubStr(v_1, 7, 5), and f_3 ≡ SubStr(v_1, 5, 1).
The transformation in Example 2 is represented as: Concatenate(f_1, ":", f_2),
where f_1 ≡ SubStr2(v_1, NumTok, 2), f_2 ≡ Dec(u_1, (2, , 0), ), and
u_1 ≡ Round(SubStr2(v_1, NumTok, 3), (0, 30, )).
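The shape of the second expression can be illustrated with a sketch that
extracts numbers from a string, rounds one of them, and formats it. The
semantics used below are assumptions made only for illustration: Round is
taken to round down to a multiple of 30 with offset 0, Dec is taken to pad the
integer part to at least two digits, and the input strings are hypothetical.
The precise semantics of Dec and Round are those of L_n, defined earlier in
the paper.

```python
import re

# Sketch of a combined string/number transformation in the style of
# Concatenate(f1, ":", f2). Assumed (not taken from the paper): rounding is
# downward, and Dec only zero-pads the integer part.
NUM_TOK = r"[0-9]+"


def sub_str2(s: str, token_regex: str, c: int) -> str:
    """Return the c-th (1-based) occurrence of the token in s."""
    return [match.group(0) for match in re.finditer(token_regex, s)][c - 1]


def round_down(number_text: str, offset: int, step: int) -> int:
    """Round the number down to the nearest value of the form offset + m * step."""
    number = int(number_text)
    return offset + ((number - offset) // step) * step


def dec_min_int_digits(number: int, min_digits: int) -> str:
    """Format an integer with at least min_digits digits, padding with zeros."""
    return str(number).zfill(min_digits)


def transform(v1: str) -> str:
    f1 = sub_str2(v1, NUM_TOK, 2)                      # second number, copied as a string
    u1 = round_down(sub_str2(v1, NUM_TOK, 3), 0, 30)   # third number, rounded
    f2 = dec_min_int_digits(u1, 2)                     # rounded number, formatted
    return f1 + ":" + f2


print(transform("0d 5h 26m"))
print(transform("0d 4h 57m"))
```

Under these assumptions the two hypothetical inputs produce 5:00 and 4:30. The
point of the sketch is the role of g: the number transformation is applied to
a substring selected by an L_s expression rather than to a whole input column.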
5.2 Data structure for representing a set of expressions in L_c
Let R̃_n and R̃_s denote the set of grammar rules for the data structures that
represent a set of expressions in L_n and L_s respectively. We obtain the
grammar rules R̃_c for succinctly representing a set of expressions of L_c by
taking the union of the two rule sets R̃_n and R̃_s with the updated rules as
shown in Figure 10(a). The updated rules have expected semantics and can be
defined as in Figure 4(b).

(a)
  f̃ := ... | ẽ_n
  ũ := g̃ | Round(g̃, r)
  g̃ := v_i | SubStr(v_i, p̃_1, p̃_2)

(b)
  GenerateStr_c(σ: Inp, s: Out)
    nodes  = {0, ..., Length(s)};
    start  = 0;
    target = Length(s);
    edges  = {(i, j) | 0 ≤ i < j ≤ Length(s)};
    foreach substring s[i..j] of s:
      W[(i, j)] = ConstStr(s[i..j])
                  ∪ GenerateStr_s(σ, s[i..j])
                  ∪ GenerateStr'_n(σ, s[i..j])
    return Dag(nodes, start, target, edges, W);

Fig. 10. (a) The data structure and (b) the GenerateStr_c procedure for L_c
expressions.
5.3 Synthesis Algorithm
Procedure GenerateStr_c: We first make the following two modifications in the
GenerateStr_n procedure to obtain the GenerateStr'_n procedure. The first
modification is that we now search over all substrings of input string
variables v_i instead of just v_i in Line 2 in Figure 5. This lets us model
transformations where number transformations are required to be performed on
substrings of input strings. The second modification is that we replace each
occurrence of v_i by GenerateStr_s(σ, v_i) inside the loop body. This lets us
learn the syntactic string program to extract the corresponding substring from
the input string variables. The GenerateStr_c procedure for the combined
language is shown in Figure 10(b). The procedure first creates a Dag of
(Length(s) + 1) nodes with start node 0 and target node Length(s). The
procedure iterates over all substrings s[i..j] of the output string s, and
collects a constant string expression, a set of substring expressions
(GenerateStr_s) and a set of number transformation expressions
(GenerateStr'_n) that can generate the substring s[i..j] from the input state
σ. These expressions are then added to a map W[(i, j)], where W maps each
edge (i, j) of the Dag to a set of expressions in L_c that can generate the
corresponding substring s[i..j].
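The structure of this procedure can be sketched as follows. The sketch is
heavily simplified and is not the paper's algorithm: the substring learner
only considers constant positions, and the number learner only considers
zero-padding of an integer substring (a hypothetical PadInt expression that
stands in for the expressions of L_n). The tuple encoding of expressions is
also our assumption.

```python
# Simplified sketch of a GenerateStr_c-style procedure: build a Dag over the
# output string and fill W[(i, j)] with expressions that produce s[i:j].


def generate_str_c(inputs: list, s: str) -> dict:
    n = len(s)
    W = {}
    for i in range(n):
        for j in range(i + 1, n + 1):
            target = s[i:j]
            exprs = {("ConstStr", target)}
            for index, v in enumerate(inputs):
                for a in range(len(v)):
                    for b in range(a + 1, len(v) + 1):
                        piece = v[a:b]
                        if piece == target:
                            # Stand-in for the substring learner (constant positions only).
                            exprs.add(("SubStr", index, a, b))
                        elif (
                            piece.isdigit()
                            and target.isdigit()
                            and len(target) > len(piece)
                            and piece.zfill(len(target)) == target
                        ):
                            # Stand-in for the number learner: zero-pad an input substring.
                            exprs.add(("PadInt", ("SubStr", index, a, b), len(target)))
            W[(i, j)] = exprs
    return {"nodes": list(range(n + 1)), "start": 0, "target": n, "W": W}


dag = generate_str_c(["1/5"], "01/05")
print(sorted(dag["W"][(0, 2)]))  # expressions that produce "01"
print(sorted(dag["W"][(3, 5)]))  # expressions that produce "05"
```

Each of the edges (0, 2) and (3, 5) carries both a constant-string expression
and a number expression over a substring of the input, which is the effect of
placing number transformation expressions on Dag edges.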
Procedure Intersect_c: The rules for the Intersect_c procedure for intersecting
sets of expressions in L_c are obtained by taking the union of the intersection
rules of the Intersect_n and Intersect_s procedures together with corresponding
intersection rules for the updated and new rules.
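As an illustration of intersecting two Dag representations, the sketch below
uses a product-style construction in which nodes are pairs of nodes and an
edge survives only if the two expression sets share an expression. This is a
simplification under our own assumptions: expression sets are intersected by
plain set intersection over a hypothetical tuple encoding, whereas the actual
rules intersect the underlying data structures rule by rule.

```python
# Sketch: intersect two Dags (one per input-output example) by pairing edges
# and keeping the expressions common to both.


def intersect_dags(dag1: dict, dag2: dict) -> dict:
    W = {}
    for (i1, j1), exprs1 in dag1["W"].items():
        for (i2, j2), exprs2 in dag2["W"].items():
            common = exprs1 & exprs2
            if common:
                W[((i1, i2), (j1, j2))] = common
    return {
        "start": (dag1["start"], dag2["start"]),
        "target": (dag1["target"], dag2["target"]),
        "W": W,
    }


pad = ("PadInt", ("SubStr", 0, 0, 1), 2)  # hypothetical zero-padding expression
copy = ("SubStr", 0, 0, 1)

# Hand-built Dags for the examples "1" -> "01" and "5" -> "05".
dag_a = {"start": 0, "target": 2, "W": {
    (0, 2): {("ConstStr", "01"), pad},
    (0, 1): {("ConstStr", "0")},
    (1, 2): {("ConstStr", "1"), copy},
}}
dag_b = {"start": 0, "target": 2, "W": {
    (0, 2): {("ConstStr", "05"), pad},
    (0, 1): {("ConstStr", "0")},
    (1, 2): {("ConstStr", "5"), copy},
}}

result = intersect_dags(dag_a, dag_b)
for edge, exprs in sorted(result["W"].items()):
    print(edge, sorted(exprs))
```

Two programs remain consistent with both examples: padding the input to two
digits, and prepending the constant "0" to the input. Choosing between such
survivors is the role of the ranking scheme.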
Ranking: The ranking scheme of the combined language L_c is obtained by
combining the ranking schemes of languages L_n and L_s. In addition, we prefer
substring expressions corresponding to longer input substrings that can be
formatted or rounded to obtain the output number string.
Theorem 2 (Correctness of Learning Algorithm for combined language).
(a) The procedure GenerateStr_c is sound and complete with complexity
O(|s|^3 l^2), where |s| denotes the length of the output string and l denotes
the length of the longest input string.
(b) The procedure Intersect_c is sound and complete.
6 Experiments
We have implemented our algorithms in C# as an add-in to the Microsoft Excel
spreadsheet system. The user provides input-output examples using an Excel
table with a set of input and output columns. Our tool learns the expressions
in L_c for each output column separately and executes the learned set of
expressions on the remaining entries in the input columns to generate their
corresponding outputs. We have evaluated our implementation on over 50
benchmarks obtained from various help forums, mailing lists, books and the
Excel product team. More details about the benchmark problems can be found
in [22].
[Figure 11 plots: (a) Ranking Measure: number of benchmarks (y-axis, 0 to 35)
against number of input-output examples (x-axis, 1 to 3). (b) Performance
Measure: running time in seconds (y-axis, 0 to 3.5) for each benchmark
(x-axis, 1 to 49).]
Fig. 11. (a) Number of examples required and (b) the running time of the
algorithm (in seconds) to learn the desired transformation.
The results of our evaluation are shown in Figure 11. The experiments were
run on an Intel Core-i7 1.87 GHz CPU with 4GB of RAM. We evaluate our
algorithm on the following two dimensions:
Ranking: Figure 11(a) shows the number of input-output examples required
by our tool to learn the desired transformation. All benchmarks required at
most 3 examples, with the majority (76%) taking only 2 examples to learn the
desired transformation. We ran this experiment in an automated counter-example
guided manner: given a set of input-output examples, we learned the
transformation using a subset of the examples (the training set). The tool
iteratively added the failing test examples to the training set until the
synthesized transformation conformed to all the remaining examples.
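The counter-example guided loop used for this measurement can be sketched as
follows. The learner here is a toy stand-in (a fixed, ranked list of three
hypotheses), and adding one failing example per iteration is our assumption;
only the outer loop reflects the experiment described above.

```python
# Sketch of counter-example guided counting of the examples a learner needs.


def examples_needed(learn, examples):
    """Grow the training set with failing examples until the learned program fits all of them."""
    training = [examples[0]]
    while True:
        program = learn(training)
        failing = [
            example
            for example in examples
            if example not in training and program(example[0]) != example[1]
        ]
        if not failing:
            return len(training), program
        training.append(failing[0])


# Toy hypothesis space, listed from most to least preferred (hypothetical ranking).
RANKED_HYPOTHESES = [
    ("identity", lambda text: text),
    ("pad to 2 digits", lambda text: text.zfill(2)),
    ("pad to 5 digits", lambda text: text.zfill(5)),
]


def toy_learn(training):
    """Return the highest-ranked hypothesis consistent with every training example."""
    for _name, program in RANKED_HYPOTHESES:
        if all(program(source) == target for source, target in training):
            return program
    raise ValueError("no consistent hypothesis")


examples = [("12345", "12345"), ("243", "00243"), ("12", "00012")]
count, program = examples_needed(toy_learn, examples)
print(count)
print(program("7"))
```

With the first example alone the toy learner picks the identity; the second
example rules it out, and the resulting program already agrees with the third,
so two examples suffice in this run.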
Performance: The running time of our tool on the benchmarks is shown in
Figure 11(b). Our tool took at most 3.5 seconds to learn the desired
transformation for each benchmark, with the majority (94%) taking less than a
second.
7 Related Work
The closest related work to ours is our previous work on synthesizing syntactic
string transformations [6]. The algorithm presented in that work assumes strings
to be a sequence of characters and can only perform concatenation of input
substrings and constant strings to generate the desired output string. None of
our benchmarks presented in this paper can be synthesized by that algorithm as
it lacks reasoning about the semantics of numbers present in the input string.
There has been a lot of work in the HCI community for automating end-user
tasks. The Topes [20] system lets users create abstractions (called topes) for
different data present in the spreadsheet. It involves defining constraints on
the data to generate a context-free grammar using a GUI, and this grammar is
then used to validate and reformat the data. There are several programming by
demonstration [3] (PBD) systems that have been developed for data validation,
cleaning and formatting, which require the user to specify a complete
demonstration or trace visualization on representative data instead of code.
Such systems include Simultaneous Editing [18] for string manipulation,
SMARTedit [17] for text manipulation and Wrangler [15] for table
transformations. In contrast to these systems, our system is based on
programming by example (PBE): it requires the user to provide only the input
and output examples without providing the intermediate configurations, which
renders our system more usable [16], although at the expense of making the
learning problem harder. Our expression language also learns more
sophisticated transformations involving conditionals. The by-example interface
[7] has also been developed for synthesizing bit-vector algorithms [14],
spreadsheet macros [8] (including semantic string manipulation [21] and table
layout manipulation [12]), and even some intelligent tutoring scenarios (such
as geometry constructions [10] and algebra problems [23]).
Programming by example can be seen as an instantiation of the general
program synthesis problem, where the provided input-output examples constitute
the specification. Program synthesis has been used recently to synthesize
many classes of non-trivial algorithms, e.g. graph algorithms [13],
bit-streaming programs [26, 9], program inverses [27], interactive code
snippets [11, 19], and data-structures [24, 25]. There is a range of
techniques used in these systems including exhaustive search, constraint-based
reasoning, probabilistic inference, type-based search, theorem proving and
version-space algebra. A recent survey [5] explains them in more detail. Lau
et al. used the version-space algebra based technique for learning functions
in a PBD setting [17]; our system uses it for learning expressions in a PBE
setting.
8 Conclusions
We have presented a number transformation language that can model number
formatting and rounding transformations, and an inductive synthesis algorithm
that can learn transformations in this language from a few input-output exam-
ples. We also showed how to combine our system for number transformations
with the one for syntactic string transformations [6] to enable manipulation of
data types that contain numbers as substrings (such as date and time). In addi-
tion to helping end-users who lack programming expertise, we believe that our
system is also useful for programmers since it can provide a consistent number
formatting interface across all programming languages.
References
1. R. E. Bryant. Graph-based algorithms for boolean function manipulation. IEEE
Trans. Computers, 35(8):677-691, 1986.
2. P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for
static analysis of programs by construction or approximation of fixpoints. In
POPL, 1977.
3. A. Cypher, editor. Watch What I Do: Programming by Demonstration. MIT
Press, 1993.
4. M. Gualtieri. Deputize end-user developers to deliver business agility and reduce
costs. In Forrester Report for Application Development and Program Management
Professionals, April 2009.
5. S. Gulwani. Dimensions in program synthesis. In PPDP, 2010.
6. S. Gulwani. Automating string processing in spreadsheets using input-output
examples. In POPL, 2011.
7. S. Gulwani. Synthesis from examples. WAMBSE (Workshop on Advances in
Model-Based Software Engineering) Special Issue, Infosys Labs Briefings, 10(2),
2012.
8. S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using
examples. In Communications of the ACM, 2012. To appear.
9. S. Gulwani, S. Jha, A. Tiwari, and R. Venkatesan. Synthesis of loop-free programs.
In PLDI, 2011.
10. S. Gulwani, V. A. Korthikanti, and A. Tiwari. Synthesizing geometry
constructions. In PLDI, pages 50-61, 2011.
11. T. Gvero, V. Kuncak, and R. Piskac. Interactive synthesis of code snippets. In
CAV, pages 418-423, 2011.
12. W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples.
In PLDI, pages 317-328, 2011.
13. S. Itzhaky, S. Gulwani, N. Immerman, and M. Sagiv. A simple inductive synthesis
methodology and its applications. In OOPSLA, 2010.
14. S. Jha, S. Gulwani, S. Seshia, and A. Tiwari. Oracle-guided component-based
program synthesis. In ICSE, 2010.
15. S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual
specification of data transformation scripts. In CHI, 2011.
16. T. Lau. Why PBD systems fail: Lessons learned for usable AI. In CHI Workshop
on Usable AI, 2008.
17. T. Lau, S. Wolfman, P. Domingos, and D. Weld. Programming by demonstration
using version space algebra. Machine Learning, 53(1-2):111-156, 2003.
18. R. C. Miller and B. A. Myers. Interactive simultaneous editing of multiple text
regions. In USENIX Annual Technical Conference, 2001.
19. D. Perelman, S. Gulwani, T. Ball, and D. Grossman. Type-directed completion of
partial expressions. In PLDI, 2012.
20. C. Scaffidi, B. A. Myers, and M. Shaw. Topes: reusable abstractions for validating
data. In ICSE, pages 1-10, 2008.
21. R. Singh and S. Gulwani. Learning semantic string transformations from examples.
PVLDB, 5, 2012. To appear.
22. R. Singh and S. Gulwani. Synthesizing number transformations from input-output
examples. Technical Report MSR-TR-2012-42, Apr 2012.
23. R. Singh, S. Gulwani, and S. Rajamani. Automatically generating algebra
problems. In AAAI, 2012. To appear.
24. R. Singh and A. Solar-Lezama. Synthesizing data structure manipulations from
storyboards. In SIGSOFT FSE, pages 289-299, 2011.
25. A. Solar-Lezama, C. G. Jones, and R. Bodík. Sketching concurrent data structures.
In PLDI, pages 136-148, 2008.
26. A. Solar-Lezama, R. M. Rabbah, R. Bodík, and K. Ebcioglu. Programming by
sketching for bit-streaming programs. In PLDI, pages 281-294, 2005.
27. S. Srivastava, S. Gulwani, S. Chaudhuri, and J. S. Foster. Path-based inductive
synthesis for program inversion. In PLDI, 2011.