US011360740B1

(12) United States Patent
     Kent et al.

(10) Patent No.: US 11,360,740 B1
(45) Date of Patent: Jun. 14, 2022

(54) SINGLE-STAGE HARDWARE SORTING BLOCKS AND ASSOCIATED MULTIWAY MERGE SORTING NETWORKS
(71) Applicant: UNM Rainforest Innovations, Albuquerque, NM (US)
(72) Inventors: Robert Bernard Kent, Albuquerque, NM (US); Marios Stephanou Pattichis, Albuquerque, NM (US)
(*)  Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.
(21) Appl. No.: 17/190,843
(22) Filed: Mar. 3, 2021

     Related U.S. Application Data
(60) Provisional application No. 62/984,880, filed on Mar. 4, 2020.

(51) Int. Cl.
     G06F 7/16    (2006.01)
     G06F 7/02    (2006.01)
     G06F 7/544   (2006.01)
(52) U.S. Cl.
     CPC G06F 7/16 (2013.01); G06F 7/026 (2013.01); G06F 7/5443 (2013.01)
(58) Field of Classification Search
     CPC G06F 7/16; G06F 7/026
     See application file for complete search history.

(56) References Cited
     U.S. PATENT DOCUMENTS
     4,441,165 A *       4/1984   Coleman
     4,628,483 A         12/1986  Nelson
     7,281,009 B2 *      10/2007  Adas
     10,523,596 B1 *     12/2019  Ferger
     2008/0104374 A1     5/2008   Mohamed
     Classifications listed for these references (row alignment lost): G06F 7/026; 382/218; G06F 7/22; H04L 49/30
     * cited by examiner

Primary Examiner: Chuong D Ngo
(74) Attorney, Agent, or Firm: Valauskas Corder LLC

(57) ABSTRACT
A system and methods for designing single-stage hardware sorting blocks, and further using the single-stage hardware sorting blocks to reduce the number of stages in multistage sorting processes, or to define multiway merge sorting networks.

15 Claims, 40 Drawing Sheets

[Front-page figure: block diagram 100 of a single-stage N-sorter.
  In_X input port values
  120  Comparison Signals Block: create N*(N-1)/2 input comparison signals
  140  Output MUX Select Line Signals Block: for each Out_Y, create N-1 In_X_goes_to_Out_Y multiplexer select line signals
  160  Output MUX Block: each Out_Y multiplexer assignment contains N input data values and (N-1) multiplexer select line signals
  Out_Y output port values]

U.S.
Patent    Jun. 14, 2022    Sheet 1 of 40    US 11,360,740 B1

[Block diagram of a 2-sorter: a Comparison Block performs one 2-value comparison (signal ge_1_0) and drives an Output MUX Block of 2-to-1 per-bit multiplexers.]
FIG. 1 (PRIOR ART)

[Block diagram 100 of a single-stage N-sorter.
  In_X input port values
  120  Comparison Signals Block: create N*(N-1)/2 input comparison signals
  140  Output MUX Select Line Signals Block: for each Out_Y, create N-1 In_X_goes_to_Out_Y multiplexer select line signals
  160  Output MUX Block: each Out_Y multiplexer assignment contains N input data values and (N-1) multiplexer select line signals
  Out_Y output port values]
FIG. 2

Sheet 2 of 40

    module sort_9_values_8_bits #(parameter MAX_BIT_INDEX = 7) (
        input  [MAX_BIT_INDEX:0] In_8,
        input  [MAX_BIT_INDEX:0] In_7,
        input  [MAX_BIT_INDEX:0] In_6,
        input  [MAX_BIT_INDEX:0] In_5,
        input  [MAX_BIT_INDEX:0] In_4,
        input  [MAX_BIT_INDEX:0] In_3,
        input  [MAX_BIT_INDEX:0] In_2,
        input  [MAX_BIT_INDEX:0] In_1,
        input  [MAX_BIT_INDEX:0] In_0,
        output [MAX_BIT_INDEX:0] Out_8,  // the max value
        output [MAX_BIT_INDEX:0] Out_7,
        output [MAX_BIT_INDEX:0] Out_6,
        output [MAX_BIT_INDEX:0] Out_5,
        output [MAX_BIT_INDEX:0] Out_4,
        output [MAX_BIT_INDEX:0] Out_3,
        output [MAX_BIT_INDEX:0] Out_2,
        output [MAX_BIT_INDEX:0] Out_1,
        output [MAX_BIT_INDEX:0] Out_0   // the min value
    );
FIG. 3

Sheet 3 of 40

[Flowchart 200.
  Start
  202  Applying to input ports a list of N unsorted data input values
  204  Using a comparison operator to generate result signals
  206  Enforcing an order
  208  Providing a set of output multiplexers
  210  Assigning, in parallel, each of the N data input values to an output port
  212  Outputting to output ports a sorted list of values
  Stop]
FIG. 4
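The flow of FIG. 4 can be illustrated with a small software model. The Python sketch below is a behavioral approximation only, not the hardware description in the figures: the function name `single_stage_sort` is hypothetical, and the tie-breaking rule (on equal values the higher-indexed input counts as the winner, as a `>=` comparison with the higher index on the left implies) is an assumption about how step 206 enforces an order.

```python
from itertools import combinations


def single_stage_sort(values):
    """Software model of a single-stage N-sorter (assumed reading of FIG. 4).

    Returns (sorted_values, number_of_comparison_signals), where
    sorted_values[0] models Out_0 (min) and sorted_values[-1] models
    Out_{N-1} (max).
    """
    n = len(values)

    # Step 204: one comparison signal per unordered pair of inputs.
    # ge[(i, j)] with i > j models the wire ge_i_j = (In_i >= In_j).
    ge = {(i, j): values[i] >= values[j] for j, i in combinations(range(n), 2)}

    out = [None] * n
    for x in range(n):
        # Steps 206-210: count the inputs that In_x "beats".
        # If x is on the left of the comparison, it wins when ge is true;
        # if x is on the right, it wins when ge is false (inverted state).
        wins = sum(
            (ge[(x, y)] if x > y else (not ge[(y, x)]))
            for y in range(n)
            if y != x
        )
        # The win count is the output port index: In_x goes to Out_wins.
        out[wins] = values[x]

    # Step 212: out now holds the sorted list.
    return out, len(ge)
```

For nine inputs the model uses 36 comparison signals, the count given for the 9-sorter in FIG. 5. In hardware every win count is resolved in parallel by combinational logic; the loop here only stands in for that parallelism.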
Sheet 4 of 40

    // The comparison signals; "ge" for ">=", "greater than or equal"

    // 36 comparisons for 9-sorter
    wire ge_8_7 = (In_8 >= In_7);
    wire ge_8_6 = (In_8 >= In_6);
    wire ge_8_5 = (In_8 >= In_5);
    wire ge_8_4 = (In_8 >= In_4);
    wire ge_8_3 = (In_8 >= In_3);
    wire ge_8_2 = (In_8 >= In_2);
    wire ge_8_1 = (In_8 >= In_1);
    wire ge_8_0 = (In_8 >= In_0);

    // 28 comparisons for 8-sorter
    wire ge_7_6 = (In_7 >= In_6);
    wire ge_7_5 = (In_7 >= In_5);
    wire ge_7_4 = (In_7 >= In_4);
    wire ge_7_3 = (In_7 >= In_3);
    wire ge_7_2 = (In_7 >= In_2);
    wire ge_7_1 = (In_7 >= In_1);
    wire ge_7_0 = (In_7 >= In_0);

    // 21 comparisons for 7-sorter
    wire ge_6_5 = (In_6 >= In_5);
    wire ge_6_4 = (In_6 >= In_4);
    wire ge_6_3 = (In_6 >= In_3);
    wire ge_6_2 = (In_6 >= In_2);
    wire ge_6_1 = (In_6 >= In_1);
    wire ge_6_0 = (In_6 >= In_0);

    // 15 comparisons for 6-sorter
    wire ge_5_4 = (In_5 >= In_4);
    wire ge_5_3 = (In_5 >= In_3);
    wire ge_5_2 = (In_5 >= In_2);
    wire ge_5_1 = (In_5 >= In_1);
    wire ge_5_0 = (In_5 >= In_0);

    // 10 comparisons for 5-sorter
    wire ge_4_3 = (In_4 >= In_3);
    wire ge_4_2 = (In_4 >= In_2);
    wire ge_4_1 = (In_4 >= In_1);
    wire ge_4_0 = (In_4 >= In_0);

    // 6 comparisons for 4-sorter
    wire ge_3_2 = (In_3 >= In_2);
    wire ge_3_1 = (In_3 >= In_1);
    wire ge_3_0 = (In_3 >= In_0);

    // 3 comparisons for 3-sorter
    wire ge_2_1 = (In_2 >= In_1);
    wire ge_2_0 = (In_2 >= In_0);

    // 1 comparison for 2-sorter
    wire ge_1_0 = (In_1 >= In_0);
FIG. 5

Sheet 5 of 40

    assign Out_2 = (In_8_goes_to_Out_2 ? In_8 :
                   (In_7_goes_to_Out_2 ? In_7 :
                   (In_6_goes_to_Out_2 ? In_6 :
                   (In_5_goes_to_Out_2 ? In_5 :
                   (In_4_goes_to_Out_2 ? In_4 :
                   (In_3_goes_to_Out_2 ? In_3 :
                   (In_2_goes_to_Out_2 ? In_2 :
                   (In_1_goes_to_Out_2 ? In_1 : In_0))))))));
FIG. 6

Sheet 6 of 40

[Flowchart 300, first part.
  Start
  302  Creating, for each of the N data inputs, all 2^(N-1) possible product terms
  303  Select product term
  304  Is the data input on the left side of the operator AND is the signal state non-inverted? YES: go to 308. NO: go to 306.
  306  Is the data input on the right side of the operator AND is the signal state inverted? YES: go to 308. NO: go to 309.
  308  Assign a "win"
  309  Is this the last comparison signal state in the product term?]
[Flowchart 300, remainder. On YES from decision 309:
  310  Sum the "wins"
  314  Determine output port assignment
  312  Adding to SOP equation
  Stop
  On NO from decision 309, the flow returns to the next comparison signal state.]
FIG. 7

Sheet 7 of 40

[Listing of the product terms of the In_5_goes_to_Out_5 select line signal for the 9-sorter, written as a sum of products over the comparison signals ge_8_5, ge_7_5, ge_6_5, ge_5_4, ge_5_3, ge_5_2, ge_5_1 and ge_5_0. The individual terms are not legible in this extraction.]
FIG. 8

Sheet 8 of 40

    module sort_3_values_8_bits #(parameter MAX_BIT_INDEX = 7) (
        input  [MAX_BIT_INDEX:0] In_2,
        input  [MAX_BIT_INDEX:0] In_1,
        input  [MAX_BIT_INDEX:0] In_0,
        output [MAX_BIT_INDEX:0] Out_2,  // the max value
        output [MAX_BIT_INDEX:0] Out_1,
        output [MAX_BIT_INDEX:0] Out_0   // the min value
    );

    // The comparison signals
    // 3 comparisons for 3-sorter
    wire ge_2_1 = (In_2 >= In_1);
    wire ge_2_0 = (In_2 >= In_0);
    // 1 comparison for 2-sorter
    wire ge_1_0 = (In_1 >= In_0);

    // In_X_goes_to_Out_Y multiplexer select line signals
    [Six wire declarations, In_2_goes_to_Out_Y and In_1_goes_to_Out_Y for Y = 2, 1, 0; their product-term expressions are not legible in this extraction.]

    // The output port multiplexer assignments
    assign Out_2 = (In_2_goes_to_Out_2 ? In_2 : (In_1_goes_to_Out_2 ? In_1 : In_0));
    assign Out_1 = (In_2_goes_to_Out_1 ? In_2 : (In_1_goes_to_Out_1 ? In_1 : In_0));
    assign Out_0 = (In_2_goes_to_Out_0 ? In_2 : (In_1_goes_to_Out_0 ? In_1 : In_0));

    endmodule // sort_3_values_8_bits
FIG. 9

Sheet 9 of 40

[Block diagram 400: propagation delay of the sorter embodiments, expressed in logic blocks in series.
  410   In_X input port values, feeding the Comparison Signals Block (create N*(N-1)/2 input comparison signals)
  420A  MUX Select Line Signals Block: In_X_goes_to_Out_Y signals
  420B  MUX Select Line Signals Block: OR-equation combinations of In_X_goes_to_Out_Y signals (In_Xa_OR_Xb_goes_to_Out_Y)
  440   Output MUX Block: one multiplexer per output value bit; number of multiplexers = N * bits per value
  2 logic blocks in series: 1 2-sorter equivalent propagation delay
  3 logic blocks in series: 1.5 2-sorter equivalent propagation delay
  4 logic blocks in series: 2 2-sorter equivalent propagation delay]
FIG. 10

[Two tables indexed by sorter size. The cell values are not recoverable from this extraction; the column headings are:
  General Hardware Sorter Embodiment: sorter; data inputs (and outputs, as a full-list sorter); comparisons; In_X_goes_to_Out_Y select inputs for each output mux; comparisons in each In_X_goes_to_Out_Y product term; inputs per output bit multiplexer.
  Hardware Sorter Embodiment Utilizing a Design Block With Four or Eight 6-Input LUTs: 6-input LUTs per In_X_goes_to_Out_Y product terms; logic block stages in series; sorter propagation delay (2-sorter equivalent); LUTs per output bit mux.]
FIG. 11
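The signal counts named in FIGS. 2, 5, 7 and 10 follow simple closed forms, tabulated by the sketch below. The helper names are hypothetical, and reading the product-term count in FIG. 7 as 2^(N-1) per data input is an assumption (each input takes part in N-1 comparisons, and each comparison signal can appear inverted or non-inverted). The delay conversion assumes the ratio shown in FIG. 10, where two logic blocks in series equal one 2-sorter equivalent delay.

```python
def n_sorter_signal_counts(n):
    """Closed-form signal counts for a single-stage N-sorter."""
    return {
        # FIG. 2 / FIG. 5: one comparison per unordered pair of inputs.
        "comparison_signals": n * (n - 1) // 2,
        # FIG. 2: each Out_Y multiplexer uses N-1 select line signals.
        "select_lines_per_output": n - 1,
        # Assumed reading of FIG. 7: 2^(N-1) candidate product terms per input.
        "product_terms_per_input": 2 ** (n - 1),
    }


def sorter_equivalent_delay(logic_blocks_in_series):
    """Convert logic blocks in series to 2-sorter equivalent delay (FIG. 10)."""
    return logic_blocks_in_series / 2


if __name__ == "__main__":
    for n in range(2, 10):
        print(n, n_sorter_signal_counts(n))
```

The comparison counts for N = 2 through 9 come out as 1, 3, 6, 10, 15, 21, 28 and 36, matching the comments in FIG. 5.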
Sheet 10 of 40

[Table comparing full sorters (2-sorter through 9-sorter) with N-max filters (2-max through 9-max). Rows: slice blocks in series; 2-sorter propagation delay; 2-sorter LUT resources; physical LUTs. The cell values are not recoverable from this extraction.]
The 5-median filter has the same propagation delay and hardware resource usage as the 9-max filter.
All LUT resource usage numbers assume that the data values are 8-bit unsigned numbers.
FIG. 12

Sheet 11 of 40

    // 4-sorter general hardware design Out_2 equation has 7 inputs; won't fit in a 6-input LUT
    // Combine three In_X_goes_to_Out_2 signals into two signals,
    // In_3_OR_In_2_goes_to_Out_2 and In_3_OR_In_1_goes_to_Out_2

    // assign Out_2 = (In_3_goes_to_Out_2 ? In_3 :
    //                (In_2_goes_to_Out_2 ? In_2 :
    //                (In_1_goes_to_Out_2 ? In_1 : In_0)));

    [Truth tables: General Hardware Design versus 4-to-1 LUT Multiplexer Design for the goes_to_Out_2 signals; the entries are not legible in this extraction.]

    // Code implementation for the 4-to-1 multiplexers using 2 mux select lines
    wire In_3_OR_In_2_goes_to_Out_2 = (In_3_goes_to_Out_2 || In_2_goes_to_Out_2);
    wire In_3_OR_In_1_goes_to_Out_2 = (In_3_goes_to_Out_2 || In_1_goes_to_Out_2);
    wire [1:0] mux_selects_Out_2 = {In_3_OR_In_2_goes_to_Out_2, In_3_OR_In_1_goes_to_Out_2};

    assign Out_2 = (mux_selects_Out_2 == 2'b11) ? In_3 :
                   (mux_selects_Out_2 == 2'b10) ? In_2 :
                   (mux_selects_Out_2 == 2'b01) ? In_1 : In_0;
FIG. 13

Sheet 12 of 40

    // 5-sorter general hardware design Out_2 equation has 9 inputs,
    // 5 data inputs and 4 mux select line inputs. Commented out below.
    // Requires 2 LUTs, plus their common MUX, for each output bit multiplexer
    // assign Out_2 = (In_4_goes_to_Out_2 ? In_4 :
    //                (In_3_goes_to_Out_2 ? In_3 :
    //                (In_2_goes_to_Out_2 ? In_2 :
    //                (In_1_goes_to_Out_2 ? In_1 : In_0))));

    // LUT B assignment; 3-to-1 multiplexer, as in a 3-sorter
    wire [MAX_BIT_INDEX:0] LUT_B_Out_2_data_from_In_2_OR_1_OR_0 =
        (In_2_goes_to_Out_2 ? In_2 : (In_1_goes_to_Out_2 ? In_1 : In_0));

    // LUT A assignment; 2-to-1 multiplexer, as in a 2-sorter
    wire [MAX_BIT_INDEX:0] LUT_A_Out_2_data_from_In_4_OR_3 =
        (In_4_goes_to_Out_2 ? In_4 : In_3);

    // MUXF7_AB assignment; 2-to-1 bit multiplexers; inputs are LUT A and LUT B outputs
    genvar bit_index;
    generate
        for (bit_index = 0; bit_index <= MAX_BIT_INDEX; bit_index = bit_index + 1)
        begin
            MUXF7 muxF7_AB_Out_2_inst (
                .O  (Out_2[bit_index]),                                 // 1-bit data output
                .I0 (LUT_B_Out_2_data_from_In_2_OR_1_OR_0[bit_index]),  // 1-bit data input
                .I1 (LUT_A_Out_2_data_from_In_4_OR_3[bit_index]),       // 1-bit data input
                .S  (In_4_OR_3_goes_to_Out_2)                           // 1-bit select input
            );
        end
    endgenerate
FIG. 14

Sheet 13 of 40

    // 5-sorter In_4_OR_3_goes_to_Out_2 product terms have 7 separate comparison signals
    // Can't fit in a single 6-input LUT
    // Use 2 LUTs plus their common MUXF7 to create a 7-input LUT

    [Sum-of-products expression for In_4_OR_3_goes_to_Out_2 over the ge_4_x and ge_3_x comparison signals; not legible in this extraction.]

    // There are now a total of 6 separate comparison signals in the LUT product terms
    // The common signal removed will become the 2-to-1 mux select signal

    [Two LUT wire declarations for the two halves of In_4_OR_3_goes_to_Out_2; not legible in this extraction.]

    // 2-to-1 multiplexer to finally create the In_4_OR_3_goes_to_Out_2 signal
    [MUXF7 instantiation with the two LUT outputs on I0 and I1 and the removed common comparison signal on S; port connections not legible in this extraction.]
FIG. 15
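The restructuring in FIG. 14 replaces one wide priority multiplexer with two smaller ones plus a final 2-to-1 stage. The Python sketch below checks, for a single output port of a 5-sorter, that the split form selects the same input as the flat form. It is a behavioral model only; the function names are hypothetical, and `goes[x]` stands in for the In_x_goes_to_Out_2 select line.

```python
from itertools import product


def flat_mux(inputs, goes):
    """Flat priority multiplexer: the commented-out Out_2 equation in FIG. 14."""
    for x in (4, 3, 2, 1):
        if goes[x]:
            return inputs[x]
    return inputs[0]


def split_mux(inputs, goes):
    """Two LUT-sized multiplexers combined by a 2-to-1 mux (the MUXF7 role)."""
    # LUT B: 3-to-1 multiplexer over In_2, In_1, In_0, as in a 3-sorter.
    lut_b = inputs[2] if goes[2] else (inputs[1] if goes[1] else inputs[0])
    # LUT A: 2-to-1 multiplexer over In_4, In_3, as in a 2-sorter.
    lut_a = inputs[4] if goes[4] else inputs[3]
    # Select line In_4_OR_3_goes_to_Out_2 picks LUT A (I1) over LUT B (I0).
    in_4_or_3_goes = goes[4] or goes[3]
    return lut_a if in_4_or_3_goes else lut_b


def forms_agree():
    """Exhaustively compare both forms over every select-line combination."""
    inputs = [10, 11, 12, 13, 14]  # arbitrary distinct stand-ins for In_0..In_4
    for bits in product([False, True], repeat=4):
        goes = {x: bits[x - 1] for x in (1, 2, 3, 4)}
        if flat_mux(inputs, goes) != split_mux(inputs, goes):
            return False
    return True
```

In the sorter only one select line is active per output, but the two forms agree for every combination, so the split changes the logic depth and LUT fit without changing the function.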
Sheet 14 of 40

    // Each of these select line signals is created in the MUX Select Lines Blocks
    // They are all "OR" functions of several In_X_goes_to_Out_Y signals
    // The signals for all associated sorters propagate through 4 logic blocks

    // 6-sorter 2-to-1 select line signal for MUXF7 in Output MUX Block
    wire In_5_OR_4_OR_3_goes_to_Out_2 =
        (In_5_goes_to_Out_2 || In_4_goes_to_Out_2 || In_3_goes_to_Out_2);

    // 7-sorter 2-to-1 select line signal for MUXF7 in Output MUX Block
    wire In_6_OR_5_OR_4_goes_to_Out_2 =
        (In_6_goes_to_Out_2 || In_5_goes_to_Out_2 || In_4_goes_to_Out_2);

    // 8-sorter 2-to-1 select line signal for MUXF7 in Output MUX Block
    wire In_7_OR_6_OR_5_OR_4_goes_to_Out_2 =
        (In_7_goes_to_Out_2 || In_6_goes_to_Out_2 || In_5_goes_to_Out_2 || In_4_goes_to_Out_2);

    // 9-sorter 2-to-1 select line signal for MUXF7 in Output MUX Block
    wire In_5_OR_4_OR_3_goes_to_Out_2 =
        (In_5_goes_to_Out_2 || In_4_goes_to_Out_2 || In_3_goes_to_Out_2);

    // 9-sorter 2-to-1 select line signal for MUXF8 in Output MUX Block
    wire In_8_OR_7_OR_6_goes_to_Out_2 =
        (In_8_goes_to_Out_2 || In_7_goes_to_Out_2 || In_6_goes_to_Out_2);
FIG. 16

Sheet 15 of 40

[Wire declarations for the four LUT product-term signals of In_5_goes_to_Out_5 in the 9-sorter, each a sum of products over the ge_x_5 and ge_5_x comparison signals. The expressions are not legible in this extraction.]
FIG. 17

Sheet 16 of 40

    // 2-to-1 mux: two LUTs with ge_8_5 == 1; ge_7_5 is the mux select line
    wire MUX_AB_In_5_to_Out_5_ge_8_5_1;
    MUXF7 muxF7_In_5_to_Out_5_ge_8_5_1_inst (
        .O  (MUX_AB_In_5_to_Out_5_ge_8_5_1),                 // 1-bit data output
        .I0 (LUT_B_In_5_goes_to_Out_5_ge_8_5_1_ge_7_5_0),    // 1-bit data input
        .I1 (LUT_A_In_5_goes_to_Out_5_ge_8_5_1_ge_7_5_1),    // 1-bit data input
        .S  (ge_7_5)                                         // 1-bit select input
    );

    // 2-to-1 mux: two LUTs with ge_8_5 == 0; ge_7_5 is the mux select line
    wire MUX_CD_In_5_to_Out_5_ge_8_5_0;
    MUXF7 muxF7_In_5_to_Out_5_ge_8_5_0_inst (
        .O  (MUX_CD_In_5_to_Out_5_ge_8_5_0),                 // 1-bit data output
        .I0 (LUT_D_In_5_goes_to_Out_5_ge_8_5_0_ge_7_5_0),    // 1-bit data input
        .I1 (LUT_C_In_5_goes_to_Out_5_ge_8_5_0_ge_7_5_1),    // 1-bit data input
        .S  (ge_7_5)                                         // 1-bit select input
    );

    // 2-to-1 mux: combine the outputs of the 2 muxes above using ge_8_5 as the mux select line
    wire In_5_to_Out_5;
    MUXF8 muxF8_In_5_to_Out_5_inst (
        .O  (In_5_to_Out_5),                                 // 1-bit data output
        .I0 (MUX_CD_In_5_to_Out_5_ge_8_5_0),                 // 1-bit data input
        .I1 (MUX_AB_In_5_to_Out_5_ge_8_5_1),                 // 1-bit data input
        .S  (ge_8_5)                                         // 1-bit select input
    );
FIG. 18

Sheet 17 of 40

    // 9-sorter Out_2 output multiplexer LUTs: behavioral code
    wire [MAX_BIT_INDEX:0] LUT_B_Out_2_from_In_2_OR_1_OR_0 =
        (In_2_goes_to_Out_2 ? In_2 : (In_1_goes_to_Out_2 ? In_1 : In_0));
    wire [MAX_BIT_INDEX:0] LUT_A_Out_2_from_In_5_OR_4_OR_3 =
        (In_5_goes_to_Out_2 ? In_5 : (In_4_goes_to_Out_2 ? In_4 : In_3));
    wire [MAX_BIT_INDEX:0] LUT_C_Out_2_from_In_8_OR_7_OR_6 =
        (In_8_goes_to_Out_2 ? In_8 : (In_7_goes_to_Out_2 ? In_7 : In_6));

    // 9-sorter Out_2 output bit MUXF*s: structural code in a generate block
    wire [MAX_BIT_INDEX:0] MUXF7_AB_Out_2;
    wire [MAX_BIT_INDEX:0] MUXF7_CD_Out_2;

    genvar bit_index;
    generate
        for (bit_index = 0; bit_index <= MAX_BIT_INDEX; bit_index = bit_index + 1)
        begin
            MUXF7 muxF7_AB_Out_2_inst (
                .O  (MUXF7_AB_Out_2[bit_index]),                   // 1-bit data output
                .I0 (LUT_B_Out_2_from_In_2_OR_1_OR_0[bit_index]),  // 1-bit data input
                .I1 (LUT_A_Out_2_from_In_5_OR_4_OR_3[bit_index]),  // 1-bit data input
                .S  (In_5_OR_4_OR_3_goes_to_Out_2)                 // 1-bit select input
            );
            MUXF7 muxF7_CD_Out_2_inst (
                .O  (MUXF7_CD_Out_2[bit_index]),                   // 1-bit data output
                .I0 (1'b0),                                        // 1-bit data input
                .I1 (LUT_C_Out_2_from_In_8_OR_7_OR_6[bit_index]),  // 1-bit data input
                .S  (1'b1)                                         // 1-bit select input
            );
            MUXF8 muxF8_Out_2_inst (
                .O  (Out_2[bit_index]),                            // 1-bit data output
                .I0 (MUXF7_AB_Out_2[bit_index]),                   // 1-bit data input
                .I1 (MUXF7_CD_Out_2[bit_index]),                   // 1-bit data input
                .S  (In_8_OR_7_OR_6_goes_to_Out_2)                 // 1-bit select input
            );
        end // for (bit_index = ...
    endgenerate
FIG. 19

Sheet 18 of 40

    // The behavioral code for these 8 LUT equations needs no modification
    [Eight wire [MAX_BIT_INDEX:0] declarations, LUT_A through LUT_H, each a nested conditional that selects one data input for Out_2 from comparison signals; the expressions are not legible in this extraction.]

    // The behavioral code below needs to be replaced with separate wire declarations,
    // a generate block, and structural instantiations of MUXF7/8/9 primitives
    [Behavioral wire declarations combining the LUT outputs in pairs, beginning with MUXF7_AB_Out_2; the expressions are not legible in this extraction.]
    [Remaining behavioral declarations: MUXF7_CD_Out_2, MUXF7_EF_Out_2 and MUXF7_GH_Out_2, then MUXF8_ABCD_Out_2 and MUXF8_EFGH_Out_2, followed by a final assign for Out_2 that selects between MUXF8_ABCD_Out_2 and MUXF8_EFGH_Out_2. The select expressions are not legible in this extraction.]
FIG. 20

Sheet 19 of 40

    // SV pseudocode. From previous examples, one skilled in the art can implement:
    //   "wire" declarations as needed
    //   "assign" statements as needed
    //   a generate block as needed
    //   MUXF* structural instantiations as needed

    // In_X_goes_to_Out_Y and ge_X_Y signals always have a bit width of 1
    // These output bit multiplexer signals all have a bit width of (MAX_BIT_INDEX + 1)
    // The In_X_goes_to_Out_Y signals and states can be put into a diagram for readability

    [Two pseudocode variants for an output multiplexer built from two LUT outputs (LUT_A, LUT_B) and a MUXF7 (MUXF7_AB), with the accompanying state diagram. The expressions and diagram entries are not legible in this extraction.]
FIG. 21

Sheet 20 of 40

    // SV pseudocode
    // From previous examples, one skilled in the art can implement:
    //   "wire" declarations as needed
    //   "assign" statements as needed
    //   a generate block as needed
    //   MUXF* structural instantiations as needed

    // 5-sorter
    // For readability, one signal state is shown as an "E" state in the following table
    // One comparison signal will be the MUXF8 select line signal
    // Another will be the select line signal for the two MUXF7s

    [State table and pseudocode for Out_4 of the 5-sorter: four LUT outputs (LUT_A through LUT_D) combined by MUXF7_AB_Out_4 and MUXF7_CD_Out_4, then by MUXF8_ABCD_Out_4. The expressions and the names of the select signals are not legible in this extraction.]
FIG. 22

Sheet 21 of 40

    // SV pseudocode
    [Pseudocode for a further sorter example, including the In_5_goes_to_Out_7 select line signal, a select line for MUXF7, a select line for MUXF8, and the Output MUX Block signals with bit width (MAX_BIT_INDEX + 1): LUT outputs combined by MUXF7_AB_Out_7 and MUXF7_CD_Out_7 into MUXF8_ABCD_Out_7. The expressions are not legible in this extraction.]
FIG. 23

Sheet 22 of 40

[Flowchart 350.
  Start
  352  Providing hardware N-sorter
  354  Removing unused output and related logic
  356  Creating a single-stage hardware N-to-M filter
  Stop]
FIG. 24

Sheet 23 of 40

[Diagram of a Sequence q rectangle with rows 7 to 0 and columns 3 to 0, holding the values 1 to 32. The cell layout is not recoverable from this extraction. Annotations:
  q is the sequence number
  Nrows_q = 8
  N_q = Ncols * Nrows_q = 32
  Top row = (Nrows_q - 1); bottom row = 0
  Left column = (Ncols - 1); right column = 0]
FIG. 25

[Table of symbol definitions.
  Equation (1) parameter definitions
    N_final     The total number of values to be sorted in the full UCMS network
    Ncols       The number of sorted lists to be merged in each UCMS merge sequence; the number of columns in a UCMS rectangle
    Nrows_0     The number of rows in the Sequence 0 2-D array. Each column of the Sequence 0 2-D array is sorted by an Nrows_0-sorter
    q_final     The number of merge sort sequences in the network. Sequence q_final is the last sequence in the sorting network
    Equation (1): N_final = Nrows_0 * Ncols^(q_final)
  Rectangle definitions, q >= 1 (see the figure immediately above)
    N_q               The number of values in a Sequence q rectangle; N_q = Ncols * Nrows_q
    Nrows_q           The number of rows in a Sequence q rectangle
    Num_rectangles_q  The number of rectangles in Sequence q
  Time and resource units for sorter and network normalization
    2-sorter Equivalent Time  Propagation delay of one 2-sorter, or of a stage with only 2-sorters
    2-sorter Equivalent Use   Resources used for a 2-sorter in a particular hardware type
    UCMS                      Unified Column Merge Sort, this sorting network system]
FIG. 26

Sheet 24 of 40

The original list of 32 unsorted values for the 4-column UCMS example is
[list of values not legible in this extraction]
Sequence 0: Unsorted List of 32 into 16 Sorted Pairs
[Two 2-row by 16-column arrays, columns numbered 16 to 1: "Paired But Not Yet Sorted" and "Paired And Sorted". The cell values are not recoverable from this extraction.]
Sequence 0, Stage 1: Unsorted List of 32 Broken into 16 Pairs; Pairs Then Sorted
FIG. 27

Sequence 1: Four 2 Rows x 4 Columns Rectangles Merge Sort
Stage flow is from top to bottom in each rectangle column.
R/C: Row/Col deltas between successive diagonal selections.
[Four columns of 2x4 rectangles, one per rectangle, shown after each stage. The cell values are not recoverable from this extraction.]
  Sequence 1, Stage 1: Build four 2x4 rectangles and sort all rows
  Sequence 1, Stage 2: 1/1 Row/Column diagonal sort
  Sequence 1, Stage 3: 1/2 Row/Column diagonal sort
  Sequence 1, Stage 4: 1/3 R/C diagonal sort; no "IntRows" sort since there are no internal rows
  Sequence 1 complete: each of the four 2x4 rectangles is now sorted
FIG. 28

Sheet 25 of 40

Sequence 2: Merge Sort Final Rectangle, 8 Rows x 4 Columns
Stage flow is left to right, starting in the top rectangle.
[The 8x4 rectangle, rows 7 to 0 and columns 3 to 0, shown after each stage. The cell values are not recoverable from this extraction.]
  Stage 1: Sort all rows
  Stage 2: 2/1 R/C diagonal sort
  Stage 3: 1/1 R/C diagonal sort
  Stage 4: 1/2 R/C diagonal sort
  Stage 5: 1/3 Diag/IntRows sort
  Done: sorted order, 32 down to 1
R/C: Row/Col deltas between successive diagonal selections.
Stage 5: The last stage in a sequence includes the UCMS 1/(Ncols - 1) R/C diagonal sort. It also includes the "IntRows" sort for even Ncols values > 2.
IntRows sort in last stage: internal values of a row are sorted, from 1 to (Ncols - 2); internal rows are sorted from 1 to (Nrows - 2).
FIG. 29

Sheet 26 of 40

The original list of 9 unsorted values for the 3-column standard UCMS example is
4 6 9 2 7 1 8 5 3

Sequence 0: Unsorted List of 9; 3 Lists of Length 3
[Two 3x3 arrays: "Lists of 3/3/3 Built But Not Yet Sorted" and "Lists of 3/3/3 Built And Sorted". The cell values are not recoverable from this extraction.]

Sequence 1: Final Single Rectangle With 3 Rows x 3 Columns
Stage flow is left to right, starting in the top rectangle.
R/C: Row/Col deltas between successive diagonal selections.
[The 3x3 rectangle shown after each stage. The cell values are not recoverable from this extraction.]
  Stage 1: Build 3x3 rectangle; sort all rows
  Stage 2: 1/1 R/C diagonal sort
  Stage 3: 1/2 R/C diagonal sort
  Final sorted list of 9 values
FIG. 30

Sheet 27 of 40

The original list of 8 unsorted values for the 2-column standard UCMS example is
4 6 2 7 3 1 5 8

Sequence 0: Unsorted List of 8 Into 4 Sorted Pairs
[Two 2x4 arrays: "Paired But Not Yet Sorted" and "Paired And Sorted". The cell values are not recoverable from this extraction.]
Sequence 0, Stage 1: Unsorted List of 8 Broken into 4 Pairs; Each Pair Sorted

Sequence 1: Two 2 Rows x 2 Columns Rectangles Merge Sort
Stage flow is from top to bottom in each rectangle column.
[Sequence 1: 2x2 Rectangle 0 and 2x2 Rectangle 1, with per-row annotations such as "// 6,4 swapped" and "// No change: 3,1". The cell values are not recoverable from this extraction.]
  Sequence 1, Stage 1: Two 2x2 rectangles created; rows sorted
[Per-row annotations for the two 2x2 rectangles after the diagonal sort: "// No change: 4,3", "// 8,5 swapped", "// 7,2 swapped", "// 7,5 swapped". The cell values are not recoverable from this extraction.]
  Sequence 1, Stage 2: Two 2x2 rectangles; 1/1 Row/Column diagonal sort

Sequence 2: Final 4 Rows x 2 Columns Rectangle
Stage flow is left to right, starting in the top rectangle.
R/C: Row/Col deltas between successive diagonal selections.
[The 4x2 rectangle shown after each stage, with per-row annotations including "// 8,6 swapped", "// 7,5 swapped", "// 5,3 swapped", "// No change: 6,5", "// No change: 4,2", "// 7,6 swapped", "// 5,4 swapped" and "// No change: 3,2". The cell values are not recoverable from this extraction.]
  Sequence 2, Stage 1: 4x2 rectangle; sort rows
  Sequence 2, Stage 2: 2/1 R/C diagonal sort
  Sequence 2, Stage 3: 1/1 R/C diagonal sort
  Final 2-column sorted list of 8 values
FIG. 31

Sheet 28 of 40

[Table with one row per sequence, from Sequence 0 to Sequence q_final. Columns: sequence number q; number of sorters; Nrows_q in Sequence q; number of rectangles, Num_rectangles_q. The cell expressions are not recoverable from this extraction.]
FIG. 32

[Block diagram 450: UCMS sorting network.
  460   Streaming interface to host: unsorted data in, sorted data out, control; list length = N_final
  470   UCMS Sorting Network Top Level / Data Transfer Wiring
  480A  Sequence 0: 1 stage
  480B  Sequence 1: Ncols stages
  490   Final sequence has 1 rectangle: N_final = Ncols * Nrows_final. Number of stages = Ncols + CEILING(log2(Nrows_final / Ncols)) if the final sequence is not Sequence 1]
FIG. 33

Sheet 29 of 40

    Ncols, the number of lists to be merged and columns in the rectangles.
    Nrows_0, the number of rows in the Sequence 0 2-d array.
    N_final, the number of values in the input unsorted list and output sorted list.
    q_final, the last sequence number.
    Input: The 1-d
    list of N_final unsorted values
    Output: The 1-d list of N_final sorted values

     1  Transfer input 1-d list of N_final unsorted values to the Sequence 0 2-d array;
        // Sequence 0 2-d array has Nrows_0 rows and (N_final / Nrows_0) columns
     2  In Sequence 0, sort each column of the 2-d array with an Nrows_0-sorter;
        // The minimum value in each column goes to row 0; the maximum goes to row (Nrows_0 - 1).
        // Now perform sequence merge sorts.
     3  Set sequence q = 1; Nrows_q = Nrows_0; Num_rectangles_q = N_final / (Nrows_q * Ncols);
     4  repeat  // process each merge sort sequence
     5      Use 2-d array from Sequence (q - 1) to create Num_rectangles_q rectangles in Sequence q;
            // Each group of Ncols columns in the Sequence (q - 1) array produces one rectangle
     6      Process the rectangles through the stages of Sequence q;
     7      Transfer sorted data from the Sequence q rectangles to the Sequence q output 2-d array;
            // The sorted values in each rectangle become a column in the output array.
     8      if q is not q_final then set up the next sequence:
     9          q = q + 1; Nrows_q = Nrows_(q-1) * Ncols; Num_rectangles_q = Num_rectangles_(q-1) / Ncols;
    10      end if
    11  until Sequence q_final has been processed
        // The output array from Sequence q_final has N_final rows and 1 column.
    12  Transfer the single column of Sequence q_final output data to the final 1-d list of N_final sorted values;
FIG. 34

[Table: UCMS Standard Sequence q Stages With Column Delta and IntRows Specifications. Rows list the stages in order: initial row sort, a series of Row/Col diagonal sorts with their R/C deltas, the IntRows sort, and the final row sort. The cell values are not recoverable from this extraction.]
The last 2 (Ncols - 1) rows both refer to the last stage in any merge sort Sequence q, with a
FIG. 35

Sheet 30 of 40

[Table with one row per sequence of the 32-value, 4-column example. Columns: sequence; Nrows; columns; rectangles; stages; max R/C diagonal stage delta; number of sort columns; number of sort rows. The cell values are not recoverable from this extraction.]
FIG. 36
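The parameter bookkeeping in the FIG. 34 listing (steps 3 and 9) can be traced with a short sketch. This is an illustration under assumptions: the function name `ucms_schedule` and the dictionary layout are hypothetical, and the update rules are taken as Nrows_q = Nrows_(q-1) * Ncols and Num_rectangles_q = Num_rectangles_(q-1) / Ncols, which agree with the worked examples of FIGS. 27 to 31. Only the standard case is covered, where N_final divides evenly; the non-standard 8-value, 3-column example of FIG. 37 is rejected with an error.

```python
def ucms_schedule(n_final, n_cols, n_rows_0):
    """Return per-sequence shape parameters for a standard UCMS network."""
    if n_final % (n_rows_0 * n_cols) != 0:
        raise ValueError("N_final must be a multiple of Nrows_0 * Ncols")

    # Sequence 0: a 2-d array with Nrows_0 rows; each column gets one sorter.
    schedule = [
        {"sequence": 0, "rows": n_rows_0, "columns": n_final // n_rows_0}
    ]

    # Step 3: the first merge sequence reuses the Sequence 0 row count.
    q = 1
    n_rows = n_rows_0
    num_rectangles = n_final // (n_rows * n_cols)

    while True:
        schedule.append(
            {
                "sequence": q,
                "rows": n_rows,
                "columns": n_cols,
                "rectangles": num_rectangles,
            }
        )
        if num_rectangles == 1:
            # The final sequence holds every value in a single rectangle.
            return schedule
        if num_rectangles % n_cols != 0:
            raise ValueError("rectangle count is not a multiple of Ncols")
        # Step 9: each group of Ncols sorted rectangles forms one new rectangle.
        q += 1
        n_rows *= n_cols
        num_rectangles //= n_cols
```

For the 32-value, 4-column example this gives sixteen sorted pairs in Sequence 0, four 2x4 rectangles in Sequence 1, and one 8x4 rectangle in Sequence 2, matching FIGS. 27 to 29.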
36 The original list of 8 unsorted values for the 3 - column non - standard UCMS example is 4 6 2 7 3 1 5 8 Sequence o : Unsorted List of 8 ; 1 List of Length 2¡2 Lists of Length 3 2 2 1 2 A 1 7 S 2 word mm 3 2 a 7 5 8 6 1 Lists of 2/3/3 Built But Not Yet Sorted Lists of 2/3/3 Built And Sorted Sequence 1 : Merge Sort of Final Rectangle Stage Flow is left to Right , Starting in Top Rectangle Row R / C : Row / Col Deltas Between Successive Diag Selections 2 Row ROW 1 0 5 2 1 2 2 1 3 0 Stage 1 : Build 3x3 Rectangle : Sort All Rows 2 ROW 2 3 2 Stage 2:11 Diagonal Sort 2 1 2 7 3 an 1 2 1 0 1 Stage 3 : 1/2 R / C DiagonalSort w 6 3 3 8 7 2 3 Final Sorted List of 8 Values FIG . 37 U.S. Patent Jun . 14 , 2022 US 11,360,740 B1 Sheet 31 of 40 sodule 1685 network Hals N final. 32 Miss Bits . values in lists input (781 input ansertendist che dimensimial ( 31 : 0 1 Output ( output some list * bits wire ( 7 $ 1 #sartelists dimansional { 31 : 91 for secure wire to sorted distribencz X wire ( 7 : 8 sorted lists wire 17:01 sorted lists with % CWS 16 coixis 1:15:01 { : 1:01 { 15 ; 0 from secience ( 72 from sequence [ 32 sequence encols 4.4 Final 32 Nike S105 3 8 3380HS [ : 3 : 13 1 colu 32 ] SpieNic.instance & ) ir data for sequence of ansörteist 2 timenet } out data from: seguence of sorted listrom serience Spisnice I int * sequence 3 Ncols 4 x final 32 NER its& in tata kr Sekilence 3 ( sorted Lists from sequence out data from_sequence2 sostenlists from sequencement samedli** . 21mm sequence spiritistait sequence 2 kcals final 32 Num dits8 In data for sequence 2 ( 501960sts from soovesico out data from seguonok ; * } 3 one 11155_index 203VBI generate for { One 1355 one 21st Selist 1 / 83for the to the 2 Index index > one 3st Index 1 index norted Input list Input array used in Sequence 33 py there are 2 r $ 65 9 14 colcAM in the Secrience array assign unsortex lists 2 For Sequence I list index 16 ( osie : list_index. 
% 15 issukcesorted List one dimensionall one 1st Index ) ; 14 trasfertso sisgis so Ihad prasis the last sequence - output, array to the 3.- Sorted output list ASSES sutsutsorted_list_one dimensional ( one listindex sorted lists2.requenc . { 15kmine 1 { } end endgenerata dadurch FIG . 38 U.S. Patent Jun . 14 , 2022 US 11,360,740 B1 Sheet 32 of 40 module sequence_oNcols 4 N final 32 Num Bits & tjeje hi 8 bits 2 BW'S input ( 7 : 0 ) in data_for_sequence_ ( 1101 ( 15 : 81 ? output ( 7 : 0 ) out data from sequence @ 11:01 | 15:01 col2013 index genver generate 1 Instantiate the 16 2n sorters for ( Column Index Column index column_index 15 colum Index - 1 sont z values 8 bits ( Y begin sequence o sort 2 In 1 ( in data forsequence Int ( in data for sequence 8x value ( 11 1 ) Out out data from sequence ( 1 Out ( out data from sequence 1 A M nin value endgenerate andaarule FIG . 39 column index 1 ) , column index 1 ) , > column_index ) ) column index ) ) 1 U.S. Patent Jun 14 , 2022 US 11,360,740 B1 Sheet 33 of 40 R8yg3 $$$$$$$$$$$$$$$ 43zh _ { { …… & & xts ??? tp * {{ * **} ??? 833814 *** { $o $ RK? # # 3 # # # # # # # # 3 { " st_333 { * * * } { } } } } & f 888* , ?????_ { ????? ? Generste for ( rectangle Index ?? ? %f wire xxxe Wg { { :*} {+ : 1 ( 7:01 { } ;#} + : #3 X { fews $$ { { :} } { } } * ?????? { : # # # # 2 diagtdata 11 ( ( 3 ???????????? { { : # { $ ; ? 30 ? { { ; # } { } ; $ Stay 4 ; } < 3g_sex++ } & ? gs * ? "C33gessex ? xt #} # #? #} a } : : ; } { ?? p ? £ ata } tt Sq ?:???? , # ?* , fry Set - seq_2_row_sort_instance { $$$tage ( qYg % 81 { " S ** { ata } . $$$$$$?ata { $$$ ?? $ ???? 3 , 6 ?? , ??.?ds stage w?} Jow sex3 1 > text{} ? ets seq_dial,22 instance sele stage { , … { { { sta{ $$$$ *** ?? t } } ????? : ata { ? & ( ? 3 seguancay stage diagonal stage with row delta 3 ?? ???sta?? * $ ??????? *** ???? { , ? ags_? atafosts { } } .??? ts { ? g SexNCK I , stage A , last stage in sequence diagonal stage with R/ C ??????ag_33_ {sta? Seg_ ??????? 
{ » 33 & _433{ d À ¥ ts } } 14384 3 3 dkdata 413 4 $ { { ? ata } } & 1 3xas } } * * ?? das } } 3/3 3./(KCOLS-1 ) 3 ??t data } } ; {} ???????? ??? ?? ??? ? ?? ka 3 { 3 } { CB_Xxx ; ?? K } & ; ?? ?????? * * ??? > > ? . * ; }ssss8X < »• } {3 } split off 4 Colsons from sequence Input data to rectangle ros_sort_11_486x rray ? sig? ? 5 ? ka { fr ????? } { { { ? K In data for sequence CW_index 1 Col index + ( rectangleIndex * XCOLS ) } ; XXoay the final sorted data from this rectangle into & single column of the output data ? $$$ { {? 3 day' ??????????? { { ???????? * & } } { { ? } { " Eag{ { { X 3 ? a { { A ?at { { { x { { $ kxx ex } } 3 essed IX END OF " for rowindex ex? { ??? { ? * 3* { [ ??- ' Kur { ' ????? k ??? . 3 ??? RK, * * ; * ????????? ' ?????x3 FIG , 48 U.S. Patent Jun 14 , 2022 US 11,360,740 B1 Sheet 34 of 40 #xatu ? ? quit_K??.1_4katW A8 % & its ( ?? $ { {RAY { { } } & ??? t &_fast2 { ? utput { ? : ? } ???? $ # @YC & 2 { } } } } { } : 8 3 p3 "????? " Rectangle_index , colindex , row insiex j generate 0 ; rectangle index for { rectangleindex 3 { ft .& # xxx++ } … . 8 bits wif & } 7 : 3 ft Sorts ?? : { * * * } Kir { ? : } f ? KSp ???? ts { ? * * } { re { & 3?????ta { } : # } e { } } } } ? 3330 * 4 ta { } # } . ” { } } } } ????? { {???t? { } : ?.? { + 1 8 3 ?????ts { * * } { { start with sequence 2 , stage 1. row sort $$$?ste ???? { . % $$ * { { { { * # # ( 133 3 3 $ 3 3 # : : ; : : # # { # 8 # 1 } } } } } # } } } ; SAMSKRs13 { fo $ +k ???? } } , ** $ ??? { f { $²q? nce 2 , Stage 2 , ???????? stge w?t f? ? 33 * 1 , cg3 ? 3?*3 $ ??? aÁXX3state ????? xst pagate ????24_ Gu ? ata { ?? A { ,????33_3_data{ $ sequence 2x stage 3 , diagonál stage with row delta , coluniss delta 2 exastage $ { ??? ?8 ??: ? a } } ? k_e?? } } ???E_1_3_{ 3 { ??? ats { ?????? ??? } > ??????? { ????? a } }; / ? sequence 2 , stage ay diagonal stage wi03 A delta = mw coll delta $ ia2k3st333Cg ??????????????? { ????? ?? { ixGf4a ? ; } - ?????? SWX_ { ?????? 
} } } gitte * stage … , 1st stage 2 ???€xact : ????? $ ????? • 1 t { { NKH53 ) $ ².81stHg aist?RE ??. ; { + 44838xta { 41444fa3 } ?? + gu ? ad { ????? } } 1 } xap sequence input and output okta fe { X } &d for now index { } $ ? t of $ S j.col.index > # # ( 4Xxxsex** } best ; fjsex x * ; " ?????idgX •• } ??}}s from ????? ??? ??? # # #ectangle raspx_??? assign robsort .Indata ( Bow Index } [ col_index ) K in data for seigence_2 ( * _index 1 Col Index + ( rectangle index * XOLS ) 17 wap the sorted data from this rectangle into a single coluan of the output data assign out data from Stuence 2 Index * HQOLS > CO3 isdex tangle_index ) { { _ ? ???????? tg { " d ? ex } { { {}} { } } 7 end 17 ENS OF " for ( row_index ? d f} $ ?* ** p { ????????? - end 11 END OF **For ( rectangle index { ???" S Á ???? g X FIG , 41 U.S. Patent Jun . 14 , 2022 US 11,360,740 B1 Sheet 35 of 40 module seq_1_row_sort_stage 8 bits 2 rows 4 columns input [ 7:01 row sort in data output [ 7:01 Tow_sort_out_data ( 1 : 0 ] [ 3:01 YYY ) ; row index 11 Instantiate the 2 4- sorters for this rectangle genvar generate for ( row index rok index row index sort 4 values 8 bits 30 In 36 . In 26 . In 16 .In( w M row index Out ( Out 20 Out 10 .Out_o ( x $ 3 begin row sort_4 sorter row.sort in data row_sort_in data row_sort_in_data rowsort in data 1 / max value ? mys row sort out data row sort out data row sort out data row_sort_out data www 1 / min value end endgenerate endmodule FIG . 42 ( ( ( ( row index row_index row_index row index ] ] ]) ] 1 ( ( ( row row row row 1 ( 3 ) 1 1 2) 1 1 ] ] [ 0] M index index index index [ 33 kia)n adt rpetdan***orte[ltanlnget [ 1 [ 0] 3 ) U.S. Patent Jun . 
14 , 2022 US 11,360,740 B1 Sheet 36 of 40 module seg 2 diag.2.1 stage ( port List 8 - bit values input 17:01 dia output ( 4 columns in data 1 diag 2 1 out data [70 ] [30 ] // The passthrough assignments Vyaviy NYWAWNAWOWAVNAVAT assign diag2 bout datal diag 2 1 indata ( offet ?teret Peter tete vy // The sorter block instantiations sort.4.values 8 bits ( Sorter From row 1 column 3 „ In3 ( diag.2.1 in data 1 71 ( .In ( diag 310 „ In_2 (( dian 21 in data 1 .In in data ! (( diag.21 in data 1 max value Out ( diag2 out Out 2 ( diag 2.1 out .Out ( diag22 out Out o ( diag_2_3_out Y data ( data data / data 1 endmodule FIG . 43 510 > 21 ) ; 31 ) 21 ) 31 ) 3 U.S. Patent Jun . 14 , 2022 US 11,360,740 B1 Sheet 37 of 40 ALGORITHM 2 : Czestion of Module: Cxte for a Rectangle Diagonal Stage: laput: Nrox . Nools, tirea number of rows and columns in the rectangle Input: Kox Delta , the amber of rows howmen successive sorter location selections Isput: O Delta , the winter of columns between successiva sortar location selections Output: The Systein Verlog Module Code for Sorters and Passthroughs : Initialize all locations of rectangle array Used in Sorier to 0. 1o . false ; 2. Initialize string variable: module code ; row limit Vrows - Berta Delta w . ) ; > for colctart - Vcois - downto Ca Dalia by - doreste all diagonal axters if al start < Vcois . Ca Deita thesi. row Jimit Rok Delaws; end for row.saari - O tu row limit to Initialize location list to empty : locations in fil next rOWY start ;123.co col start , : while (next ION < Nrows) and (next 10 do 1xt row .Wtcol still in the rectangle ? 
fututions in list locations in U15 3 next ruie ext_sOW Ruw uitu ; next 60 next 60 !mCol della ; if ( excations in list then found at least 2 locations to sort Initialize sorter text for {{ocations in lignorier ; foreach location in luxation list do i Preserve on to max order Wild fixation as next highest input 3ad sutput posts text of sortir ext ; 13 € 30 3 2 end Sex ilseduin Sorterfination to lietne ; end 23. Re ? as if EVEN Vcois ) and ( Nools fox and ( Col Delta This is the latRow Sort section , Noods - 1 = 1 to Vrows - 2 do Initialize sorterex for ( Voals wer : for cox xm 1 to Neois - 2 do arve min to inax iritas erid Add locatim ruw hum.cu ! bum as next highest input and cxUtput ports text of sorter text : Set Used_in_arter location to 1. 10. true ; finaliu sorter text and add it to module code : » forcach location in Used In Sorter do X Create passé routs * 1 tften * ( Vrous > 2 ) that Ostate passthrough text for lixection and add it tomule code * Finalize module code and write it to stage module $ V ale ; FIG , 44 U.S. Patent Jun . 14 , 2022 US 11,360,740 B1 Sheet 38 of 40 The original list of 9 unsorted values for the 3 - column standard UCMS example is 4 6 9 2 7 1 8 5 3 Sequence o: Unsorted List of 9 ; 3 Lists of Length 3 3 Row 2 Row Seura 2 mm 1 0 Baterie 1 3 0 ww 4 2 1 7 6 Song 1 Lists of 3/3/3 Built But Not Yet Sorted Lists of 3/3/3 Built And Sorted Sequence 1 : Find the Median of the Final Single Rectangle With 3 Rows x 3 Columns Stage flow is left to right , Starting in Top Rectangle Row R/ C: Row /Col Deltas Between Successive Diag Selections ROW 2 2 .. 11 I 87 7. Min of Max's Median of Medians www 1 Max of Min's Stage 1 : Build 3x3 Rectangle: Sort All Rows Row 2 1 0 2 I 5 Final Median of g Values FIG . 45 2 1 7 .. malo Stage 2 : 1/1 R / C Median of 3 U.S. Patent Jun . 
[FIG. 46 (Sheet 39 of 40): drawing. Median of a 5x5 window using UCMS sequence flow, Nfinal = 25, Ncols = 5. Sequence 0: 5 lists of length 5, built and sorted. Sequence 1: find the median of the final single rectangle with 5 rows x 5 columns, with Stage 1 (build 5x5 rectangle, sort rows), Stage 2 (R/C 1/1 diagonal), and Stage 3 (R/C 1/2 diagonal, median of 3).]

[FIG. 47 (Sheet 40 of 40): drawing. Max of a 5x5 window using UCMS sequence flow, Nfinal = 25, Ncols = 5. Sequence 0: 5 lists of length 5, built and sorted. Sequence 1: find the max of the row of max's, with Stage 1 (sort the max row).]

SINGLE-STAGE HARDWARE SORTING BLOCKS AND ASSOCIATED MULTIWAY MERGE SORTING NETWORKS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 62/984,880, filed on Mar. 4, 2020, all of which are incorporated by reference.

FIELD OF THE INVENTION

The invention relates generally to sorting lists of values in hardware. More specifically, the invention relates to single-stage sorting blocks and associated multiway merge sorting networks.

BACKGROUND OF THE INVENTION

Hardware sorting systems use single-stage 2-sorters, or comparators, and 2-max and 2-min filters in their sort processes. These single-stage hardware blocks have 2 input values and a block which compares those 2 input values; the comparison result signal is used as the output multiplexer (MUX) select line, or control input signal, for the block's output ports. A 2-max filter presents only the maximum (max) of the 2 inputs; a 2-min filter presents only the minimum (min) of the 2 inputs. A 2-sorter presents both the max and min sorted outputs. A schematic of a 2-sorter, with both 2-max and 2-min output ports, is shown in FIG. 1.

A hardware 2-sorter may be made into a 2-max or 2-min rank order filter by removing the output multiplexer logic for the output port not used, but there is no propagation delay improvement for such a rank order hardware block. Propagation delay is the time required for an input signal to propagate to an output along the slowest path in a single-stage or network sorting block.

Single-stage hardware N-sorters directly sort more than 2 values at a time when N≥3. Certain 3-sorters, for example those for a 3-way merge sort process, create their sorters from 3 serial stages of 2-sorters. Therefore, these 3-sorters are very slow, taking 3 times longer than a single 2-sorter. A sorting network using these 3-sorters becomes a two level network of 2-sorter networks.

A sorting network consists of a network of small single-stage hardware sorters and filters, connected in such a way as to sort lists larger than what can be sorted by a single-stage sorter or filter. The small N-sorters and N-filters used in traditional sorting networks are 2-sorters and 2-max and 2-min filters. An advantage of N-sorters when N≥3 is that fewer hardware resources may be used for a single-stage hardware sorter versus a multi-stage network of 2-sorters.

Single-stage hardware 2-sorters may be connected to operate in parallel in each stage of the sorting process. This is considered a sorting network with a purpose to sort unsorted input values in a fast and efficient manner, and to output the full sorted list of those same values. When a sorting network only uses 2-sorters, even small lists with more than 2 values must be sorted with a sorting network.

Single-stage sorting blocks are used in various sorting algorithms, such as Odd-Even Merge Sort (O-EMS) and Bitonic Merge Sort. Both algorithms take the same amount of time to sort a list of values, but Bitonic Merge Sort uses more hardware resources in its networks than O-EMS. O-EMS can also be used to build fast max or min sorting network rank order filters, but Bitonic Merge Sort does not determine a max or min value until it has performed a full list sort. Both algorithms use merge sorts of 2 sorted lists, and the only single-stage hardware sorters that are used in these sorting networks are 2-sorters.

For merging two large, sorted lists, John von Neumann's Merge Sort is typically used. However, the basic algorithm is very slow, as only the max or min of 2 values is selected in each clock cycle. Because of this, merge sequences from O-EMS sorting networks are often used to increase the number of output values in each clock cycle.

Rank order filters may be used to select an element from an ordered output list. Rank order filters do not produce a full list of sorted values from an unsorted list. Rather, they produce only a partial list of the sorted values, and often there is only one filtered value that is output. Typical rank order filters produce the max, median, and/or min values from an unsorted input list. Multiway merge sorting networks may be used as rank order filters, for example, to sort network median rank order filters or to sort network max and min rank order filters.

What is needed is an improved system and methods for designing single-stage hardware sorting blocks, and further using the single-stage hardware sorting blocks to reduce the number of stages in multistage sorting processes, or to define multiway merge sorting networks. The invention satisfies this need.

SUMMARY OF THE INVENTION

The invention is directed to a general methodology for the systematic design of single-stage hardware N-sorters with N≥3. All of the hardware sorters produced in accordance with this and the following hardware N-sorter embodiments produce a "stable sort". That is, any duplicated values in the input list are distributed to the output ports in the same relative order found in the input list. This may be important, for example, when the values to be sorted are keys in key/value pairs.

The single-stage sorting blocks comprise a set of at least 3 input values, contained in one or more lists. There is one list of sorted output values, containing the input values, now in sorted order. A full sorter presents all sorted output values, and a filter presents a subset of the full sorted list. The output ports are defined using output multiplexers, one port multiplexer per each output value bit.

At least three 2-input comparisons are implemented in parallel. The comparison result signals may be used directly as select lines for the output multiplexers, or they may be combined in various ways in order to define the output multiplexer (MUX) select lines, or control input signals. The multiplexer select line operations inside the output bit multiplexers are all performed in parallel.
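The single-stage structure summarized above (three 2-input comparisons evaluated in parallel, combined into select signals for the output multiplexers, with a stable result) can be sketched as a small behavioral model. This is a minimal Python sketch rather than the patented System Verilog design: the function name `sort3`, the use of (key, payload) tuples, and the rule that a duplicate key on a higher-numbered input is treated as the larger one are assumptions made for illustration.

```python
from itertools import product


def sort3(in2, in1, in0):
    """Behavioral model of a single-stage 3-sorter on (key, payload) pairs.

    Returns (out2, out1, out0), with out2 the maximum and out0 the minimum.
    """
    # Three 2-input comparisons; hardware would evaluate these in parallel.
    # Assumption: ">=" with the higher-numbered input on the left, so a
    # duplicate key on a higher-numbered port counts as the larger value.
    ge_2_1 = in2[0] >= in1[0]
    ge_2_0 = in2[0] >= in0[0]
    ge_1_0 = in1[0] >= in0[0]

    # Combine the comparison results into a count of comparisons won per input.
    wins = {
        2: int(ge_2_1) + int(ge_2_0),
        1: int(not ge_2_1) + int(ge_1_0),
        0: int(not ge_2_0) + int(not ge_1_0),
    }

    # One output multiplexer per port: Out_y selects the input that won y times.
    inputs = {2: in2, 1: in1, 0: in0}
    outs = []
    for y in (2, 1, 0):
        selected = [x for x in (2, 1, 0) if wins[x] == y]
        outs.append(inputs[selected[0]])
    return tuple(outs)


# Duplicate keys keep their relative port order: In_1 stays above In_0.
print(sort3((1, "x"), (7, "y"), (7, "z")))

# Exhaustive check over small keys, tagging each input with its port number.
for k2, k1, k0 in product(range(3), repeat=3):
    result = sort3((k2, 2), (k1, 1), (k0, 0))
    assert list(result) == sorted(result, reverse=True)
```

Because the tie rule makes the three comparison results consistent with a single total order, the three win counts are always 0, 1, and 2 in some arrangement, so each output multiplexer selects exactly one input.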
The systematic design of single-stage hardware N-sorters according to the invention is appropriate for any type of hardware in which a design can be implemented using a Hardware Description Language (HDL), such as a Field Programmable Gate Array (FPGA). It is contemplated that the invention may be implemented in any known HDL language, for example, System Verilog (SV). It is further contemplated that the invention may be implemented in C (including C++) language.

The invention is also directed to single-stage rank order N-filters, which present as outputs only a subset M of the N sorted inputs, with M < N. N-filters also work on a list of totally unordered input values. Some of these N-filters, such as hardware median filters, simply output values from the full sorted output list, without any change in the design for the specific values that are output. However, single-stage hardware N-max and N-min filters are often specially designed in order to improve the speed of the filters, versus the speed of the associated full N-sorter.

The invention is also directed to single-stage N-sorters used to enable fast multiway merge sorting networks. A multiway merge sorting network includes one or more merge sequences, in which 3 or more sorted lists are merged into a single sorted output list. After the final merge sequence, all of the unsorted inputs are presented in a full sorted output list of those unsorted input values.

The invention is also directed to the design of rank order sorting network filters, where only a subset of the sorted output values are produced and provided as filter outputs. These rank order sorting network filters have reduced resource usage, versus the corresponding network that outputs all of the sorted input values. In some cases, such as max and min sorting network filters, the filter speed is much faster than the corresponding network which outputs all of the sorted input values. Max and min multiway merge sorting network filters, where 3 or more max/min values are merged in each stage, are also shown to be much faster than prior art max-and-or-min sorting network filters using 2-way merge sort, which are restricted to only using 2-max and 2-min single-stage hardware filters.

The invention and its attributes and advantages will be further understood and appreciated with reference to the detailed description below of presently contemplated embodiments, taken in conjunction with the accompanying drawings.

DESCRIPTION OF DRAWINGS

The preferred embodiments of the invention will be described in conjunction with the appended drawings provided to illustrate and not to limit the invention.

FIG. 1 is a block diagram illustrating a prior art 2-sorter.
FIG. 2 is a block diagram illustrating a general hardware N-sorter.
FIG. 3 illustrates code for a port list creation.
FIG. 4 is a flow chart directed to the design steps of a general hardware N-sorter.
FIG. 5 illustrates code for comparison signals.
FIG. 6 illustrates code for output port assignments.
FIG. 7 is a flow chart directed to the design steps for building multiplexer select line signals.
FIG. 8 illustrates code for product terms.
FIG. 9 illustrates 3-sorter code created using the general hardware design embodiments according to the invention.
FIG. 10 is a block diagram of a modified general hardware N-sorter.
FIG. 11 is a hardware sorter table.
FIG. 12 illustrates propagation delay and resource usage of N-sorters and N-max filters using a 4-LUT logic block.
FIG. 13 illustrates a 4-sorter code according to the invention.
FIG. 14 illustrates a 5-sorter code according to the invention.
FIG. 15 illustrates another embodiment of a 5-sorter code according to the invention.
FIG. 16 illustrates OR Signals for 6-, 7-, 8-, and 9-sorters in 2nd MUX select line block.
FIG. 17 illustrates a 9-sorter Sum of Products (SOP) equation in 4-LUTs.
FIG. 18 illustrates code including input equations combined.
FIG. 19 illustrates bit multiplexer code.
FIG. 20 illustrates bit multiplexer behavioral code.
FIG. 21 illustrates pseudocode for 4-min and 4-max single stage hardware filters.
FIG. 22 illustrates pseudocode for 5-max single stage hardware filters.
FIG. 23 illustrates pseudocode for 8-max single stage hardware filters.
FIG. 24 illustrates a flow chart for creating N-to-M filter from a general hardware N-sorter.
FIG. 25 is a table of UCMS 4-column sorted order.
FIG. 26 is a table of UCMS notations and abbreviations.
FIG. 27 is a UCMS sorting network example for Sequence 0: 4-column, Nfinal = 32.
FIG. 28 is a UCMS sorting network example for Sequence 1: 4-column, Nfinal = 32.
FIG. 29 is a UCMS sorting network example for Sequence 2: 4-column, Nfinal = 32.
FIG. 30 is a UCMS sorting network example for sequence flow: 3-column, Nfinal = 9, Ncols = 3.
FIG. 31 is a UCMS sorting network example for sequence flow: 2-column, Nfinal = 8, Ncols = 2.
FIG. 32 is a table of a combined equation.
FIG. 33 illustrates a block diagram of a top level UCMS network.
FIG. 34 is an algorithm for the top level UCMS network.
FIG. 35 is a table of UCMS Sequence 1 stages.
FIG. 36 is a table of various parameters and stage order: Nfinal = 243, Ncols = 3.
FIG. 37 is a non-standard sequence flow: Nfinal = 8, Ncols = 3.
FIG. 38 is code for 4-column UCMS example: Nfinal = 32, Ncols = 4.
FIG. 39 is code for 4-column UCMS example, Sequence 0: Nfinal = 32, Ncols = 4.
FIG. 40 is code for 4-column UCMS example, Sequence 1: Nfinal = 32, Ncols = 4.
FIG. 41 is code for 4-column UCMS example, Sequence 2: Nfinal = 32, Ncols = 4.
FIG. 42 is code for 4-column UCMS example row sort, Sequence 1: Nfinal = 32, Ncols = 4.
FIG. 43 is code for passthrough and 4-Sorter instantiation from 4-column example Stage: R/C = 2/1.
FIG. 44 is an algorithm used to create module code for a diagonal stage.
FIG. 45 is a median of 3x3 window using UCMS sequence flow: Nfinal = 9, Ncols = 3.
FIG. 46 is a median of 5x5 window using UCMS sequence flow: Nfinal = 25, Ncols = 5.
FIG. 47 is a max of a 5x5 window using UCMS sequence flow: Nfinal = 25, Ncols = 5.

DETAILED DESCRIPTION

The invention is directed to designing single-stage hardware sorting blocks, and further using the single-stage hardware sorting blocks to reduce the number of stages in multistage sorting processes, or to define multiway merge sorting networks.

The invention is discussed with respect to Hardware Description Language (HDL) in the form of System Verilog (SV) for exemplary purposes only; any HDL is contemplated. It is further noted that the invention may be implemented in C (including C++) language.

Single-Stage Hardware N-Sorter

FIG. 2 is a block diagram illustrating a general hardware N-sorter 100 according to the invention. These hardware N-sorters sort a list of N input values, and return the full sorted list of the same N values as outputs. A single-stage hardware sorter has one set of N input ports, one set of N output ports, and whatever internal logic is needed to produce the sorted list of values at the output ports. At the output ports, a single-stage hardware N-sorter produces a fully sorted list of N values for any permutation of the N input values.

In contrast to a single-stage hardware sorter, a network sorter has multiple operation stages. In each stage of the network sorter, several single-stage hardware N-sorters operate in parallel. Network sorters, using multiway merge sort, are discussed further below.
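The contrast between a single-stage sorter and a network sorter can be illustrated with a short behavioral model. The following is a minimal Python sketch under stated assumptions: `single_stage_sort` is a hypothetical model that ranks each input by the number of pairwise comparisons it wins (ties going to the higher-numbered input), and `STAGES_4` is the textbook 4-input odd-even merge layout of 2-sorters, used here only as a familiar example of a multistage network and not taken from the patent figures.

```python
from itertools import permutations


def single_stage_sort(values):
    """Model of a single-stage N-sorter; returns a list with Out_0 (minimum) first."""
    n = len(values)
    out = [None] * n
    for x in range(n):
        # Rank of input x is the number of pairwise comparisons it wins.
        # Assumption: on equal values the higher-numbered input wins.
        wins = sum(
            (values[x] >= values[y]) if x > y else (values[x] > values[y])
            for y in range(n)
            if y != x
        )
        out[wins] = values[x]
    return out


# Three stages of 2-sorters for 4 inputs (textbook odd-even merge layout).
STAGES_4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]


def network_sort4(values):
    """Model of a 3-stage network sorter built only from 2-sorters."""
    v = list(values)
    for stage in STAGES_4:
        # The 2-sorters in one stage touch disjoint positions, so in hardware
        # they would operate in parallel; stages run one after another.
        for lo, hi in stage:
            v[lo], v[hi] = single_stage_sort([v[lo], v[hi]])
    return v


# Both models sort every permutation of 4 distinct values.
for p in permutations(range(4)):
    assert single_stage_sort(list(p)) == network_sort4(p) == [0, 1, 2, 3]
```

In this sketch the single-stage model produces all four outputs from one layer of comparisons, while the network of 2-sorters needs three sequential stages to reach the same result.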
For any hardware sorter in this embodiment, the unsorted input list of N values is applied to the sorter input ports, which are labeled In_Nm1 down to In_0, where Nm1 is the number N-1. The sorted output list of values is presented at the sorter output ports, which are labeled Out_Nm1 down to Out_0, with Out_Nm1 being the maximum value, and Out_0 the minimum value.

The various embodiments are discussed with respect to target 8-bit unsigned numbers for exemplary purposes only. FIG. 3 shows a SV port list code for a 9-sorter. As shown, the input and output ports are unsigned values with bit indices from MAX_BIT_INDEX down to 0. The number BITS_PER_VALUE is then defined as (MAX_BIT_INDEX + 1). In this figure, MAX_BIT_INDEX is equal to 7, so BITS_PER_VALUE is 8; the input and output ports are 8-bit unsigned values. Although the example port list shown in FIG. 3 is used for 8-bit unsigned numbers, any number type and any bit width is contemplated.

FIG. 4 is a flow chart 200 directed to the design steps of a general hardware N-sorter according to the invention. As shown in FIG. 2, the N-sorter 100 includes a Comparison Signals Block 120, an Output MUX Select Line Signals Block 140, and an Output MUX (Multiplexer) Block 160.

The Comparison Signals Block 120 is the first design block in any single-stage hardware N-sorter. As shown by step 202 of FIG. 4, a list of N unsorted data input values are applied to input ports, where N≥3, and each N-sorter internal input data value is supplied by an input port. The Comparison Signals Block 120 performs, in parallel, all possible 2-value comparisons for the N input values as shown by step 204. This is performed using a comparison operator to generate, in parallel, all N*(N-1)/2 possible 2-value comparison result signals for the list of N data input values. It should be noted that it is assumed that efficient comparison hardware is created whenever a comparison of 2 values is specified by a given hardware type. As a result, there may be no need to modify any of the 2-value comparison hardware blocks that are automatically created. The input which is located higher in the input list is on the left side of the comparison operator, and the input which is located lower in the input list is on the right side of the operator.

The following is discussed with respect to a comparison operator that is "greater than or equal" (≥) for exemplary purposes only. This is one embodiment of the invention and any comparison operator is contemplated.

At step 206, an order is enforced for identical input values. An input value located higher in the input list is judged to be greater than an identical input value located lower in the input list. This enforced order, in which the input value on the left side of the "≥" operator must have a larger numeric suffix than the input value on the right side of the operator, is essential for at least two reasons. First, the enforced order allows groups of duplicate values to be successfully sorted in the same manner as if all input values were distinct. Second, when the enforced order is combined with the "≥" comparison operator, an N-sorter always produces a stable sort, a sort in which the output order of duplicate values (e.g., keys in key/value pairs) is the same as the input order of those duplicate values.

It should be noted that any enforced order is contemplated so long as groups of duplicate values are processed as if they are distinct values, and the order of duplicate values in the output list matches the relative order of those values in the input list.

FIG. 5 illustrates the code for the 36 comparison signals for a 9-sorter. Each of the N input values is compared, one at a time, to every other value. This specification uses the "≥" ("greater than or equal") operator for each comparison, and the comparison signal names all begin with "ge" to help emphasize the comparison operator that is being used.

It should be noted that a sorter smaller than a 9-sorter uses a subset of the code shown in FIG. 5. A 2-sorter only needs the ge_1_0 declaration, a 3-sorter only needs the ge_2_1, ge_2_0, and ge_1_0 declarations, and a 4-sorter requires only the ge_3_2, ge_3_1, ge_3_0, ge_2_1, ge_2_0, and ge_1_0 declarations. For a sorter smaller than a 9-sorter, the unneeded declarations listed can be disregarded (e.g., deleted or commented out). For a sorter larger than a 9-sorter, comparison variables are added; for example, a 10-sorter adds 9 comparison variables from ge_9_8 down to ge_9_0, in which In_9 is compared to the other 9 In_X's. The ge_9_8 variable would compare In_9 to In_8, and the ge_9_0 variable would compare In_9 to In_0. In these additional signal comparison definitions, In_9 is always on the left side of the comparison operator.

The Output MUX Block 160 of FIG. 2 is also found in every N-sorter. In this block, for each of the N output ports, one of the N data inputs is selected to go to that particular output port. More specifically, as shown by step 208 of FIG. 4, a set of multiplexers is provided, with each multiplexer having N data input signals and N-1 multiplexer select line signals, i.e., whatever select line input signals are required in order to choose the correct input data line to be sent to the multiplexer output. As shown in FIG. 2, the data lines come directly from the input ports to the multiplexers, and enter the group of Output MUX Blocks 160 at the top. The multiplexer select line signals enter the group of Output MUX Blocks 160 from the left, and are delayed by the amount of series logic used to produce the select line signals.

Output port assignments are created in a straightforward manner, as shown in FIG. 6. Output port assignments may use ternary or conditional syntax, and use multiplexer select line signals to determine which of the N inputs goes to a particular output. Since there are N input signals and N-1 input MUX select line signals in each output port assignment, there are always (2*N)-1 input signals per assignment in the general hardware design. As an example, a 9-sorter output assignment would have (2*9)-1 = 17 input signals in the assignment.

In the Output MUX Select Line Signals Block 140 shown in FIG. 2, the MUX select line signals required by the Output MUX Block 160 are built. The multiplexer select line signals propagate through an amount of series logic used to produce the multiplexer select line signals.

Using Hardware Description Language (HDL) in the form of System Verilog (SV), the MUX select line signals have an "In_X_goes_to_Out_Y" naming convention. The MUX select line signals determine which In_X input value goes to a particular Out_Y port. For example, when one of these signals is a 1, then that particular In_X input value is distributed to Out_Y. For a particular Out_Y signal, a maximum of one In_X_goes_to_Out_Y signal can have a value of 1 for a specific set of N input values. It should be noted that there is no In_0_goes_to_Out_Y signal used in the conditional assignment. If none of the In_Nm1_goes_to_Out_Y down to In_1_goes_to_Out_Y signals are true, then In_0 must be the input value that goes to output Out_Y.

Each In_X_goes_to_Out_Y signal is defined by a Sum-of-Products (SOP) equation, in which each product term contains the true or complemented signal states for the N-1 comparison signals in which In_X is compared to other input values.

The In_X_goes_to_Out_Y multiplexer select line signals may be created according to a version of comparison counting. It should be noted that the counting is not performed in the hardware that is ultimately built, but that the counting is used to create each In_X_goes_to_Out_Y SOP signal, which is then implemented in hardware in a simple manner, for example by being installed in a Look Up Table (LUT) described in further detail below.

At step 212 of FIG. 4, a sorted list of values is output to output ports, wherein the order of duplicate values in the output list matches the order of those values in the unsorted input list.

FIG. 7 is a flow chart 300 directed to the design steps for building multiplexer select line signals according to an embodiment of the invention.

The common feature for each product term in FIG. 8 is that each product term in the SOP equation has 5 wins.

At step 314, it is determined which output port the input value is assigned to, which is indicated by the number of "wins".

The invention provides a general hardware design with straightforward creation of Comparison signals, Output MUX Select Line signals, and Output MUX signals, that produce an efficient and fast hardware N-sorter that correctly processes duplicate list values, and produces a stable sort of duplicate list values as well. FIG. 9 shows the SV code for a 3-sorter designed in accordance with the invention, which correctly processes duplicate values, and produces a stable sort of those duplicate values.

Advantageously, the above described general design system and methods may be modified for use in FPGAs or when using a particular hardware type. Examples of hardware types include a logic block with either 4 or 8 6-input Look Up Tables (LUTs), and a set of 2-to-1 multiplexers used to combine LUT outputs, if needed.

For discussion purposes, a 4-LUT design logic block is used that has 4 LUTs, 3 2-to-1 multiplexers, 27 LUT and multiplexer select line inputs, and 7 outputs. These logic
At step 302 all 2N -1 possible blocks may be referred to as “ slices ” or “ slice logic blocks ” . product terms are created for each of the N data inputs, with When adapting the general N - sorter design methodology for each product term containing all of the N - 1 comparison use in the target FPGAs, the speed of the N - sorter operation performed in the process used to create a particular 15 Single -Stage Hardware N -Sorter with Particular Hardware Type signals for this input, and with each comparison signal is considered by minimizing the number of series slices that specified in its inverted or non - inverted state . At step 303 , a 30 an N - sorter's slowest signals propagate through from the product term is selected . At step 304 , it is determined if the input ports to the output ports . Also , the number of LUTS data input signal is on the left side of the comparison needed for each output multiplexer as well as the total 9 operator, and the comparison signal state is non - inverted . If number of LUT resources required for a given sorter design “ yes ” , a “ win ” is assigned at step 308. If no , at step 306 , it are minimized . is determined if the data input signal is on the right side of 35 FIG . 10 illustrates a block diagram of aa modified general the operator, and the comparison signal state is inverted . If hardware N - sorter according to the invention . According to “ yes ” , a “ win ” is assigned at step 308. After a “ win ” is this embodiment, the general hardware N - sorter design is assigned at step 308 , it is determined if this is the last modified for logic blocks with LUTs and multiplexers, e.g. , comparison for the product term at step 309. If no ” , the next a logic block with 4 6 - input LUTs or an 8 - LUT logic block . comparison result signal and its state in the product term is 40 As shown in FIG . 10 , the N - sorter 400 includes a Compari selected at step 303. 
At step 310 , the “ wins ” for each product son Signals Block 410 , two Output MUX Select Line term are summed . Once the Number_of_Wins is determined Signals Blocks 420A , 420B , and an Output MUX (Multi for a given product term , that product term is added to the plexer ) Block 440. Each block in FIG . 10 represents a group SOP equation for signal In_X_goes_to_Out_ (Num- of slices operating in parallel, and the number of slice groups ber_of_Wins ). 45 in series is listed for each of the possible paths that go A11 2N -1 product terms are distributed to the various through the Comparison Signals Block 410. The possible In_X_goes_to_Out_Y equations. During the creation of the paths through the Comparison Signals Block 410 are the various In_X_goes_to_Out_Y equations, an In_0 slowest paths, the paths that determine propagation delay. goes_to_Out_Y equation can be created . However, as men- The fastest sorters are those in which the slowest signals tioned previously, In_0 goes_to_Out_Y signals are not used 50 propagate through only 2 slice groups, and the slowest in the Output MUX Block , so no In_0 goes_to_Out_Y sorters are those in which its slowest signals travel through equations are put into the SV code for the hardware sorter all 4 slice groups in FIG . 10 . embodiments. FIG . 11 is directed to a table that lists various parameters For an N - sorter, there are N In_X_goes_to_Out_Y equa- for both the general design embodiment sorters and the tions for each of the N - 1 inputs, from In_Nml down to 55 sorters created in this LUT sorter embodiment. Row 3 of this In_1. There are then a total N * (N - 1 ) In_X_goes_to_Out_Y table lists the number of N data inputs, plus the number of equations created in the SV code , each of which is ultimately comparison signals required to sort those inputs. In this data used as a MUX select line signal in one of the N Output row , it can be seen that both a 2 - sorter and 3 - sorter have 6 MUX Block equations. 
or fewer such signals . As a result , the associated At step 312 , each product term is added to the input's 60 In_X_goes_to_Out_Y signals is implemented in the same particular SOP equation in which each product term in the 6 - input LUT that implements an Output Multiplexer. There SOP equation has that same number of “ wins ” . An example fore, the signals for these two sorters propagate only through of such an In_X_goes_to_Out_Y SOP equation is shown in the Comparison Signals Block and the Output MUX Block FIG . 8 , which shows the 56 product terms for the 9 - sorter shown in FIG . 10 . signal In_5_goes_to_Out_5 . The highlighted product term 65 When there are Output MUX Block 440 changes, changes in FIG . 8 contains the state of ge_7_5, and the 1 states of are required for the output MUX select lines as well . The ge_5_4 , ge_5_3 , ge_5_1, and ge_5_0 , for aa total of 5 wins . select line signal changes are implemented in the 1st MUX US 11,360,740 B1 9 Select Line Signals Block 420A and possibly the 2nd MUX Select Line Signals Block 420B . Since the 2 - sorter and 3 - sorter output bit multiplexers 10 A bit output multiplexer for a 4 - sorter can be fit into a single 6 - input LUT, using the 4 - to - 1 multiplexer design discussed above . For sorters larger than a 4 - sorter, more than only require sorter input data and comparison result signals 1 LUT is required per output bit multiplexer. In this embodi as their inputs, these two sorters have the minimum 2 series 5 ment, multiple LUTs required for an output bit multiplexer slices . With the input signals for both the 2 - sorter and are placed in the same slice logic block . For 5 - sorters up to 3 - sorter propagating through only 2 slice logic blocks , both 8 -sorters , 2 LUTs are required per output bit multiplexer. sorters have approximately the same propagation delay. 
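The comparison-counting construction of steps 202 to 212 and 302 to 314 can be illustrated with a small software model. The following Python sketch is not the SV code of the figures and builds no hardware; it only mimics the behavior described above, under the assumption that Out_0 carries the smallest value and Out_(N-1) the largest. The function names (`comparison_signals`, `count_wins`, `n_sorter`, `product_terms_by_wins`) are hypothetical and chosen for illustration.

```python
from itertools import product

def comparison_signals(keys):
    """Model of the Comparison Signals Block: ge[(i, j)] stands for
    ge_i_j = (In_i >= In_j), defined only for i > j, i.e. N*(N-1)/2 signals."""
    n = len(keys)
    return {(i, j): keys[i] >= keys[j] for i in range(n) for j in range(i)}

def count_wins(x, ge, n):
    """Count the comparisons won by In_x.
    In_x on the left  (ge_x_j): a win when the signal is 1.
    In_x on the right (ge_i_x): a win when the signal is 0."""
    left = sum(ge[(x, j)] for j in range(x))
    right = sum(not ge[(i, x)] for i in range(x + 1, n))
    return left + right

def n_sorter(values, key=lambda v: v):
    """Route each input to the output port whose index equals its win count."""
    n = len(values)
    ge = comparison_signals([key(v) for v in values])
    out = [None] * n
    for x in range(n):
        out[count_wins(x, ge, n)] = values[x]   # In_x_goes_to_Out_(wins)
    return out

def product_terms_by_wins(n, x):
    """Enumerate all 2**(n-1) product terms for In_x and group them by wins.
    Each term is a tuple of states for the n-1 comparison signals of In_x."""
    partners = [p for p in range(n) if p != x]
    groups = {y: [] for y in range(n)}
    for states in product((0, 1), repeat=n - 1):
        wins = sum(s == 1 if x > p else s == 0 for p, s in zip(partners, states))
        groups[wins].append(states)
    return groups

if __name__ == "__main__":
    print(n_sorter([5, 3, 5, 1]))
    print(len(product_terms_by_wins(9, 5)[5]))
```

Because ties are always resolved in favor of the input with the larger numeric suffix, every input receives a distinct win count, so each output port is assigned exactly one input. The enumeration for the 9-sorter input In_5 yields 56 product terms with 5 wins, which agrees with the count stated for FIG. 8.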
The signals for these two sorters propagate only through the Comparison Signals Block 410 and the Output MUX Block 440, with the signal flow path identified as "1 2-sorter Equivalent" in FIG. 10.

The estimated propagation delay and the LUT resource usage values for the single-stage hardware N-sorters discussed in this embodiment are shown in the top half of the table of FIG. 12. The LUT resource usage values in this table assume that the data values are 8-bit unsigned integers. The data in the bottom half of the FIG. 12 table pertains to single-stage hardware rank order filters, discussed further below.

The first single-stage hardware sorter that requires SV code modification in this embodiment is the 4-sorter. In Row 3 of the FIG. 11 table, it can be seen that the 4-sorter and larger sorters have significantly more than 6 data input plus comparison signals. In order to implement an output multiplexer, none of these sorters fit the comparison result signals and the N input values into a single 6-input LUT. However, if In_X_goes_to_Out_Y signals are separately created and then used as the output MUX select line signals, it may be possible to fit all of the needed select line signals, plus the N input values, into a single output MUX LUT. This requires the input signal data flow to go through at least the 1st MUX Select Line Signals Block 420A shown in FIG. 10, so that In_X_goes_to_Out_Y signals can be defined from the comparison signals created in the Comparison Signals Block 410. To see if it is possible to fit all of the select line signals, plus the N input values, into a single 6-input LUT, refer to Row 5 of the FIG. 11 table. For the 4-sorter, this row of the table indicates that 7 such signals are needed, one more than can be fit into a 6-input LUT. At first glance, it appears that, for the 4-sorter, more than 1 6-input LUT will be needed per output value bit.

FIG. 13 shows SV code for the 4-sorter output port Out_2. In the SV code implemented in the 1st MUX Select Line Signals Block 420A, the functionality of the three select line signals, In_3_goes_to_Out_2, In_2_goes_to_Out_2, and In_1_goes_to_Out_2, is combined into 2 select line signals, In_3_OR_2_goes_to_Out_2 and In_3_OR_1_goes_to_Out_2. The *_goes_to_Out_2 truth table in FIG. 13 shows how this functionality is combined. The uncommented SV code in FIG. 13 shows the definitions of signals In_3_OR_2_goes_to_Out_2 and In_3_OR_1_goes_to_Out_2, how they are combined into 2-bit bus mux selects Out_2, and how this 2-bit bus is used in the final Out_2 assignment. Since there are only 6 signals in this final Out_2 assignment, each output bit multiplexer fits into a single 6-input LUT. Each output bit LUT is a 4-to-1 multiplexer, with 2 select lines and 4 data lines. The SV code for the 3 other 4-sorter output port assignments is written in the same way that the Out_2 code is written.

The input signals must now propagate through 3 slice logic blocks in series. The middle slice block is the 1st MUX Select Line Signals Block 420A shown in FIG. 10. Since the 4-sorter input signals must propagate through 3 slice logic blocks, the 4-sorter propagation time is estimated to be 1.5 2-sorter equivalent time units.

The outputs of the 2 LUTs are combined in a 2-to-1 multiplexer to produce the final bit multiplexer output.

SV code used to build the 5-sorter Out_2 bit multiplexers is shown in FIG. 14. Out_2 assignment code using the principles of the general hardware design embodiment is shown towards the top of the figure, but is commented out. The assignment has 5 data inputs, and 4 select line inputs, for a total of 9 inputs. The uncommented code below this shows how this assignment is modified and distributed to 2 LUTs and their common MUX Block. LUT_A is effectively a 2-sorter, and LUT_B a 3-sorter.

Simple SV behavioral code is used to define LUT_A and LUT_B, and this behavioral code defines 2 LUTs for each bit of the input data, i.e., 2 LUT_A and LUT_B LUTs for each bit of the input values. The outputs of these two LUTs are combined in the same slice logic block that contains them. Because of this, structural code is used to instantiate "primitives" in order to combine the outputs of the 2 LUTs. The primitive only handles signals with a bit width of 1, so an SV "generate" block is used to separately instantiate one primitive per bit of the output port values. The MUX select line signal is In_4_OR_3_goes_to_Out_2, and the code used to create it is shown in FIG. 15. With the inclusion of this signal, the select lines for the Out_2 output bit multiplexers created at the bottom of FIG. 15 now contain all of the functionality of the 4 select line signals shown in the commented Out_2 assignment at the top of the figure.

The creation of the In_4_OR_3_goes_to_Out_2 signal shown in FIG. 15 is similar to the creation of the two 4-sorter MUX select line signals shown in FIG. 13. However, the SOP equation for the 5-sorter In_4_OR_3_goes_to_Out_2 signal contains a total of 7 comparison signals, so this SOP equation cannot be fit into a 6-input LUT. A 7-input LUT can be created using 2 LUTs and their common MUX Block, shown in FIG. 15.

The general hardware embodiment equation for signal In_4_OR_3_goes_to_Out_2 is displayed inside SV comments at the top of FIG. 15. The In_4 and In_3 portions of this OR equation contain a common comparison signal, ge_4_3.
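The way three In_X_goes_to_Out_Y select signals collapse into two select lines for the 4-sorter can be modeled in software. The Python sketch below is an illustration only, not the SV code of FIG. 13: the helper names (`goes_to`, `four_sorter_output`, `exhaustive_check`) are hypothetical, and the ordering of the data lines on the 4-to-1 multiplexer is an assumption derived from the signal names, since the truth table of FIG. 13 is not reproduced in the text.

```python
from itertools import product

def goes_to(values, x, y):
    """Model of In_x_goes_to_Out_y: true when In_x wins exactly y comparisons.
    Ties favor the input with the larger numeric suffix (the ">=" rule)."""
    n = len(values)
    wins = sum(values[x] >= values[j] for j in range(x))
    wins += sum(not (values[i] >= values[x]) for i in range(x + 1, n))
    return wins == y

def four_sorter_output(values, y):
    """Out_y of a 4-sorter using 2 select lines and 4 data lines (6 inputs)."""
    in_3 = goes_to(values, 3, y)
    in_2 = goes_to(values, 2, y)
    in_1 = goes_to(values, 1, y)
    # At most one of in_3, in_2, in_1 is true, so two OR signals are enough.
    sel_3_or_2 = in_3 or in_2          # models In_3_OR_2_goes_to_Out_y
    sel_3_or_1 = in_3 or in_1          # models In_3_OR_1_goes_to_Out_y
    # Assumed encoding of the 2-bit bus: 11 -> In_3, 10 -> In_2, 01 -> In_1, 00 -> In_0
    mux_select = (int(sel_3_or_2) << 1) | int(sel_3_or_1)
    data_lines = (values[0], values[1], values[2], values[3])
    return data_lines[mux_select]

def exhaustive_check():
    """Compare the model with a reference sort for every 4-value list over 0..3."""
    for v in product(range(4), repeat=4):
        if [four_sorter_output(list(v), y) for y in range(4)] != sorted(v):
            return False
    return True

if __name__ == "__main__":
    print(exhaustive_check())
```

The final multiplexer sees 4 data values and 2 select bits, which is the 6-signal budget that lets each output bit fit in one 6-input LUT, at the cost of one extra series stage to form the two OR signals.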
The portions of the commented equation in which ge_4_3 is a 1 are broken out into a separate LUT equation, and the same is done for the portions of the equation in which ge_4_3 is a 0. The ge_4_3 term is removed from each modified equation, and then ge_4_3 is used as the MUX select line for the block that combines the two LUT signals.

Unlike the MUX instantiations shown in FIG. 14, the MUX instantiation shown in FIG. 15 is not placed inside an SV generate block. All of the MUX input and output signals in FIG. 15 are simple signals, with a default bit width of 1.

The discussion and figures referenced above use output port Out_2 as an example. The other 4 output ports are designed in a like manner. The input signals for this 5-sorter travel through 3 slice logic blocks in series, like those of the 4-sorter. So the propagation delay of the 5-sorter, also like the 4-sorter, is estimated to be 1.5 times the 2-sorter propagation delay.

The output bit multiplexers for the 6- and 7-sorters are similar to the 5-sorter multiplexers whose SV code is shown in FIG. 14. For a 6-sorter, both output bit multiplexer LUTs are effectively 3-sorters. The 7-sorter has one output bit multiplexer LUT that is effectively a 3-sorter, and one that is a 4-sorter.

The MUX select line signals for these two sorters are defined in an equation which ORs 3 In_X_goes_to_Out_2 signals. Behavioral code for these two MUX select signals, the 6-sorter's In_5_OR_4_OR_3_goes_to_Out_2 signal and the 7-sorter's In_6_OR_5_OR_4_goes_to_Out_2, is shown at the top of FIG. 16. All of the equations shown in FIG. 16 are used for output port Out_2. The equations for the signals used for other output ports are easily constructed in the same manner.

If these two signals are created directly using this behavioral code, then an additional series slice is needed in order to produce the OR signals. Instead, these two signals are created using additional slice resources not previously discussed, carry chain logic. The slice carry chain logic is used automatically by the synthesis tool when creating 2-value comparison signals, but this logic can also be used for other purposes, such as creating AND, OR functions of the 6-input LUT outputs.

It is posited that one skilled in the art can create a 3-LUT OR signal using the carry chain logic. When the carry chain logic is used, the slowest 6-sorter and 7-sorter signals still propagate through only 3 slices in series, just like the slowest 4-sorter and 5-sorter signals.

The output bit multiplexers for the 8-sorter are similar to those of 5-, 6-, and 7-sorters, as they all use 2 LUTs per output value bit. The output MUX select signal for the 8-sorter, In_7_OR_6_OR_5_OR_4_goes_to_Out_2, is an OR of 4 In_X_goes_to_Out_2 signals, and is shown in the middle of FIG. 16.

Row 6 of the FIG. 11 table shows that there are 7 comparisons in each In_X_goes_to_Out_Y product term for an 8-sorter. The 8-sorter's individual In_X_goes_to_Out_2 SOP signals require 2 LUTs and their associated MUX, so the carry chain logic cannot be used to produce the OR signal.

In this case, the 4-LUT OR signal is produced in an additional series slice, in FIG. 10's 2nd MUX Select Line Signals Block, and the slowest sorter signals now propagate through 4 slice blocks in series. The process for creating an 8-sorter's 7-input In_X_goes_to_Out_2 equation is essentially the same process that was shown for creation of the 5-sorter In_4_OR_3_goes_to_Out_2 signal, previously discussed and shown in FIG. 15.

Implementation of Hardware 9-Sorters Using 4 Logic Blocks in Series

As mentioned just above, the In_X_goes_to_Out_Y product terms for an 8-sorter require a 7-input LUT, which is created in a single slice logic block using two LUTs and their common MUX Block. There are 8 comparison signals in each 9-sorter In_X_goes_to_Out_Y product term, as is listed in Row 6 of the FIG. 11 table, so an 8-input LUT is required for these signals. As is shown in Row 7 in the sorter table, the 9-sorter's 8-input LUT requires the combination of 4 6-input LUTs in a single slice.

An example of how this is done uses the 9-sorter In_5_goes_to_Out_5 SOP equation shown previously in FIG. 8. This equation is broken up into 4 sections using blank lines. In each section, there is a specific paired state for signals ge_8_5 and ge_7_5. Each of these sections is now placed into a separate LUT signal, as shown in FIG. 17, and the ge_8_5 and ge_7_5 comparison signals are removed from the equations. Each equation now contains only 6 comparison signals, and therefore fits in a 6-input LUT.

The ge_8_5 and ge_7_5 signals are now used as MUX select line signals in order to combine the outputs of the four LUTs, as is shown in FIG. 18. This is the same type of process used to create the 5-sorter In_4_OR_3_goes_to_Out_2 signal, shown in FIG. 15, and the 8-sorter In_X_goes_to_Out_Y signals, except now there are two levels of MUX Blocks used to combine the LUT outputs.

Each of the 9-sorter's In_X_goes_to_Out_Y signals now requires 4 LUTs, which significantly increases a 9-sorter's resource usage. However, a 9-sorter's resource usage also increases due to another factor. Since there are now 9 data inputs for each output bit multiplexer, an output bit multiplexer no longer fits into 2 LUTs. At least 3 LUTs per bit are now required.

A portion of the 9-sorter 3 LUT design for Out_2 is shown in FIG. 19. Once again, output port Out_2 is used as an example. All of the other output ports are designed in a similar manner. This design uses all of the logic in a 4-LUT slice logic block. Output MUX select line signals for this design are shown at the bottom of FIG. 16. Since these OR signals are created in the 2nd MUX Select Line Signals Block, shown in FIG. 10, the input signals for this 9-sorter design propagate through 4 slice logic blocks in series.

Although only 3 LUTs are used to produce output bit signals in this design, the design appears to use all 4 slice logic LUTs. Row 11 in the FIG. 11 sorter table notes that this use of 3 LUTs in a slice logic block may effectively monopolize the use of all 4 LUTs.

As noted earlier, the propagation delay and hardware resource usage values for the 2-sorter up to 9-sorter designs, implemented using the 4-LUT slice logic block, are shown in the top half of FIG. 12. The LUT resource numbers in this table for the 9-sorter assume that all 4 LUTs in each output multiplexer slice block are used.

Until now, this set of embodiments has focused on designs in which the primary logic portions of hardware sorter designs are implemented in, and take advantage of, a 4-LUT slice logic block such as that found in multiple Xilinx FPGA product families. In two other Xilinx FPGA product families, Ultrascale and Ultrascale+, Xilinx provides an 8-LUT slice logic block.

An 8-LUT slice logic block is essentially a combination of two 4-LUT slice logic blocks, plus one additional 2-to-1 multiplexer, which combines MUX outputs of two 4-LUT logic block groups. As provided above, all of the 4-LUT sorter designs discussed above can be implemented in this 8-LUT slice logic block as well. Designs that can only be met with an 8-LUT slice logic block are now discussed.

Row 7 in the FIG. 11 sorter table shows that a 10-sorter requires 8 LUTs in a slice block for each In_X_goes_to_Out_Y product term. Only an 8-LUT slice block can be organized as the 9-input LUT needed for these signals.

Fitting the 9-input LUT signals into the 8-LUT slice block uses the same basic procedure used to fit the 9-sorter's 8-input LUT signals into a 4-LUT slice. The 9-sorter procedure was previously discussed and referenced FIGS. 8, 17, 18. For the 10-sorter, 3 comparison signals are removed from each In_X_goes_to_Out_Y product term, and these 3 signals are used as the MUX select lines.

The 10-sorter output bit multiplexers are implemented using 3 LUTs in a slice. As with the 9-sorter output bit multiplexers, a MUX Block is required for such a design, so it is reasonable to assume that this design monopolizes all 4 LUTs whose outputs ultimately feed into the MUX Block.

Using the 8-LUT slice logic block, it is possible to construct a 4-sorter in which the input signals propagate through only 2 FIG. 10 logic blocks, just like the 2-sorter and 3-sorter input signals.

FIG. 20 displays behavioral code for output port Out_2 indicating how this 4-sorter is designed. The FIG. 20 code is developed by initially creating all 24 (4 factorial) permutations of the distinct numbers 3, 2, 1, and 0, and treating each permutation as a 4-sorter input list. The states of the 4-sorter's 6 comparison signals are determined for each of the 24 permutations.

blocks shown in FIG. 10, as the input signals for these N-sorters already have the minimum possible propagation delay. The hardware N-sorters which already have the minimum possible propagation delay are the 2-sorter and the 3-sorter, when designed with either of the slice logic blocks.

Single-stage max and min filters for N ≥ 4 values have reduced propagation delay because the In_X_goes_to_Out_Y SOP equations for the max and min output values are unique. These SOP equations contain only one product term. Therefore, only one state of a component
comparison signal is possible in an In_X_goes_to_Out_Y equation when Out_Y is the min or max value in the output list. Furthermore, when a given comparison signal is found in a 2nd In_X_goes_to_Out_Y equation for the same min or max Out_Y, the state of this comparison signal in the 2nd equation will always be the opposite state from that found in the 1st equation.

Examples of these unique max and min SOP equations are shown in FIG. 21, which shows SV pseudocode for both 4-max and 4-min hardware filters. The In_X_goes_to_Out_Y equations are commented out, since the In_X_goes_to_Out_Y signals themselves are not used. Rather, the comparison signals are used directly to create the output bit multiplexers.

SV pseudocode shows SV equations, but without "assign" statements and "wire" declarations. Behavioral 2-to-1 multiplexer pseudocode is used in place of generate blocks and structural instantiations. The example SV code referenced in the application permits one skilled in the art to use the behavioral pseudocode examples referenced in this embodiment set to build successful rank order hardware designs.

As mentioned above, the propagation delay and hardware resource usage values for the 2-max up to 9-max filter designs, implemented using the 4-LUT slice logic block, are listed in the bottom half of FIG. 12. The details of these N-max designs, starting with the 4-max design, are now described, followed by N-max designs that require the use of an 8-LUT logic block.

The equations found in FIG. 21 that are used to create the 4-max and 4-min outputs show the unique characteristics of the min and max In_X_goes_to_Out_Y SOP equations. These unique characteristics allow min and max filters to be easily implemented using the comparison result signals directly, in combination with the slice 2-to-1 MUXF* multiplexers and ternary/conditional notation for LUT equations. Although the 4-sorter input signals propagate through 3 of the logic blocks shown in FIG. 10, the 4-max and 4-min input signals only propagate through the minimum 2 blocks, the Comparison Signals Block and the Output MUX Block.

Note that the 4-min comparison signals in the In_X_goes_to_Out_Y equations are the same as found in the 4-max equations, but the 4-min comparison signals always have the opposite state from the states found in the 4-max equations. For the larger filters discussed in the rest of this embodiment set, only N-max filters will be defined. One skilled in the art will have no problem creating a comparable N-min filter using the N-max equations.

An N-max compact table is shown in commented lines below the final Out_3 equation in FIG. 21. This table shows

For a given output port, 8 LUT equations are created, one for each permutation of the 3 comparison signals ge_3_2, ge_2_1, and ge_1_0. The comparison signals available for each LUT equation are the other 3 comparison signals, ge_3_1, ge_3_0, and ge_2_0. Finally, these 8 LUT equations are combined using 2-to-1 multiplexers, with the comparison signals ge_3_2, ge_2_1, and ge_1_0 used as MUX select lines.

The single-stage hardware sorters discussed above pertain to a full N-to-N sort of N input values. In an N-to-N hardware sorter, all N values become output values, except that now they are in a stable sorted order.

Single-Stage Hardware Rank Order Filters

Now, single-stage N-to-M hardware sorters are discussed, in which M < N. In other words, only the output ports for certain rank positions in the sorted list are created in the hardware. These types of sorters are often called rank order filters. Rank order filters often produce only a single output (max-filters, min-filters, median-filters), but can produce several outputs, such as a lowest-2-of-5-values filter.

FIG. 24 illustrates a flow chart 350 for creating an N-to-M filter. At step 352, a hardware N-sorter is provided. At step 354, all of the unused outputs are removed, as well as all of the logic that was only used for the removed outputs. At step 356, a single-stage hardware N-to-M filter is created. All of the N*(N-1)/2 comparison signals are still required. At its simplest, an N-to-M hardware filter has reduced hardware usage, but the same propagation delay as the full N-to-N hardware sorter.

An N-median filter always has approximately the same propagation delay as the full N-sorter, as the In_X_goes_to_Out_Y SOP equations for the median value in an N-sorter, with N odd, always have both states of each comparison signal in its various product terms. Examples of single-stage hardware N-median filters, which are easily created from the associated N-sorter, are 3-median, 5-median, 7-median, and 9-median filters.

Single-stage hardware N-median filters are important in applications to reduce noise. For example, finding the median of 9 values may be a task used to reduce noise in 3x3 pixel windows in images. This is normally implemented in multistage networks of 2-sorters, but can now be performed faster using a single-stage 9-median hardware filter created from a hardware 9-sorter.

In the bottom half of FIG. 12, propagation delay and LUT resource usage data for single-stage hardware N-max filters is listed, for filters implemented in a 4-LUT slice logic block. The propagation delay and hardware resource usage of a 9-median filter is also listed, as the 9-median values match those of the 9-max filter.
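The N-to-M reduction of FIG. 24 can be mimicked in software by keeping only the output ranks of interest. The Python sketch below is a behavioral model under stated assumptions, not the SV pseudocode of FIG. 21: the function names (`rank_order_filter`, `median9`, `max_filter`, `is_max`) are hypothetical, and rank 0 is assumed to be the smallest value. It also shows why the max output needs only a single product term: an input is the maximum exactly when it wins all N-1 of its comparisons.

```python
def rank_order_filter(values, ranks):
    """Keep only the outputs whose rank is listed in `ranks` (an N-to-M filter).
    All N*(N-1)/2 comparison signals are still computed, as in the full sorter."""
    n = len(values)
    ge = {(i, j): values[i] >= values[j] for i in range(n) for j in range(i)}
    kept = {}
    for x in range(n):
        wins = sum(ge[(x, j)] for j in range(x))
        wins += sum(not ge[(i, x)] for i in range(x + 1, n))
        if wins in ranks:               # logic for unused outputs is dropped
            kept[wins] = values[x]
    return [kept[r] for r in ranks]

def median9(window):
    """9-median filter, e.g. for a flattened 3x3 pixel window."""
    return rank_order_filter(window, (4,))[0]

def max_filter(values):
    """N-max filter: only the highest rank is kept."""
    return rank_order_filter(values, (len(values) - 1,))[0]

def is_max(values, x):
    """Single product term for the max output: In_x wins every comparison."""
    n = len(values)
    return (all(values[x] >= values[j] for j in range(x))
            and all(not (values[i] >= values[x]) for i in range(x + 1, n)))

if __name__ == "__main__":
    print(median9([10, 10, 10, 10, 255, 10, 10, 10, 10]))
    print(rank_order_filter([5, 3, 8, 1, 9], (0, 1)))
```

In the first usage line, a single bright outlier in an otherwise uniform 3x3 window is replaced by the surrounding value, which is the noise-reduction use mentioned above. The second line models a lowest-2-of-5-values filter.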
The equivalence of the 9-max and 9-median data is emphasized using shading in FIG. 12.

When using a slice logic block, it is possible to create N-max and N-min hardware filters that are faster than the associated N-sorter. The propagation delay improvement is not possible for hardware filters when the input signals for the full hardware N-sorter only travel through 2 of the logic

which comparison signals and signal states direct a particular input to the max output port. This type of table is used by itself to guide the design of any hardware N-max filter, and this is exactly what has been done when creating the equations for larger N-max filters in the remainder of these embodiments.

FIG. 22 shows how a 5-max filter is created, in a manner similar to creation of a 4-max filter shown in FIG. 21. Commented In_X_goes_to_Out_Y equations are no longer shown, and they are replaced by a compact table which displays the same information. The 5-max design uses all of the resources in the 4-LUT slice logic block. The input signals for this 5-max filter propagate through only the minimum 2 logic blocks shown in FIG. 10, so the 5-max filter is estimated to have the same propagation delay as a 2-sorter.

SV pseudocode for an 8-max hardware filter is shown in FIG. 23. The input signals for this hardware 8-max filter also propagate through only 3 of FIG. 10 logic blocks in series, even though 8-sorter input signals propagated through all 4 of the logic blocks in series. However, the 8-max filter output bit multiplexers now require 3 LUTs per output bit. Because the slice logic MUXF8 block is used in this design, it is reasonable to assume that the design effectively uses all 4 slice LUTs per output bit.

Definitions of the 2 In_X_goes_to_Out_7 signals, and the 4 In_Xa_OR_Xb_goes_to_Out_7 signals are not shown. However, a skilled designer will be able to create the definitions of these signals, based on the previous example code.

Single-Stage N-Max Hardware Filters Using 8-LUT Slice Logic Blocks are now discussed. A 6-max design is implemented in an 8-LUT slice logic block using ge_5_4, ge_3_2, and ge_1_0 as the mux select lines. The inputs for such a design propagate through only the 2 minimum logic blocks shown in FIG. 10, and therefore this 6-max filter has an estimated propagation delay that is the same as that of a 2-sorter. The details of such an 8-LUT 6-sorter design are left to one skilled in the art, using the 4-LUT 4-max and 5-max design principles described above, and shown in FIG. 21 and FIG. 22.

The 9-max filter design using an 8-LUT slice block is very similar to the design using a 4-LUT slice block, and very similar to the full 9-sorter design. However, the slowest signals for the design using the 8-LUT block propagate through only 3 slice blocks in series, versus 4 series slices for the 4-LUT design. In the common design, there are two OR-of-3 signals used as output mux select signals. The bottom two signal definitions in FIG. 16 show examples of how these signals are created for a 9-sorter, and the bottom section of FIG. 19 shows how they are used. When using a 4-LUT slice block for a 9-max design, these signals are created in the 2nd MUX Select Line Signals Block at the bottom left in FIG. 10. However, when using an 8-LUT slice block for a 9-max design, these signals, the slowest signals in the 9-max design, are created in the 1st MUX Select Line Signals Block. Therefore, the slowest 9-max signals, using an 8-LUT slice block, now propagate through only 3 series slices.

Multiway Merge Sorting Networks

According to the invention, if a UCMS network merges k sorted lists, then single-stage hardware 2-sorters up to k-sorters will be connected in the UCMS network. The use of carefully designed single-stage hardware N-sorters, which sort 3 or more values at a time, is what allows a UCMS multiway merge sort network to operate faster, sometimes using fewer hardware resources, than 0-EMS networks. The systematic design of the UCMS networks incorporates the single-stage hardware sorters described above.

When designing a merge sort process, UCMS combines the input sorted lists as columns in a 2-d rectangular structure, and then performs a sequence of operations on the rectangular structure, in order to produce a single sorted list in the rectangle. The number of sorted lists to be merged is therefore called Ncols, the number of columns in each rectangle.

The final sorted order for a 4-column, 8-row UCMS rectangle is shown in FIG. 25. There are 32 distinct values in this rectangle, 32 down to 1. The UCMS sorted order is in a row major order, with the maximum list value at the top left, and the minimum value at the bottom right of the rectangle.

FIG. 26 provides a table of notations for UCMS rectangles and the overall UCMS multiway merge sort network. The columns in a UCMS rectangle are numbered from (Ncols - 1) in the leftmost column to 0 in the rightmost column. The rows in the Sequence q rectangle are numbered from (Nrowsq - 1) in the top row to 0 in the bottom row. The maximum value in each sorted column is found at the top, in row (Nrowsq - 1), and the minimum value is found down in row 0. Likewise, the maximum value in a sorted row is found to the left, in column (Ncols - 1), and the minimum value is found to the right, in column 0. In principle, lists of any length Nfinal can be merge sorted in a UCMS network, whenever Nfinal > Ncols.
However, in to simplify the discussion of UCMS networks, a " standard ” UCMS network is defined, one which satisfies Equation ( 1 ) below with all four parameters being positive integers: Nfinal = Nrows, * Ncols @ final Equation ( 1 ) The four parameters in Equation ( 1 ) are all positive integers, and they are defined in the FIG . 26 . The UCMS sorting network discussions that follow will primarily reference a 4 - column standard UCMS example, in which Nfinal = 32; Ncols =4 ; Nrows = 2 ; qfinal = 2. FIG . 27 shows the sort operations in Sequence 0 of this 4 - column example. The merge sort sequences for the 4 - column UCMS example , Sequence 1 and Sequence 2 , are shown in FIG . 28 and FIG . 29 , respectively. Standard UCMS 3 - column and ( prior art O - EMS ) 2 -col A group of sorting networks, and the equations and umn examples are also shown, in FIG . 30 and FIG . 31 , algorithms needed to build such networks is referred to as an 55 respectively. The 3 - column example parameters are Nfi Unified Column Merge Sort, or UCMS for short. A UCMS nal = 9 ; Ncols =3 ; Nrows. = 3; qfinal = 1, and the 2 - column 2 sorting network will be built in hardware, presumably using a type of hardware such as those designed using a Hardware Description Language ( HDL ) . example parameters are Nfinal = 8; Ncols =2 ; Nrows, = 2 ; qfi nal =2 . In these figures, a sequence of arrows in a single line The UCMS sorting networks use merge sort algorithms, 60 identifies a group of values to be sorted, and then placed which merge 3 or more sorted lists of values into a single back into the same rectangle locations, but now in sorted sorted list . The UCMS system can also be used to build order. The Sequence 0 arrows indicate column sorts , where sorting networks which merge 2 sorted input lists in a single all values in each column are sorted . Merge sort sequence sorted output list . The main advantage of the UCMS system arrows either indicate a row sort or a diagonal sort. 
For a row is in its ability to create fast and resource - efficient multiway 65 sort, all selected values are in the same row . For a diagonal merge sort networks, in which 3 or more sorted lists are sort, which will be discussed in more detail later on, the selected values are all in different rows and columns . merged into a single sorted list . 17 US 11,360,740 B1 The arrows point from the location where the minimum value will be placed toward the location where the maximum value will be placed . For a sort group of locations 18 network itself. The streaming interface block would be used to transfer data back and forth between a host computer and the UCMS network constructed in hardware. along a diagonal , the minimum sorted value will be put in FIG . 33 suggests that a list of unsorted data is streamed the bottom left diagonal location , at the arrow base , and the 5 into the hardware from aa host computer, and the list of sorted maximum sorted value will be placed in the upper right data is then streamed back out from the hardware to the host diagonal location, at the arrow point. In a sort group of computer. However, the input list of data to be sorted may locations for a row sort, the minimum sorted value will be already reside in memory located in the hardware or directly put in the farthest right arrow location , the arrow base , and accessible to it . The UCMS output list of sorted data may the maximum value will be put into the farthest left arrow 10 also be written to memory inside the hardware or accessible location, at the arrow point. For the Sequence 0 column sort, to it . the minimum sorted location and the arrow base is at row 0 ; FIG . 34 displays the algorithm which shows the top level the maximum value and the arrow point is at the maximum UCMS network flow , from the input 1 - d unsorted list of Sequence O row location, Nrows , -1. values to the output 1 - d sorted list of those same values . 
The After sorting, the sorted minimum value will go to the 15 standard flow begins with the set of parallel hardware sorts leftmost location in a sort group , and the sorted maximum in Sequence 0 , and then progresses through a series of merge value will go to the rightmost location in the sort group . sort sequences, until the final 1 - d sorted list has been There is one diagonal sort group of 4 values shaded in FIG . produced 29 . As specified in the FIG . 34 algorithm , the 2 - d array of A UCMS sorting network always contains at least one 20 values in Sequence 0 has Nrows , rows, and (Nfinal/Nrows.) merge sort sequence , and it may contain several. The num- columns. Each column of the 2 - d array is then sorted with ber of merge sort sequences in a standard UCMS network is an Nrowso - sorter. given by positive integer parameter final . Since a merge sort After Sequence 0 , the algorithm shown in FIG . 34 loops sequence requires sorted input lists , there must be aa mecha- through each of the merge sort sequences, numbered 1 to nism to create the initial sorted lists . It is assumed that 25 qfinal. In each merge sort sequence, the single input 2 -d hardware NrowSo -sorters are used to create the initial sorted array has Nrowsq rows and (Nfinal/Nrowsq) columns, with lists , in a stage called Sequence 0. Sequences 1 and higher each column of data sorted from a maximum at row will always be merge sort sequences. Nrowsq - 1 to a minimum at row 0. In Sequence q , each Sequence 0 for the 4 - column UCMS example is shown in successive set of Ncols columns in this input 2 - d array is FIG . 27. As is shown in the first row of the table in FIG . 32 , 30 then split off from it and used to form a rectangle , with there are Nfinal/Nrows, hardware sorters in Sequence 0 , and Nrowsq rows and Ncols columns in the rectangle. The each single - stage sorter is an Nrows , -sorter. 
For the 4 -col- number of rectangles in each Sequence q is : umn example, there are then (32/2 ) = 16 2 - sorters in Num_rectanglesq = Nfinal/(Nrowsq * Ncols). Sequence 0. For the 3 - column example shown in FIG . 30 , there are ( 9/3 ) =3 3 - sorters in Sequence 0. Sequence 0 for the 35 In the final sequence, Sequence final, there is only 1 2 - column example shown in FIG . 31 has ( 8/2 ) =4 2 - sorters . rectangle. As shown in the FIG . 34 algorithm , Once again , for each of the 3 UCMS examples, each Nrows, = Nrows, when q = 1 and Nrowsq = Nrowsq_1 * Ncols column of the Sequence 0 2 - d array is sorted by a hardware when q> 1 . These equations can be combined in the second Nrowso -sorter. After sorting, the column values remain in row , Nrowsq column, of the FIG . 32 table , for q21 . The the same column, but are now in sorted order, with the 40 combined equation is Nrows. * Ncolsq - 1 maximum value in row (Nrowse - 1 ), and the minimum value Also shown in the FIG . 34 algorithm , in row 0 . Num_rectangles , = Nfinal/ (Nrows, * Ncols ) when q = 1, and The direct sort Sequence 0 is a single - stage sequence . Num_rectanglesq = Num_rectanglesq - 1 /Ncols when q> 1 . Merge sort sequences have 2 or more stages . In each “ stage ” , These two equations can be combined , when qz1 , and this all of the sort operations are performed in parallel, using 45 combined equation is Nfinal/ (Nrows. * Ncols ? ) . hardware sorters. Historically, a sorting network stage After the single - stage Sequence 0 , the first merge sort always had the propagation delay of a 2 -sorter, since only sequence is called Sequence 1. If NfinalsNcols ?, Sequence 2 - sorters were used in each stage . UCMS stages typically 1 is also the last merge sort sequence. 
Sequence 1 is a contain hardware sorters other than 2 - sorters, and the stage template for any merge sort sequences after Sequence 1 , as propagation delay is the propagation delay of the slowest 50 all of the stages in Sequence 1 are found in any later hardware sorter in the stage . To standardize stage propaga- sequence . tion delay values , the propagation delay of the slowest Note that Sequence 1 is the last sequence in the Ncols =3 hardware sorter is referenced to the propagation delay of a FIG . 30. In this example, Nfinal = 9sNcols2= 32 = 9, SO 2 - sorter in as reasonable manner as is possible . Sequence 1 is the last merge sort sequence. FIG . 30 shows For the 3 standard UCMS examples, all of the Sequence 55 that the single Sequence 1 rectangle is in correct sorted order O single - stage hardware sorters are either 2 -sorters or 3-sorters , both of which have essentially the same propagation after the last Sequence 11 stage . FIG . 35 shows the Sequence 1 stages for sorting networks with Ncols =2 to 9 , reading network constructed in hardware . The UCMS network itself row delta and column delta , when moving from one selected Interface to Host Computer block is not a part of the UCMS corner, at row 0 , column ( Ncols - 1 ) . If the row delta and delay when using hardware design blocks with 6 - input down the appropriate column. In the first stage in any merge LUTs, such as the 4 - LUT slice logic block discussed above . sort sequence, each row in each rectangle is sorted . Any Therefore, when using the 4 - LUT slice logic block , all of the 60 stage after the initial row sort stage contains “ diagonal” sort example Sequence 0 stages have a propagation delay operations. In a diagonal sort stage , values to be sorted in a equivalent to the propagation delay of 1 2-sorter. hardware sorter are selected along a diagonal in the rect Refer to FIG . 33 , which gives a top level view of a UCMS angle. 
Each of the diagonals for a given stage has a specific encompasses the “ UCMS Sorting Network Top Level” 65 value to another selected value along the diagonal . block and the blocks connected below it . The Streaming There is always a diagonal starting from the bottom left 19 US 11,360,740 B1 column delta values ( R / C ) are both 1 ( 1/1 ) , then the next selected value will be at row 1 , column (Ncols - 2 ) . If Ncols and Nrowsq are both >2 , then the next selected value will be at row 2 , column (Ncols - 3 ) , and so on . Given a specific R / C value set , all possible diagonals are defined , and the values along each diagonal are sorted . In Sequence 1 , the stage that follows the initial row sort stage is always an R / C 1/1 diagonal stage . In FIG . 31 , there are only 2 stages in Sequence 1 , the initial row sort stage and the R / C 1/1 diagonal stage . This matches the Ncols =2 column in the FIG . 35 table . When (Ncols >2 ) in Sequence 1 , there are additional diagonal stages after the R / C 1/1 stage . Each additional stage has a constant row delta of 1 , and the column delta increments by 1 , relative to the previous stage . The next stage after the R / C 1/1 stage is then an R / C 1/2 stage . If there is a stage after the R / C 1/2 stage , it will be an R / C 1/3 stage , and so on . The last stage in any sequence has an R / C value of 1 / (Ncols - 1 ) . This behavior is easy to see in FIG . 28 and FIG . 30. In the Ncols = 3 example FIG . 30 , there are 3 stages in Sequence 1 , and the last stage has an R / C diagonal of 1/2 . In the Sequence 1 FIG . 28 for the Ncols =4 example, there are 4 stages in Sequence 1 , and the last stage has an R / C diagonal of 1/3 . It should also be clear from these examples and the data in the FIG . 35 table that the number of stages in Sequence 1 is equal to Ncols . The last row in the FIG . 35 table is labelled “ Final Row Sort " . 
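The Sequence 1 stage pattern, a row sort followed by R/C 1/1 up to R/C 1/(Ncols - 1) diagonal stages, can be sketched in a few lines of Python. Both function names are hypothetical. Only the single diagonal that starts at the bottom left corner is enumerated; the full set of diagonals, passthroughs and IntRows sorters produced by the FIG. 44 algorithm is not reproduced here.

```python
def sequence1_stages(ncols):
    """Sketch: ordered stage labels for Sequence 1 (illustrative helper)."""
    # One row sort stage, then diagonal stages R/C 1/1 ... 1/(Ncols - 1)
    return ["row sort"] + ["R/C 1/%d" % c for c in range(1, ncols)]


def corner_diagonal(nrows, ncols, row_delta, col_delta):
    """Sketch: (row, column) locations on the diagonal that starts at the
    bottom left corner, row 0, column Ncols - 1."""
    row, col = 0, ncols - 1
    locations = []
    # Move up by row_delta and right by col_delta until leaving the rectangle
    while row < nrows and col >= 0:
        locations.append((row, col))
        row += row_delta
        col -= col_delta
    return locations


if __name__ == "__main__":
    print(sequence1_stages(4))
    print(corner_diagonal(8, 4, 1, 1))
```

The number of labels returned by `sequence1_stages` equals Ncols, matching the stage count stated for Sequence 1.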
The information in this row indicates whether there is an "IntRows" sort in the last sequence stage. An IntRows sort is a sort of the internal values in each internal row of the sequence rectangles. The internal values in a row are the values in column Ncols - 2 down to column 1, all columns except the leftmost and rightmost columns. The internal rows are rows Nrows_q - 2 down to row 1, all rows except top and bottom rows. An IntRows sort is required whenever Ncols is an even number > 2, and Nrows_q > 2. When Ncols = 4, Ncols is obviously an even number > 2. However, in the Ncols = 4 Sequence 1 example shown in FIG. 28, Nrows_q = Nrows_0 = 2. There are no internal rows, so there is no IntRows sort in Stage 3, the final Sequence 1 stage.

An IntRows sort is shown in Stage 5 of the 4-column Sequence 2 FIG. 29. In this case, Nrows_2 = 8 and there are 6 internal rows. In each of the 6 rows, a 2-sorter is used to sort the values found at columns Ncols - 2 = 2 and Ncols - 3 = 1.

As mentioned above, Sequence 1 is a template for any additional merge sort sequences. Any Sequence q, where q ≥ 2, will have the same stages as Sequence 1, plus 1 or more extra stages. This means that Sequences 2 and higher will include all of the stages shown in the FIG. 35 table. The extra stages for Sequences 2 and higher are inserted after the initial row sort stage, and before the R/C 1/1 diagonal stage.

Going from Sequence 1 to Sequence 2, the number of rows in a rectangle is multiplied by Ncols, and the number of rectangles is divided by Ncols. For the 4-column UCMS example, there are 2 rows shown in each Sequence 1 rectangle shown in FIG. 28, and there are 2 * Ncols = 2 * 4 = 8 rows in each Sequence 2 rectangle shown in FIG. 29. There are 4 Sequence 1 rectangles, as shown in FIG. 28, and there is 4/Ncols = 4/4 = 1 rectangle in Sequence 2, as shown in FIG. 29, so Sequence 2 is the last sequence for the 4-column example.

For the 2-column UCMS example, as shown in FIG. 31, there are 2 rows in each Sequence 1 rectangle and there are 2 * Ncols = 2 * 2 = 4 rows in each Sequence 2 rectangle. There are 2 Sequence 1 rectangles and there is 2/Ncols = 2/2 = 1 rectangle in Sequence 2, so Sequence 2 is also the last sequence for the 2-column example.

In the 4-column UCMS example Sequence 2, there is one extra stage relative to Sequence 1, as can be seen when comparing FIG. 28 and FIG. 29. The extra stage is Stage 2, which is inserted between the Stage 1 row sort stage, and the R/C 1/1 Stage 3. Stage 2 has an R/C diagonal value of 2/1; there is a row delta of 2 and a column delta of 1 between successive diagonal location selections.

In the 2-column UCMS example Sequence 2, there is also one extra stage relative to Sequence 1, as shown in FIG. 31. Once again, the extra stage is Stage 2, which is inserted between the Stage 1 row sort stage, and the R/C 1/1 Stage 3. The 2-column Stage 2 also has an R/C diagonal value of 2/1.

Once Nrows_q is known for a Sequence q, with q > 1, the number of extra stages and the row delta for the first extra stage are calculated. The row delta for the first extra stage, Stage 2 in the sequence, is also the maximum row delta in the sequence diagonal stages. The extra stage calculations are shown below:

$$\mathrm{Number\_Extra\_Stages}_q = \left\lceil \log_2\!\left(\frac{\mathrm{Nrows}_q}{\mathrm{Ncols}}\right) \right\rceil$$

$$\mathrm{Maximum\_Row\_Delta}_q = \mathrm{Stage\_2\_Row\_Delta}_q = 2^{\mathrm{Number\_Extra\_Stages}_q}$$

For Sequence 2 in the 4-column example, these equations evaluate to:

$$\mathrm{Number\_Extra\_Stages}_2 = \left\lceil \log_2(8/4) \right\rceil = 1$$

$$\mathrm{Maximum\_Row\_Delta}_2 = \mathrm{Stage\_2\_Row\_Delta}_2 = 2^{1} = 2$$

And for Sequence 2 in the 2-column example, these equations are nearly identical:

$$\mathrm{Number\_Extra\_Stages}_2 = \left\lceil \log_2(4/2) \right\rceil = 1$$

$$\mathrm{Maximum\_Row\_Delta}_2 = \mathrm{Stage\_2\_Row\_Delta}_2 = 2^{1} = 2$$

Sequence 2 in the 4-column and 2-column examples only had one extra stage, with a row delta of 2. If a merge sort sequence has several extra stages, the row delta is divided by 2 for each successive stage after Stage 2. As mentioned above, the final extra stage always has a row delta of 2. The last extra stage for any merge sort Sequence q, with q > 1, will always have an R/C value of 2/1.

In the 4-column and 2-column UCMS examples, the last sequence was Sequence 2. In both cases, Sequence 2 had only 1 extra stage, when compared to the associated Sequence 1. The table in FIG. 36 lists parameters and the stage execution order for a more comprehensive example. The data in the table has been calculated for a UCMS network with Nfinal = 243 = 3^5; Ncols = 3. There are 24 stages in this network flow, and the stage order is indicated using the numbers in the columns in the right portion of the table, starting with the Sequence 0 "Sort All Cols" column. The shaded stages in this table are the extra stages for Sequences 2, 3, and 4.

A UCMS Ncols = 2 merge sort algorithm operates on rectangles in which the 2 columns are constructed from the 2 sorted input lists. In O-EMS, the two sorted input lists are split into odd and even lists. The odd and even lists are separately sorted, and then merged together in the last sequence stage.

The equivalence of the two algorithms is displayed in the 2-column example shown in FIG. 31. In this figure, the even lists consist of the rectangle locations with even row numbers, which are shaded, and the odd lists are the rectangle locations with odd row numbers.

In the 2-column Sequence 1, the first stage is the row sort stage, in which even and odd rows are separately sorted. The last stage of the 2-stage Sequence 1 is the R/C 1/1 diagonal stage. This is the stage in which the sorted odd and even lists are merged together.

In the 2-column Sequence 2, there is an extra stage between the row sort stage and final R/C 1/1 stage. This intermediate stage is a diagonal stage with an R/C value of 2/1. Notice that in this intermediate stage, the sort operations only occur between values in the same odd or even list. In the final stage, which is once again the R/C 1/1 stage, the sorted odd list and the sorted even list are merged together. This short example does indicate that the O-EMS and UCMS Ncols = 2 algorithms are the same.

FIG. 37 shows the sequence and stage flow for a non-standard UCMS sorting network example with Nfinal = 8; Ncols = 3; Nrows_0 = 2, 3, 3; qfinal = 1. This sorting network is derived from the standard UCMS Ncols = 3 example, shown in FIG. 30. Effectively, the upper left rectangle location is removed from the FIG. 30 example in order to produce the FIG. 37 flow.

With the upper left rectangle location now gone, Sequence 0 in FIG. 37 is modified versus FIG. 30, since there are only 2 3-sorters, and 1 2-sorter used in the FIG. 37 Sequence 0. Stage 1 in Sequence 1, the row sort stage, is also modified, and for the same reason.

The unsorted input list of 8 values for FIG. 37 is the same that was used for the standard Ncols = 2 flow in FIG. 31. When comparing the two figures, it is clear that 6 stages are needed for the standard Ncols = 2 flow, but only 4 stages are needed for the non-standard Ncols = 3 flow, while sorting the same set of 8 values. As noted earlier, stages with 3-sorters have the same propagation delay as stages with 2-sorters, when the design is implemented using hardware with 6-input LUTs. The non-standard 3-column UCMS sorting network has a speedup of 6/4 = 1.5 versus the state-of-the-art O-EMS sorting network, which is identical to the UCMS 2-column sorting network.

The O-EMS/2-column sorting network uses 19 2-sorters in its Nfinal = 8 sorting network, as shown in FIG. 31. Also as noted earlier, a 3-sorter uses 1.8 times the resources of a 2-sorter, when designing with 6-input LUTs. In the non-standard Ncols = 3; Nfinal = 8 sorting network, as shown in FIG. 37, there are 6 2-sorters and 5 3-sorters. So the total equivalent 2-sorter resources in this network is 6 + (5 * 1.8) = 6 + 9 = 15. Even though the non-standard 3-column UCMS sorting network has a speedup of 1.5 versus the state-of-the-art O-EMS network, the O-EMS network uses 19/15 = 1.27 times the resources of the faster, non-standard Ncols = 3 sorting network.

Standard UCMS sorting networks have been designed using automated network generation software for a number of sorting networks. Examples of UCMS SV source code are provided. FIG. 38 shows top level SV code for the UCMS 4-column example with 8-bit unsigned values. This code effectively creates the "UCMS Sorting Network Top Level" block shown in FIG. 33. The SV module instantiates the 3 sequence modules, and passes signals from Sequence 0 to Sequence 1, and from Sequence 1 to Sequence 2. In addition, in the generate block, the 1-d input list is translated to the 2-d array needed by the Sequence 0 module, and the final 2-d Sequence 2 output array is translated to the 1-d sorted output list.

FIG. 39 shows the simple Sequence 0 SV code for the UCMS 4-column example. Inside the generate block, the 16 Sequence 0 2-sorters are instantiated, one per column of the 2 row by 16 column Sequence 0 2-d data array.

FIG. 40 shows the Sequence 1 SV code for the UCMS 4-column example. A generate block is used to instantiate the series of 4 stage modules, which is performed once for each of the 4 Sequence 1 rectangles. Stage rectangle output data easily becomes the rectangle input data for the following stage. Two levels of "for" loops are used to split off the sequence input data into groups of 4 columns, which become rectangle input data for the first row sort stage. These for loops also transfer the sorted rectangle output data from the last stage into a 1-d output list, which becomes 1 column in the sequence output 2-d array. The Sequence 2 SV code shown in FIG. 41 creates Sequence 2 hardware in the same manner that the code in FIG. 40 did that for Sequence 1.

FIG. 42 contains the SV code for the initial Sequence 1 stage, the row sort stage. A generate block is used to instantiate one 4-sorter per rectangle row, in the same way that a generate block in the FIG. 39 code was used to instantiate one 2-sorter per column of the Sequence 0 2-d array.

The SV code for a diagonal stage tends to be more complex. Each diagonal sorter is instantiated separately, not in a generate block. It is possible that a generate block may be used for some diagonal stages, but that is not discussed here. Not all rectangle locations are connected to a sorter in a diagonal stage. Those locations that are not connected to a sorter are "passed through" the stage.

FIG. 43 shows one passthrough and one diagonal 4-sorter instantiation, from the Sequence 2 R/C 2/1 diagonal stage. The passthrough location is at the upper left of the rectangle, and is shaded in the Stage 2 rectangle in FIG. 29. The diagonal locations from the 4-sorter in FIG. 43 are also shaded in the Stage 2 rectangle in FIG. 29.

The algorithm used to create SV source code for any diagonal stage module is shown in FIG. 44. Given the rectangle size, and the diagonal R/C value, the algorithm produces instantiations for all diagonal sorters, all passthroughs, and, when appropriate, all IntRows sorters.

The UCMS sorting network system, as discussed above, is a unified and methodical system, utilizing single-stage hardware N-sorters instantiated in multiway merge sort networks. It is assumed that this system can be modified for improved performance in certain ways. For example, it has been shown just above that a sorting network with Nfinal = 8 was designed to be quicker and use fewer resources when using a non-standard Ncols = 3 multiway merge, versus the prior art Ncols = 2 O-EMS 2-way merge. An Ncols = 2 sorting network, with Nfinal > 8, could use the Ncols = 3 non-standard network to sort the first groups of 8 values, before continuing on with a standard Ncols = 2 2-way merge algorithm.

A single-stage hardware 8-sorter could also be used to sort the first groups of 8 values. The hardware 8-sorter is even faster than the non-standard Ncols = 3 network, when using 6-input LUT slice logic blocks, but uses a large number of LUT resources to obtain this speed.

Similar UCMS network modifications can presumably be made to improve performance in some way. If the modifications use principles discussed above, such as use of non-standard UCMS networks, or the use of single-stage hardware sorters in place of portions of a sorting network, such modifications will be in keeping with the various embodiments that have been disclosed here.

The information and equations that have been presented so far in this set of embodiments are enough to allow a designer to implement any standard UCMS network that satisfies Equation (1). Such a network takes an unsorted list of Nfinal values, and then produces a correctly sorted full list of those same Nfinal values. It has also been shown how a non-standard UCMS network is created easily from a standard network. In the particular example that was discussed, the non-standard 3-way network was shown to outperform a comparable state-of-the-art O-EMS network, for both speed and resource usage, when both are implemented using the 6-input LUTs commonly found in modern FPGAs.

As previously discussed, in a rank order filter, only certain output locations are produced from an unsorted list of input values. Often, the rank order filter only produces one value, e.g., the max, min, or median of the unsorted input list. However, UCMS sorting networks are used effectively to produce several types of rank order filters.

One prior art use of multiway sorting networks to produce rank order filters was the effort by several researchers to extract the median of 3x3 images. A diagram showing the UCMS 3x3 median filter is shown in FIG. 45. The algorithm used for the UCMS 3x3 median filter is essentially the same prior art algorithm used by these researchers. However, in order to implement a 3-sorter or filter operation with 3 inputs, those researchers either used a 3-stage network of hardware 2-sorters or incompletely defined hardware 3-sorters to implement their sorting network. UCMS uses the single-stage hardware 3-sorters and filters discussed in earlier embodiments.

Note that the sorting network in FIG. 45 uses several different hardware sorters and filters. In Stage 0, the column sort stage, single-stage hardware 3-sorters are used. In Stage 1 of Sequence 1, 3 different single-stage hardware rank order filters are used. A 3-min hardware filter is used in Row 2, the row of max column values. A 3-median hardware filter is used in Row 1, the row of median column values, and a 3-max hardware filter is used in Row 0, the row of min column values. In the final stage, Stage 2 of Sequence 1, one single-stage 3-median filter is used.

The unsorted input list for the 3x3 median example is shown at the top of FIG. 45. This is the same input list for the full 3x3 sorting network example shown in FIG. 30. Note that the full Nfinal = 9 sort requires 4 stages, but finding the median of those 9 unsorted values only requires 3 stages. Since these 3 stages only use 2-sorters and 3-sorters, each of the 3 stages only uses the minimum stage time, a stage with only 2-sorters, when implemented using a 4-LUT logic block.

Although this 3-stage sorting network determines the median of the 3x3 values quickly, an even quicker solution is available, a 9-median single-stage hardware filter. When using a slice logic block with 4 6-input LUTs, the input signals for this filter propagate through 4 logic blocks, which is the equivalent of 2 2-sorter stages in series. A full hardware 9-sorter uses a large number of resources. However, a hardware 9-median filter eliminates all logic and output muxes, except those required for the median value. The reduced hardware usage of the 9-median hardware filter, along with its reduced propagation delay, may make it the best choice for calculating a 3x3 median value.

FIG. 46 shows a UCMS sorting network median filter for a 5x5 set of unsorted input values. The algorithm shown in this figure, which uses single-stage hardware sorters and filters, has not been shown in prior art. Sequence 0 in the FIG. 46 example uses 5 5-sorters. The row sort stage, Stage 1 in Sequence 1, uses 5 different rank order filters, each of which outputs at least 2 values in its sorted list. From the top row down to the bottom row, the rank order filters used are Min-2-of-5, Min-3-of-5, Mid-3-of-5, Max-3-of-5, and Max-2-of-5. Stage 2 in Sequence 1, the R/C 1/1 stage, uses 3 different rank order filters, Min-of-4, Median-of-5, and Max-of-4. The final stage in Sequence 1, the R/C 1/2 Stage 3, uses a single Median-of-3 rank order filter. A full UCMS sorting network for the 5x5 set of input values is not shown, but it would require 2 more stages, an R/C 1/3 Stage 4 and an R/C 1/4 Stage 5. In addition to using fewer resources than a full sort of the 5x5 values, the UCMS median rank order sorting network for the 5x5 values uses 2 fewer stages.

Although the examples discussed above target a 3x3 or 5x5 square of values, the median stage reduction is a more general phenomenon. When Ncols is odd and Nrows_final is odd, determining the median of Nfinal values will require fewer stages than a full sort of those Nfinal values. When Ncols = 3, the median stage reduction is 1, the reduction is 2 when Ncols = 5, it is 3 when Ncols = 7, and so on.

As discussed above, using a prior art O-EMS methodology, the max or min (or both) of an unsorted list of 2^p values is determined in p 2-sorter stages. Using the methodologies of multiway sorting networks, this relationship can be generalized. With Ncols ≥ 3, the max of an unsorted list of Ncols^p values is determined in p stages, each of which contains Ncols-max rank order filters.

The methodology for finding the min of an unsorted list uses the same number of resources, and has the same propagation delay, as finding the list max. If both the min and max are produced, the number of required resources increases, but the propagation delay does not change. Because of this, only finding the max of an unsorted list will be discussed here.

In the same amount of time used by prior art max networks using 2-max filters, UCMS max filter networks, using single-stage hardware N-max filters with N ≥ 3, are able to find the max of much larger lists. For example, as shown in FIG. 47, the max of 25 values is determined in 2 stages, using 5-max hardware filters. In effectively the same amount of time, prior art O-EMS methodology using 2-max filters will only determine the max of 4 unsorted input values. Furthermore, it will take a prior art O-EMS sorting network using 2-max filters 3 stages to find the max of 8 values. Using UCMS 5-max filters, the max of 125 values is determined in 3 stages, and these 3 stages take approximately the same amount of time as the 3 prior art O-EMS stages.

The UCMS sorting network system is a unified and methodical system, utilizing single-stage hardware N-sorters instantiated in multiway merge sort networks. The UCMS sorting network system satisfies Equation (1) above and can be modified for improved performance. In a rank order filter, only certain output locations are produced from an unsorted list of input values. Often, the rank order filter only produces one value, e.g., the max, min, or median of the unsorted input list. However, UCMS sorting networks are used effectively to produce several types of rank order filters.

Further modifications and alternative embodiments of various aspects of the invention will be apparent to those skilled in the art in view of this description. Accordingly, this description is to be construed as illustrative only and is for the purpose of teaching those skilled in the art the general manner of carrying out the invention. It is to be understood that the forms of the invention shown and described herein are to be taken as examples of embodiments. Elements and materials may be substituted for those illustrated and described herein, parts and processes may be reversed, and certain features of the invention may be utilized independently, all as would be apparent to one skilled in the art after having the benefit of this description of the invention. Changes may be made in the elements described herein without departing from the spirit and scope of the invention as described in the following claims.

6. The method according to claim 1 further comprising a

The invention claimed is:

1.
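The extra stage rules for Sequences 2 and higher can be exercised numerically. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, the ceiling of the base-2 logarithm is computed with an integer loop rather than floating point, and each merge sort sequence is assumed to contain the Ncols Sequence 1 stages plus its extra stages.

```python
def num_extra_stages(nrows_q, ncols):
    """Sketch: CEILING(log2(Nrows_q / Ncols)) for a Sequence q with q > 1,
    computed as the smallest n with Ncols * 2^n >= Nrows_q."""
    n = 0
    while ncols * (1 << n) < nrows_q:
        n += 1
    return n


def sequence_stages(q, nrows0, ncols):
    """Sketch: ordered stage labels for merge sort Sequence q (q >= 1)."""
    nrows_q = nrows0 * ncols ** (q - 1)
    extra = num_extra_stages(nrows_q, ncols) if q > 1 else 0
    stages = ["row sort"]
    # Extra stages: row delta starts at 2^extra and halves down to 2
    stages += ["R/C %d/1" % (2 ** k) for k in range(extra, 0, -1)]
    # Sequence 1 template stages: R/C 1/1 ... 1/(Ncols - 1)
    stages += ["R/C 1/%d" % c for c in range(1, ncols)]
    return stages


def total_stages(nrows0, ncols, qfinal):
    """Sketch: Sequence 0 (one stage) plus all merge sort sequence stages."""
    return 1 + sum(len(sequence_stages(q, nrows0, ncols))
                   for q in range(1, qfinal + 1))


if __name__ == "__main__":
    print(sequence_stages(2, 2, 4))   # 4-column example, Sequence 2
    print(total_stages(3, 3, 4))      # Nfinal = 243, Ncols = 3
```

With Nrows_0 = 3, Ncols = 3 and qfinal = 4 (Nfinal = 243), the sketch counts 1 + 3 + 5 + 7 + 8 = 24 stages, which agrees with the stage count quoted for the FIG. 36 example.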
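The 3-stage 3x3 median flow can be modelled in software to check the data path. The Python sketch below is only a behavioral illustration: `sort3` and `median9` are hypothetical names, Python's built-in `sorted`, `min` and `max` stand in for the single-stage hardware 3-sorters and filters, and nothing here reflects the LUT-level design.

```python
def sort3(a, b, c):
    """Sketch: stand-in for a 3-sorter, returns (min, median, max)."""
    lo, mid, hi = sorted((a, b, c))
    return lo, mid, hi


def median9(values):
    """Sketch: median of 9 values using the 3-stage flow described above."""
    if len(values) != 9:
        raise ValueError("expected exactly 9 values")
    # Stage 0: sort each of the 3 columns with a 3-sorter
    cols = [sort3(*values[i:i + 3]) for i in (0, 3, 6)]
    mins = [c[0] for c in cols]
    meds = [c[1] for c in cols]
    maxs = [c[2] for c in cols]
    # Stage 1 of Sequence 1: one rank order filter per row
    min_of_maxs = min(maxs)          # 3-min on the row of max column values
    med_of_meds = sort3(*meds)[1]    # 3-median on the row of median values
    max_of_mins = max(mins)          # 3-max on the row of min column values
    # Stage 2 of Sequence 1: a single 3-median filter
    return sort3(max_of_mins, med_of_meds, min_of_maxs)[1]


if __name__ == "__main__":
    print(median9([5, 1, 9, 3, 7, 2, 8, 6, 4]))
```

Only the three values that can still be the median survive Stage 1, which is why the final stage needs just one 3-median filter instead of a full sort.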
A method for designing a single - stage hardware N - sorter, the method comprising steps of: applying to input ports an input list of N unsorted data input values , where N23 , and each N - sorter internal input data value is supplied by an input port; using a comparison operator to generate, in parallel, all N * ( N - 1 ) / 2 possible 2 - value comparison result signals for the input list ; enforcing an order for identical input values , in which an input value located higher in the input list is judged to be greater than an identical input value located lower in the input list ; 26 5 10 15 step of building the multiplexer select line signals, wherein the building step further comprises steps of: creating for each of the N data inputs all 2N - 1 possible product terms, with each product term containing all of the N - 1 comparison signals for this input, and with each comparison signal specified in its inverted or non -inverted state; for each comparison signal state in a product term , assigning a “ win ” if the data input signal is on the left side of the comparison operator, and the comparison signal state is non - inverted , or assigning a “ win ” if the data input signal is on the right side of the operator, and the comparison signal state is inverted ; summing the “ wins” for each product term ; and adding each product term to the input's particular SOP equation in which each product term in the SOP equation has that same number of “ wins ” , where the number of “ wins ” indicates which output port the input value is assigned to . providing a set of output multiplexers, each multiplexer having N data input signals and N - 1 multiplexer select 20 7. The method of claim 1 further comprising a step of line signals; in the output multiplexers, assigning, in parallel, each of modifying the method for a particular hardware type. 8. 
The method of claim wherein the particular hardware the Nd data input values to an output port, using both is one or more selected from the group: a logic block the N data input signals and the multiplexer select line type with one or more Look Up Tables (LUT ) , and associated signals ; and outputting to output ports an output list of sorted values, 25 2- 9.to-multiplexers 1 method of. claim 8 , wherein the LUT is a 6 -input The wherein an order of duplicate values in the output list LUT. matches the order of those values in the input list . 10. The method of claim 7 , wherein the particular hard 2. The method according to claim 1 , wherein the com ware a FieldofProgrammable Array (FPGA parison operator is ‘ greater than or equal' ( 2 ) operator, and 11. type The ismethod claim 1 furtherGate comprising a step).of the input value located higher in the input list is on the left 30 using a Hardware Description Language (HDL ) . side of the > operator, and the input value located lower in 12. The method of claim 11 , wherein the HDL is System Verilog ( SV) . the input list is on the right side of the z operator. 3. The method according to claim 1 , wherein the assigning 13. The method of claim 1 further comprising a step of step further comprises a of using ternary syntax or condi- modifying the single stage hardware N - sorter to create a 35 single stage N - to - M hardware filter, wherein M < N . tional syntax. 14. The method of claim 1 further comprising a step of 4. The method according to claim 1 , wherein the multi plexer select line signals propagate through an amount of modifying the single stage hardware N - sorter to create a series logic used to produce the multiplexer select line N -max hardware filter. 15. The method of claim 1 further comprising a step of signals . 9 5. The method according to claim 1 , wherein each mul- 40 modifying the single stage hardware N -sorter to create a N -min hardware filter. 
tiplexer select line signal is defined by a Sum -Of- Products ( SOP ) equation.
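By way of illustration only, the comparison and "win" counting steps recited in claims 1, 2 and 6 may be modeled in software as sketched below. This is a behavioral sketch in Python, not the hardware description itself. The function names are hypothetical, "higher in the input list" is assumed to mean a smaller list index, and output port w is assumed to receive the input with w wins, so that port 0 holds the minimum. The sketch counts wins directly rather than expanding the SOP equations or mapping them to LUTs.

```python
from itertools import combinations


def win_counts(values):
    """Return, for each input, its number of comparison "wins".

    Assumption: index 0 is the input located highest in the input list.
    """
    n = len(values)
    if n < 3:
        raise ValueError("the claimed N-sorter requires N >= 3")
    # All N*(N-1)/2 comparison signals; the input higher in the list
    # (smaller index) is always on the left side of the >= operator.
    ge = {(i, j): values[i] >= values[j] for i, j in combinations(range(n), 2)}
    counts = []
    for k in range(n):
        wins = 0
        for other in range(n):
            if other == k:
                continue
            if k < other:
                # Input k is on the left: a non-inverted signal is a win.
                wins += ge[(k, other)]
            else:
                # Input k is on the right: an inverted signal is a win.
                wins += not ge[(other, k)]
        counts.append(wins)
    return counts


def n_sorter(values):
    """Route each input to the output port selected by its win count."""
    outputs = [None] * len(values)
    for value, wins in zip(values, win_counts(values)):
        outputs[wins] = value
    return outputs  # port 0 (min) first, port N-1 (max) last
```

Because ties are always resolved in favor of the input on the left side of the operator, identical inputs receive distinct win counts, so every output port is driven by exactly one input.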
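The 3-stage 3x3 median network described with reference to FIG. 45 may likewise be sketched in software. The Python model below assumes that Sequence 0 sorts each column with a 3-sorter, and it follows the row assignments given in the description: a 3-max filter on the row of min column values, a 3-median filter on the row of median column values, and a 3-min filter on the row of max column values, followed by one 3-median filter. The function name and the row-major grid layout are hypothetical.

```python
def median_3x3(grid):
    """Median of a 3x3 grid (list of 3 rows of 3 values), in 3 stages."""
    # Sequence 0 (assumed): sort each column with a 3-sorter.
    columns = [sorted(column) for column in zip(*grid)]
    row_of_mins = [column[0] for column in columns]
    row_of_medians = [column[1] for column in columns]
    row_of_maxes = [column[2] for column in columns]

    # Stage 1 of Sequence 1: three different rank order filters.
    max_of_mins = max(row_of_mins)                  # 3-max filter
    median_of_medians = sorted(row_of_medians)[1]   # 3-median filter
    min_of_maxes = min(row_of_maxes)                # 3-min filter

    # Stage 2 of Sequence 1: a single 3-median filter.
    return sorted((max_of_mins, median_of_medians, min_of_maxes))[1]
```

Each stage in this sketch corresponds to one layer of single-stage filters, which is why only 3 stages are needed, compared with the 4 stages of the full 9-value sort.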
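The stage counts stated in the description for max filter networks (25 values in 2 stages and 125 values in 3 stages with 5-max filters, against 4 values in 2 stages and 8 values in 3 stages with 2-max filters) can be checked with a short sketch. The Python function below is a hypothetical software model in which each stage applies one N-max filter to every group of N values. It assumes the list length is an exact power of N.

```python
def max_network(values, n):
    """Return (max value, number of stages) for a network of n-max filters."""
    level = list(values)
    if not level:
        raise ValueError("input list must not be empty")
    stages = 0
    while len(level) > 1:
        if len(level) % n != 0:
            raise ValueError("list length must be a power of n")
        # One stage: every group of n values feeds one n-max filter.
        level = [max(level[i:i + n]) for i in range(0, len(level), n)]
        stages += 1
    return level[0], stages
```

For example, `max_network(range(25), 5)` uses 2 stages, while a network of 2-max filters given the same 2 stages covers only 4 input values.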