## ЗАПРОС ДАННЫХ С ПЛАТЫ FPGA (ПЛИС) QUERYDATAWITH ANFPGA BOARD PhD Candidate: Haider I Mohsin, Prof. J.I.Batyrkanov Kyrgyz, state Technical University liveoffice12@gmail.com In thispaperacommercial FPGA coprocessorboard issued to accelerate the data of queries on a relational database that contains texts and images. FPGA designs for texts earching and image matching are described and their performances summarized. Apotential design for a database JOIN operatoristhen studied. A query optimization preprocessor is then proposed. Keywords:Field ProgrammableGateArray (FPGA), ConfigurableComputing,Textand ImageDatabase, OueryOptimization #### 1. Introduction Computerswith **FPGAcoprocessorboards** havebeen usedto acceleratemanydifferent applicationsin thepast, especially those that do not involvemassiveamount offloating point computations [1,2, 3, Query dataon textand imagedatabases providesampleopportunities foracceleration withsuchcoprocessorboards. thispaperacommercialF-In **PGAcoprocessorboard** isused toacceleratethe dataofqueries onarelationaldatabasethatcontainstextsandimages. In [5], ASICchipsweredevelopedforimage and stringmatchingdataand ahybridand reconfigurableboard architecturewas proposedthatcombinedthoseASICchipsand FPGAs/EPLDsformultimedia applications. Thedifficultyof that approachisthat the functionalityofthe ASICchipsneedsto beversatileenough to cover theapplicationdomain. In Section 2, FPGAdesignsfortext searchingand imagematchingaredescribed and theirperformancesusingaXilinx XC4085on acommercialF-PGAboard are summarized. Thedesignsareusedto acceleratequeriesonan Oracle8idatabase. Apotentialdesign foradatabaseJOIN operatoris then studied. In Section 3,a queryoptimization preprocessoris proposed. Section 4concludesthepaper. # 2. TextSearching andImage Matching In order to evaluate the proposed approach, a test platform was used which was a 600 MHzPentiumIIIpersonalcomputerwith an AnnapolisMicro Systems'sWildForceTM FPGAboard. Onthatcomputera small text/imagedatabasewassetup andaGUI interfacewasdevelopedin Microsoft Foundation Classto displayimage/textdata records. Queriestoaccessthedatabasewere implemented based on ODBC(Open Data BaseConnectivity) API functions. Three different operationswerestudied,text searching, imagematching, and JOIN operator. They are described in the following subsections. ## 2.1 Text Searching The databasecontains records that include some photosand resumes. A keyword string can be input through the GUI and used to match against each text file. All the records whose employee resume files include the keyword were displayed. An efficientkeyword-matching algorithm calledtheKMPalgorithm (inventedby Knuth,Morris,andPratt)wasimplemented onthePC.Thealgorithm doesnotusemore than M+Ncharactercomparisonstomatcha keywordofMcharactersagainstastringof Ncharacters. Two FPGAdesignsfor thecomparison ofa keyword against a text file were implemented, one a straightforward hardware implementationoftheKMP algorithmandtheotherabruteforceusage of parallel-comparators to implement the naïvetextsearchalgorithm (of O(M\*N) complexity sequentially). It-wasfoundthat whiletheKMPalgorithmismoreefficient on sequentialmachines, itenjoys no advantage in FPGAimplementation. In fact, itsuffersfromacouple-ofdrawbacks:(1)Its control circuit ismorecomplicatedand(2) it ismoredifficult fortheFPGAtograb externaldata every clock cyclebecause whetheritneedsnew dataornotdependson thecomputationon-thecurrentdata. As a result, without having a complicated data buffer, the KMPFPGA implementation can only grab one newdata once every three clockcycles. By contrast, the brute force parallel comparatorimplementationsimply grabsnewcharactereveryclockcycle. In addition because thememory portcan providefourconsecutivecharactersinone clock cycle, four copies of the same hardwarecanbe used totakeadvantageof that data parallelism with the brute force implementation. Becauseofthesereasons the parallel comparator implementation is considered superior totheKMP one. Inorderto test "caseinsensitive"keyword search, all thetexts in eachtextfilewere converted to lowercases rightbeforethe keyword comparison. It-wasfoundthaton the PC for aspecific casewith 22 K bytesof text, it took 1.5 seconds for "case conversion" and 0.01 seconds for "keyword matching." An FPG Adesignthat performs both the case conversion and the keyword matching was implemented. Thedesign, utilizing 1,035 configurable logicblocks (CLBs), runs at10MHz and processes four characters perclock cycle. Since the FPGA design spends less than 0.01 seconds for both computations, the speedup is more than 150. Note that themajority of the computation is on the case conversion instead of the keyword matching. ## 2.2ImageMatching With the GUI, a userspecifies a128x 128 areaoutofa320x 240imageandusesthat area as atemplate tocomparetoallthe images inthedatabase. The record-whose image *bestmatchs* with the template (within an "acceptable" degreeofmatching) is displayed. To simplifyandspeed up the matching process, thegray-scaledimages were preprocessed and their pixels converted into binary values. The binary images are stored in the database together with thegray scaled ones. The templateis also binarized using a threshold chosen basedonthehistogram ofthetemplate. The resulting system is as shown in the following figure. In termsof thematchingcomputation, itisa movingwindowbased distancealculation. When-thetemplateisapplied to apixel location ofanimage, the distance is computed as the number of mismatches between the template and the image window. Becausethe templateand the imagearebothbinarized, atemplatepixel and animagepixelareconsidered mismatchedifanXOR operation produces a resultofone. The computation therefore lendsitselftoavery efficient Pentium implementation because an integer XOR instruction handles 32 pixels in parallel. A 256-entrylookup tableisused onthe Pentiumcode tocount thenumberofones in a32-bit integerresultingfromtheXOR operation.With thisefficientPentiumcode, thesearch onthePCtookaround 8.48 seconds fornineimages. Thesame search usingasinglecomputationalunitona XC4085 FPGAchip(with 3,136 CLBs)with aclockrateof40 MHzon aWildForce boardtookaround 3.91 seconds. Therefore the speedup is around 2.17. The FPGA design is as shown in the following diagram. Byusing multiplecomputation units, the potentialimprovementsofafuturedesign on thecurrentWildForceBoardareestimated as follows. On asingleFPGAchip, around 4 computationalunitscanbebuiltand the speedup willbecloseto 5orabove. With fourFPGAchipsperboard,thespeedup is expected to bearound 10 orabove. Here the speedup isnotassumed to belinearbecause otheroverheadsmaydegrade the performancewhenhigherdegreeof parallelismisexplored. Anotherpoint worth mentioningis that Xilinx Virtex FPGAchipswillbemore suitableforthis design thanthe 4085 chipsbeingused because the template could then bestored more efficiently (both inspace and inspeed) in on-chip block RAMs. # 2.3JOIN Operator Typicaldatabaseoperationsinvolve"JOIN" operations. AJOINoperation ontwo tables producesatablewhosekey valuesarein bothtables. Theoperation is basically for each elementin asequence A, to find a corresponding matching element in another sequence B. In order to develop an efficient FPGA design for the JOIN operation, the following sequential algorithm was developed. Supposetherearetwo sequences of integers, A and B. ThereareNelements in A, and M elements in B. AssumeN <<M. Thealgorithm to join thetwo sequences includes two steps: (1)Sort thesequenceA: complexityO(N logN) and (2)ForeachBelement, find the matching element in A by using abinary search:complexityO(M log N). Thereforethetotal complexityis O(M logN)that comes from theM binary searches. Thealgorithm was implemented and tested on an Oracle database. WhileapureOracleJOIN operation includingdatafetchingtook27 secondsonaparticulartestcase, the proposed sequentialJOINoperation with datafetched through the sameOracle8i databasetook-only22 seconds.Thisshows</p> thatitispossibleto build ourownJOIN operation thatisasefficientasoreven better thanapureOracleJOINoperation (atleast forthelimited testsweran). An FPGAdesign toacceleratetheJOIN operation, ormore speciallytoacceleratethe binarysearch hasbeen proposed. The design uses a pipeline of comparators where each comparison result determines the operand to be sent to the next comparator stage. With a40 MHzdesign, itisestimatedthat roughly40 millionintegerscan beprocessed eachsecond (assumingeachintegeris compared to 1,024 sorted values). Since a programrunning on a 600 MHz Pentium III PCspendsaround 30secondson similar operations, the FPGA design can potentially leadto an orderofmagnitudeimprovement fortheJOINoperation. ## 3. Query OptimizingPredata Onceadatabaseiscreatedandinuse,the main performance concern is about the data,optimization, and execution of queries. A querycan typically be representedasa"querytree,"whereeach leafnoderepresentsarelation(i.e.,table) andeachinternalnoderepresents adatabase operation. Forexample,thequerytreeas shown inthefollowingfigurecorrespondsto an optimizedrepresentationofthequery **Q** (Find the lastnamesofemployeesbornafter 1970 who workon atasknamed "Champ"). With optimizationtheinitial query tree would use the Cartesian product of three tables, EMPLOYEE, WORKS\_ON, and TASK, which maylead toahuge table. Thereforeitisbetterto applya setof rules to transformthe initial query tree to an optimized tree that is more efficient to execute. The result is shown in the previous $\label{eq:figure} figure, \ where basic database operations \ such as \textbf{SELECT} and \textbf{PROJECT} are applied to$ individual tables so to reduce tables izes beforethe **JOIN** operation is applied to joining multipletables. The process of convertingan initialquerytreeto an optimized querytreeiscalled query optimization. Sinceeachbasicdatabase operation maybe implemented in many different ways, theoptimization also involveschoosingthemostefficient mentation foreach operation. For example, thereareat least six differentways to implementaSELECToperation, includingabruteforcelinearsearch, a binarysearch, usingaprimary keyorhash key, and so on, fourdifferentways implementaJOINoperato tion, and two different ways to implementa PROJECT operation. Queryoptimization ina commercialdatabaseisnormallydoneby thedatabaseengineusingbuiltin index structuresonthehostcomputer. InordertouseFPGAcomputingresources todrastically speed upqueryexecution, we propose todevelop a"query preprocessor" thatoptimizesqueriesby taking into account theavailability of FPGAresources. The inputtothequerypreprocessorisaquery application in embedded SQL," which is basically-aprogram in the Clanguage with SQL statements embedded init.(ThoseSQL statementsare enclosedbyprecompilation directives.) The "querypreprocessor" will separateembeddedSQLstatementsfrom restoftheapplicationandconvertthem into twoparts, one for the FPGAresourcesand theotherfor thebackenddatabaseengine. The rationale is, because of the finite amountofFPGA resources, itmay notbe possibleto squeezeallthequeryoperations into the FPGA In addition, in the case when coprocessor board. there is not enough FPGA resources, the query preprocessorwillneedtoevaluatewhether toreconfiguretheFPGAformoredatabase operations or simply allocate those operations tothehostmachine. This also implies that optimization opportunities available through index are not wasted. More specifically, the steps involved in using thequery preprocessorwillbeas follows. - 1. The preprocessor isolates SQL statementsfrom anembeddedSQL application. - ThepreprocessorconvertsaSQLquery toaquerytreeandtransformsittoan optimized tree. - Thepreprocessoridentifies"sub-trees" thatcan beacceleratedwithFPGA computing. Each sub-treewillbe replacedwithCodes thatcontrolsthe FPGA coprocessorboardwhilethe remaining partofthetreewillstillbe codedin SQL. - After being compiled, the final application codecan beexecuted. A commercial databasebackendwillbe invoked fortheSQLpart andanFPGA board forthesub-treeC code. There will beparalleloperationsforthesubtreesallocatedtotheFPGAboard. #### Tree Partitioning treeispartitionedintotwoparts, query onefortheFPGAand the other forabackenddatabaseengineon thehostmachine, as shown the following figure. Inthisexample, instead ofsendingthethree tables to theFPGAboarddirectly, wefirst apply SE-**LECT**and**PROJECT**operations onthose threetablesandthensendtothe FPGAboard the resulting tableswhichare supposed to be much smaller. Forexample, after the table TASKgoesthroughthe SELECT TNAME='Champ'and **PROJECT** TNUMBER operations, the resulting tablehasonly onesinglecolumn. i.e., TNUMBER.andcontainsonlythose rowswhose TNAMEis'Champ'. Asshown inthefiguretheFP-GAboardisresponsible for two **JOIN** operations and two **PROJECT**operations. ## **FPGA** Computing Dependingonthesizeofthethreetables sent to theFPGAboard.differentFPGA designscould beused. Computation of those JOIN and PRO-JECToperations canbe pipelined to increasespeedup. Forthe JOIN TNUMBER=TNO operation, suppose there are only ten TNUMBER rows. In thatcase,ten comparatorscanbe usedforeveryincoming TNO seeifthereisamatch. However, since the number of TNUMBER rows cannot be pre-determined, amore complicateddesignthat includes comparators, FIFOs, and acontrollerwillbe moreappro-Thespeedupwilldepend onhowmany comparators can be squeezed into the hardware. For the **PROJECT ESSN** operation, it is necessary to remove duplicateESSNrowsandkeeponlydistinct ones. This can be done either with brute forcecomparisonsorwithhashing. Ineither case, parallelcomparisoncanagainbe applied. (Computationofthehashfunction canbepipelinedwiththecomparison, in the case of using hashing.) The query preprocessor will need to make the decision and choose the most efficient way of implementation. Notethatthethreetables inthedata-basearelockedduringthewhole transactionprocesssoas toavoid any modification to thetablesduringthe transaction byotherapplications. Example 2 This example corresponds to another query **Q2**(i.e., Find the last names of employees wearing glasses who work on a task named "Champ" and whose resume contains the keyword "FPGA"). In this case text/image databases are involved. Asshownintheprevious figure, theimage data of thepicturesandthesearching of the text documents should be handled with the FPGA resources Asamatterof fact, if there are not enough FPGA resources, the query preprocessor may decidethat itisbettertoimplementONLY the text/imagedatapartontheFPGA since those the maximal amountofparallelism. Sucha partitionis shownin the followingfigure. Apparently therearemany designoptionsother thanthe two previousdesigns. It is the responsibility of thequery preprocessortorankthose different design options choose the "best" one for implementation. ## QueryPreprocessorInternal The figureas shown at the end of the paperschematically illustrates the two majortools that need to be developed: "QueryPreprocessor" and the "FPGA Design Tool". Given a C/C++ application program (i.e., C/C++ file) that contains ODBC statements as an input, ausermayidentify some parts that have nothing to do with database operations but can be accelerated with FPGA boards. The user can then use some domain-specific FPGA design tools, whether they are for image data, video data, or text string data, to produce some FPGA designs and insert the corresponding host code interface in the C/C++ file. In addition, the user may add some preprocessor directives to help specify the parts in the C/C++ file that can be put into a query tree. The resultingC/C++ file,FPGA design files, and FPGA libraryofbasicdatabase operators act as inputs to the "Query Preprocessor" and the "FPGA Design Tool". ## 4. Conclusions Multimediadatabases havemany applications that need to beaccelerated. In this papersomesuch potential applications areaccelerated byusinga commercialFPGAcoprocessorboard based on XC4085. FPGAdesigns fortext searchingand imagematchingaredescribed and theirperformancessummarized. A potentialdesign foradatabaseJOIN ## REFERENCES [1]J.Villasenor,B.Schoner, K.Chia,C. Zapata, H.Kim, C. Jones, S.Lansing, and B. Mangione-Smith, "Configurable Computing Solutions for Automatic Target Recognition," in IEEE Symposium on FPGA Custom Computing Machines, pp. 70–79, 1996. [2] J.S.N.Jean,X.Liang,B.Drozd,K. Tomko, and Y. Wang, "Automatic Target Recognition with Dynamic Reconfiguration," to appear in Journal of VLSISignal Data. [3]M. Alderight, E.L. Gummati, V. Piuri, and G.R. Sechi, "A FPGA-based Implementation of a Fault-Tolerant Neural Architecture for Photon Identification," in Proc. of ACM/SIGDA International Symposium on FPGAs, pp. 166-172, 1997. [4] M.ShandandL.Moll, "Hardware/Software Integration in Solar Polarime- try,"inIEEESymposiumonFPGA Custom Computing Machines, pp. 18-26, #### Известия КГТУ им. И.Раззакова 31/2014 #### 2011. [5]H. Blume, H.M. Bluthgen, C. Henning, and P. Osterloh, "IntegrationofHigh-PerformanceASICs into Reconfigurable SystemsProvidingAdditional Multi- media Functionality,"IEEEInternational Conferenceon Application-Specific Systems, Architectures, and Processors (ASSP),July2010.