Papers by Thomas Yau-tat Lee
HKU CS Technical Report, Mar 29, 2013
Many semantic or RDF datasets are very large but have no pre-defined data structures. Triple stores are commonly used as RDF databases, yet they cannot achieve good query performance on large datasets owing to excessive self-joins. Recent research has proposed storing RDF data in column-based databases, yet studies have shown that this approach does not scale with the number of predicates. The third common approach is to organize an RDF dataset into different tables in a relational database. Multiple “correlated” predicates are maintained in the same table, called a property table, so that table joins are not needed for queries that involve only the predicates within that table. The main challenge for the property table approach is that it is infeasible to manually design good schemas for the property tables of a very large RDF dataset. We propose a novel data-mining technique called Attribute Clustering by Table Load (ACTL) that clusters a given set of attributes into correlated groups, so as to automatically generate the property table schemas. While ACTL is NP-complete, we propose an agglomerative clustering algorithm with several effective pruning techniques to approximate the optimal solution. Experiments show that our algorithm can efficiently mine huge datasets (e.g., Wikipedia infobox data) to generate good property table schemas, with which queries generally run faster than with triple stores and column-based databases.
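The abstract does not reproduce the table-load criterion, but the agglomerative idea behind ACTL can be illustrated with a toy sketch. The Jaccard merging criterion, threshold, and all names below are our own simplification for illustration, not the paper's actual algorithm: predicates whose subject sets overlap heavily end up in the same property table.

```python
from itertools import combinations

# Toy triples: (subject, predicate). Predicates that tend to appear on
# the same subjects should land in the same property table.
triples = [
    ("film1", "director"), ("film1", "runtime"),
    ("film2", "director"), ("film2", "runtime"),
    ("city1", "population"), ("city1", "mayor"),
    ("city2", "population"), ("city2", "mayor"),
]

def cluster_predicates(triples, threshold=0.5):
    """Greedy agglomerative clustering of predicates by subject co-occurrence.

    Repeatedly merges the pair of clusters whose subject sets have the
    highest Jaccard similarity, until no pair reaches `threshold`.
    """
    # Start with one cluster per predicate, tracking its subject set.
    subjects = {}
    for s, p in triples:
        subjects.setdefault(p, set()).add(s)
    clusters = [(frozenset([p]), subs) for p, subs in subjects.items()]

    while True:
        best = None
        for (c1, s1), (c2, s2) in combinations(clusters, 2):
            jaccard = len(s1 & s2) / len(s1 | s2)
            if jaccard >= threshold and (best is None or jaccard > best[0]):
                best = (jaccard, (c1, s1), (c2, s2))
        if best is None:
            return [sorted(c) for c, _ in clusters]
        _, (c1, s1), (c2, s2) = best
        clusters = [c for c in clusters if c[0] not in (c1, c2)]
        clusters.append((c1 | c2, s1 | s2))

# Each resulting cluster becomes the column set of one property table,
# here {director, runtime} for films and {mayor, population} for cities.
tables = cluster_predicates(triples)
```

The real ACTL objective trades off table sparsity against the number of tables; this sketch only captures the bottom-up merging structure that the pruning techniques accelerate.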
This paper outlines the process of cleansing and extracting infobox data from Wikipedia data dumps into Resource Description Framework (RDF) triples. The numbers of extracted triples, resources, and predicates are large enough for many research purposes, such as semantic web search. Our software tool will be open-sourced for researchers to produce up-to-date RDF datasets from routine Wikipedia data dumps.
In this paper, we study the data interoperability problem of web services in terms of XML schema compatibility. When Web Service A sends XML messages to Web Service B, A is interoperable with B if B can accept all messages from A. That is, the XML schema R with which B receives XML instances must be compatible with the XML schema S with which A sends XML instances, i.e., S is a subschema of R. We propose a formal model called Schema Automaton (SA) to model W3C XML Schema (XSD) and develop several algorithms to perform different XML schema computations: schema minimization, schema equivalence testing, subschema testing, and subschema extraction. We have conducted experiments on an e-commerce standard XSD called xCBL to demonstrate the practicality of our algorithms. One experiment refuted the claim that the xCBL 3.5 XSD is backward compatible with the xCBL 3.0 XSD. Another showed that the xCBL XSDs can be effectively trimmed into small subschemas for specific applications, which significantly reduced the schema processing time.
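As a rough illustration (not the Schema Automaton formalism itself), subschema testing can be viewed as language inclusion between automata over element names. The sketch below checks L(S) ⊆ L(R) on toy DFAs, assuming the sender's automaton is trimmed so that every state can reach an accepting state; the automata and element names are hypothetical.

```python
from collections import deque

def is_subschema(S, R):
    """Check L(S) is a subset of L(R) by exploring the product automaton.

    Each automaton is (transitions, start, accepting), where transitions
    maps (state, symbol) -> next state; a missing transition rejects.
    Assumes S is trimmed (every state can reach an accepting state).
    """
    trans_s, start_s, acc_s = S
    trans_r, start_r, acc_r = R
    seen = {(start_s, start_r)}
    queue = deque(seen)
    while queue:
        qs, qr = queue.popleft()
        # S accepts some sequence here that R rejects.
        if qs in acc_s and qr not in acc_r:
            return False
        for (state, sym), nxt_s in trans_s.items():
            if state != qs:
                continue
            nxt_r = trans_r.get((qr, sym))
            if nxt_r is None:
                # S can emit an element R has no transition for.
                return False
            if (nxt_s, nxt_r) not in seen:
                seen.add((nxt_s, nxt_r))
                queue.append((nxt_s, nxt_r))
    return True

# Toy content models: R additionally allows an optional trailing <phone>,
# so every S-instance is an R-instance but not vice versa.
S = ({(0, "name"): 1, (1, "addr"): 2}, 0, {2})
R = ({(0, "name"): 1, (1, "addr"): 2, (2, "phone"): 3}, 0, {2, 3})
```

Real XSDs add complications (types, namespaces, occurrence constraints) that the paper's Schema Automaton model handles; this sketch only conveys the inclusion-checking core.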
In this paper, we propose new models and algorithms to perform practical computations on W3C XML Schemas: schema minimization, schema equivalence testing, subschema testing, and subschema extraction. We have conducted experiments on an e-commerce standard XSD called xCBL to demonstrate the effectiveness of our algorithms. One experiment refuted the claim that the xCBL 3.5 XSD is compatible with the xCBL 3.0 XSD. Another showed that the xCBL XSDs can be effectively trimmed into small subschemas for specific applications, which significantly reduced schema processing time.
PhD Thesis, The University of Hong Kong, Nov 2009
The nature of software applications has been evolving quickly over the past decade since the World Wide Web was popularized. Some web applications must process large datasets that do not have well-defined structures, which challenges conventional data engineering methods. A conventional method typically requires the system architect to have prior knowledge of what data an application processes and how, so as to design a good database schema that optimizes data computation and storage. However, for a web application processing large-scale semi-structured and unstructured data, schema design cannot always be handled entirely by humans and needs to be automated by software tools. In this thesis, I study the problems of schema computation for semi-structured XML data and unstructured RDF data. The thesis consists of two parts. In the first part, I investigate the XML data interoperability problem of web services. To address this problem, I develop a formal model for XML schemas called Schema Automaton and derive computational techniques for schema compatibility testing and subschema extraction. In the second part, I investigate different types of databases for RDF data. For one particular type, called property tables, I propose a new data-mining technique, namely Attribute Clustering by Table Load, to automate the schema design for the database based on the underlying data patterns.
Web forms are commonly used to capture data on the web. With Asynchronous JavaScript and XML (Ajax) programming, interactive web forms can be created. However, Ajax programming is complex in that the model-view-controller (MVC) code is not clearly separated. This paper discusses an MVC-oriented web form development tool called “Webformer” that we developed to simplify and streamline web form development with Ajax. We introduce a scripting language called Web Form Application Language (WebFAL) for modeling web forms. Webformer hides the programming complexity by generating Ajax code directly from the web form models.
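WebFAL's actual syntax is not shown in the abstract; the following hypothetical sketch (model, field names, and generator are all our own) illustrates the model-driven idea of generating view code from a declarative form model, so the developer never hand-writes it.

```python
# Hypothetical declarative form model, standing in for a WebFAL script.
form_model = {
    "name": "signup",
    "fields": [
        {"label": "Email", "name": "email", "type": "text"},
        {"label": "Age", "name": "age", "type": "number"},
    ],
}

def render_form(model):
    """Compile the declarative model into HTML view code."""
    rows = [f'<form id="{model["name"]}">']
    for f in model["fields"]:
        rows.append(
            f'  <label>{f["label"]}: '
            f'<input name="{f["name"]}" type="{f["type"]}"></label>'
        )
    rows.append("</form>")
    return "\n".join(rows)

html = render_form(form_model)
```

A real generator would also emit the controller code (Ajax submission, validation) from the same model, which is what keeps the MVC layers separated.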
We propose the novel concept of a “descriptive schema” (DS). Unlike ordinary database schemas, a DS does not restrict the structure of the underlying database; rather, it is a probabilistic description of that structure. When answering keyword queries, a DS can be used to improve semantics-based query answering and result ranking.
Data processing and integration over heterogeneous data sources often require intensive coding effort, yet applications developed for this purpose are usually too inflexible to fulfill dynamic business requirements. To date, the Web Service paradigm has standardized the programmatic interfaces for application-to-application communication and has gained considerable momentum as a means of facilitating the processing and integration of heterogeneous data sources. To take full advantage of Web Services, we propose an approach in which all operations and data sources are exposed as Web Services. Based on this approach, we design a novel declarative scripting language called WSIPL (Web Services Integration and Processing Language) to drive data integration and processing tasks via Web Services with minimal programming effort. The reference architecture of a WSIPL system and implementation issues are discussed. Organizations that have deployed this technology have found that WSIPL enhances their data integration and processing systems with greater flexibility and efficiency.
XML has become a very important emerging standard for e-commerce because of its flexibility and universality. Many software designers are actively developing new systems to handle information in XML formats. We propose a generic architecture for processing XML and have designed an XML processing system using technologies such as XML, XSLT, HTTP, and Java Servlets. Our design is generic, flexible, scalable, extensible, and suitable for distributed network environments. A main application of the architecture and the system is to support data exchange in electronic commerce systems.
Since its first release in 1998, the Digital 21 Strategy has served as the blueprint for Hong Kong to develop its information and communications technology infrastructure. With government leadership as an important component, the Strategy has put forward the development of an e-government that can realize one-stop delivery of public electronic services. In 2003, the Hong Kong Government launched the Interoperability Framework (IF) as an e-government initiative to facilitate the implementation of cross-department joined-up government services. The IF comprises two parts: (1) a set of recommended technical specifications that serves as a single point of reference for departments and contractors implementing joined-up projects, and (2) a framework for formulating and managing XML message standards for G2G and G2B data exchange. This paper discusses the second part of the IF, mainly the XML Schema Design and Management Guide, and its real-life applications.
One-stop public services and single-window systems are primary goals of many e-government initiatives. Facilitating technical and data interoperability among the systems of different government agencies is key to meeting these goals. While many software standards, such as Web Services and ebXML, have been formulated to address interoperability between different technical platforms, data interoperability remains a big challenge: it concerns how different parties agree on what information to exchange, and on the definition and representation of that information. To address this problem, the Hong Kong government has released the XML Schema Design and Management Guide as well as the Registry of Data Standards under its e-Government Interoperability Framework initiative. This paper introduces how the data modelling methodology provided by the Guide can be used to develop data interfaces and standards for e-government systems. We also discuss how the Macao government has formulated its data interoperability policy and applied the Guide in its own context.
In this paper, we discuss the various issues in designing intelligent software systems to assist World Wide Web users in locating relevant information. We identify a number of key components of such intelligent systems: a web document database management system, a client-based goal-directed search engine, an intelligent learning agent that discovers users’ topics of interest by studying their browsing behavior, and an intelligent agent that monitors “hot” web sites. We give examples and suggestions on how these components are designed and implemented, and describe the architecture of a prototype system that integrates them.
Proceedings of WWW, May 2003
OASIS UBL Hangzhou Plenary Meeting 2005, May 2005
This paper describes the architecture design of the UBL v2 data model that was discussed and proposed at the Hangzhou plenary meeting in May 2005.
Since UBL v1 only provides documents in the procurement context, the v1 model architecture has no concept of multiple contexts, and all reusable Business Information Entities (BIEs) are placed in one model spreadsheet. It would be difficult to maintain a large number of reusable BIEs in a single spreadsheet as the library scales to cover more documents.
Therefore, a three-layer UBL v2 model architecture is proposed to enhance the maintainability of the library content, as UBL v2 is planned to develop more documents in multiple business contexts, e.g., procurement and transportation. In the new architecture, BIEs are organized into three layers, or classes, of model spreadsheets. This way, only the relevant modules of reusable BIEs are needed to model a document, for ease of maintenance. Similarly, document schemas can be stripped down by importing only the relevant schema modules, improving schema processing performance.
The proposed model architecture serves as a requirement for consideration by the Naming and Design Rules (NDR) Team since the Team may need to review and revise the current NDR and schema architecture to deal with the model architecture change. To be specific, a new mechanism may be needed to translate the model spreadsheets in the proposed model architecture into the schema files.
In this paper, the following topics are discussed:
1. UBL v1 data model architecture
2. Proposed UBL v2 data model architecture
3. Other considerations
Talks by Thomas Yau-tat Lee
A presentation deck for a talk introducing machine learning and data science to an audience of business executives.
Cloud computing is believed to be another big wave of Internet technology after the World Wide Web and mobile computing. The Open Group has identified cloud computing as a major driver of global GDP growth. In Hong Kong, the Office of the Government CIO (OGCIO) has established the Expert Group on Cloud Computing Services and Standards (EGCCSS) to drive cloud computing adoption and deployment. Various cloud technical committees, including the two groups mentioned above, have identified the interoperability and portability of cloud services as a key principle for stimulating and driving economic benefits. EGCCSS has formed the Working Group on Cloud Computing Interoperability Standards (WGCCIS) specifically to address this challenge.
In this talk, Dr Thomas Lee shares his experience working in WGCCIS as a co-opted member and introduces the Open Group Guide on Cloud Computing Portability and Interoperability. He explains the fundamental concepts of cloud interoperability and portability, and the reference architecture for designing interoperable interfaces between on-premise and cloud application components. He also discusses the architectural principles that support cloud service providers in developing interoperable cloud services. From the enterprise user perspective, he summarizes good practices from the Open Group Guide that help cloud consumers formulate a cloud strategy to manage vendor lock-in when selecting cloud services.