Lecture 5 - Semi-Structured Data

CS 795/895 introduces semi-structured data formats XML and JSON. XML uses tags to annotate sections of a document and allow for nested, self-describing data exchange. JSON is a lightweight alternative to XML that represents data as name-value pairs within JavaScript-style objects. The lecture discusses XML syntax including elements, attributes, and schemas as well as comparing XML to structured data and introducing JSON.

Uploaded by

Deepak Balusa

CS 795/895

Introduction to Data Science

Lecture 5- Semi-Structured Data

(XML, JSON)
Dr. Sampath Jayarathna
Old Dominion University
Introduction
• XML: Extensible Markup Language
• Defined by the WWW Consortium (W3C)
• Documents have tags giving extra information about sections of the
document
• E.g. <title> XML </title> <slide> Introduction …</slide>
• Extensible, unlike HTML
• Users can add new tags, and separately specify how the tag should be handled
for display
XML Introduction (Cont.)
• The ability to specify new tags and to create nested tag structures makes
XML a great way to exchange data, not just documents.
• Much of the use of XML has been in data exchange applications, not as a replacement for
HTML
• Tags make data (relatively) self-documenting
• E.g.
<?xml version = "1.0"?>
<bank>
<account>
<account_number> A-101 </account_number>
<branch_name> Downtown </branch_name>
<balance> 500 </balance>
</account>
<depositor>
<account_number> A-101 </account_number>
<customer_name> Johnson </customer_name>
</depositor>
</bank>
XML: Motivation
• Data interchange is critical in today’s networked world
• Examples:
• Banking: funds transfer
• Order processing (especially inter-company orders)
• Scientific data
• Chemistry, Genetics
• Paper flow of information between organizations is being replaced by
electronic flow of information
• Each application area has its own set of standards for representing
information
• XML has become the basis for all new generation data interchange
formats
• For a while, XML (Extensible Markup Language) was the only choice for open
data interchange, but open data sharing has changed considerably over the
years. The more lightweight JSON (JavaScript Object Notation) has become a
popular alternative to XML.
XML Motivation (Cont.)
• Earlier generation formats were based on plain text with line headers
indicating the meaning of fields
• Similar in concept to email headers
• Does not allow for nested structures, no standard “type” language
• Tied too closely to low level document structure (lines, spaces, etc)
• Each XML based standard defines what are valid elements, using
• XML type specification languages to specify the syntax
• DTD (Document Type Definition)
• XML Schema
• Plus textual descriptions of the semantics
• XML allows new tags to be defined as required
• However, this may be constrained by DTDs
• A wide variety of tools is available for parsing, browsing and querying
XML documents/data
Comparison with Structured (Relational) Data

• Inefficient: tags, which in effect represent schema information, are
repeated
• Better than relational tuples as a data-exchange format
• Unlike relational tuples, XML data is self-documenting due to presence of
tags
• Non-rigid format: tags can be added
• Allows nested structures
• Wide acceptance, not only in database systems, but also in browsers, tools,
and applications
Structure of XML Data

• Tag: label for a section of data

• Element: section of data beginning with <tagname> and ending with
matching </tagname>
• Elements must be properly nested
• Proper nesting
• <account> … <balance> …. </balance> </account>
• Improper nesting
• <account> … <balance> …. </account> </balance>
• Formally: every start tag must have a unique matching end tag, that is in the
context of the same parent element.
• Every document must have a single top-level element
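The nesting rules above can be checked mechanically, because a conforming parser rejects documents that break them. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the lecture.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents for illustration.
proper = "<account><balance>400</balance></account>"
improper = "<account><balance>400</account></balance>"  # tags overlap
two_roots = "<account/><account/>"  # more than one top-level element

print(is_well_formed(proper))
print(is_well_formed(improper))
print(is_well_formed(two_roots))
```

Only the first string should be accepted; the other two violate the proper-nesting and single-top-level-element rules respectively.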
Example of Nested Elements
<?xml version = "1.0"?>
<bank-1>
<customer>
<customer_name> Hayes </customer_name>
<customer_street> Main </customer_street>
<customer_city> Harrison </customer_city>
<account>
<account_number> A-102 </account_number>
<branch_name> Perryridge </branch_name>
<balance> 400 </balance>
</account>
<account>
…
</account>
</customer>
.
.
</bank-1>
Structure of XML Data (Cont.)
• Mixture of text with sub-elements is legal in XML.
• Example:
<account>
This account is seldom used any more.
<account_number> A-102</account_number>
<branch_name> Perryridge</branch_name>
<balance>400 </balance>
</account>
• Useful for document markup, but discouraged for data representation
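As a rough illustration of why mixed content is awkward for data, the sketch below parses the account example with Python's standard xml.etree.ElementTree. In that module the free text ends up in the parent's text field rather than in a named sub-element, so a program has to know where to look for it. The variable names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

doc = """<account>
This account is seldom used any more.
<account_number> A-102</account_number>
<branch_name> Perryridge</branch_name>
<balance>400 </balance>
</account>"""

root = ET.fromstring(doc)

# Text that appears before the first sub-element is stored in root.text.
note = root.text.strip()

# Named sub-elements are found by tag; surrounding whitespace is kept,
# so it is stripped here.
number = root.find("account_number").text.strip()

print(note)
print(number)
```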
Attributes
• Elements can have attributes
<account acct-type = "checking" >
<account_number> A-102 </account_number>
<branch_name> Perryridge </branch_name>
<balance> 400 </balance>
</account>
• Attributes are specified by name=value pairs inside the starting tag of
an element
• An element may have several attributes, but each attribute name can
only occur once
<account acct-type = "checking" monthly-fee="5">
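A short sketch of how attributes look to a program, assuming Python's standard xml.etree.ElementTree. Attributes arrive as a dictionary of strings, and a start tag that repeats an attribute name is rejected by the parser. The sample documents are illustrative.

```python
import xml.etree.ElementTree as ET

doc = '<account acct-type="checking" monthly-fee="5"><balance>400</balance></account>'
root = ET.fromstring(doc)

# Attributes are exposed as a dict mapping attribute name to string value.
print(root.attrib)

# Attribute values are always strings, so numeric use needs a conversion.
fee = int(root.get("monthly-fee"))

# The same attribute name twice in one start tag is not well-formed XML.
duplicate = '<account acct-type="checking" acct-type="savings"/>'
try:
    ET.fromstring(duplicate)
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True

print(fee, duplicate_rejected)
```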
Class Activity 9

• Convert the following Tree structure to bookstore.xml

Attributes vs. Subelements
• Distinction between subelement and attribute
• In the context of documents, attributes are part of markup, while
subelement contents are part of the basic document contents
• In the context of data representation, the difference is unclear and
may be confusing
• Same information can be represented in two ways
<account account_number = "A-101"> …. </account>

<account>
<account_number>A-101</account_number> …
</account>
• Suggestion: use attributes for identifiers of elements, and use
subelements for contents
More on XML Syntax
• Elements without subelements or text content can be abbreviated by
ending the start tag with a /> and deleting the end tag
• <account number="A-101" branch="Perryridge" balance="200" />
• To store string data that may contain tags, without the tags being
interpreted as subelements, use CDATA as below
• <![CDATA[<account> … </account>]]>
Here, <account> and </account> are treated as just strings
CDATA stands for “character data”, text that will NOT be parsed by a parser
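Both points can be tried out with Python's standard xml.etree.ElementTree, as in the sketch below. The element name note and the sample strings are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# An abbreviated (self-closing) element has no children and no text.
empty = ET.fromstring('<account number="A-101" branch="Perryridge" balance="200" />')
print(len(empty), empty.text)

# Inside CDATA, markup characters are kept as plain character data.
doc = "<note><![CDATA[<account> A-101 </account>]]></note>"
note = ET.fromstring(doc)

print(len(note))   # number of sub-elements of <note>
print(note.text)   # the CDATA content, returned as an ordinary string
```

The parser reports no sub-elements under note, because the tags inside the CDATA section were never interpreted as markup.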
XML Document Schema

• Database schemas constrain what information can be stored, and the
data types of stored values
• XML documents are not required to have an associated schema
• However, schemas are very important for XML data exchange
• Otherwise, a site cannot automatically interpret data received from another
site
• Two mechanisms for specifying XML schema
• Document Type Definition (DTD)
• Widely used
• XML Schema
• Newer, increasing use
Why DTDs?

• XML documents are designed to be processed by computer programs

• If you can put just any tags in an XML document, it’s very hard to write a
program that knows how to process the tags
• A DTD specifies what tags may occur, when they may occur, and what
attributes they may (or must) have
• A DTD allows the XML document to be verified (shown to be legal)
• A DTD that is shared across groups allows the groups to produce
consistent XML documents
DTD example
XML:
<?xml version="1.0"?>
<!DOCTYPE weatherReport SYSTEM "http://www.mysite.com/mydoc.dtd">
<weatherReport>
<date>05/29/2002</date>
<location>
<city>Philadelphia</city>
<state>PA</state>
<country>USA</country>
</location>
<temperature-range>
<high scale="F">84</high>
<low scale="F">51</low>
</temperature-range>
</weatherReport>
mydoc.dtd:
<!ELEMENT weatherReport (date, location, temperature-range)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT location (city, state, country)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT temperature-range ((low, high)|(high, low))>
<!ELEMENT low (#PCDATA)>
<!ELEMENT high (#PCDATA)>
<!ATTLIST low scale (C|F) #REQUIRED>
<!ATTLIST high scale (C|F) #REQUIRED>
XML Parsing
https://www.cs.odu.edu/~sampath/courses/f18/cs795/files/data/country_data.xml
import xml.etree.ElementTree as et

tree = et.parse('country_data.xml')
root = tree.getroot()
# root has a tag and a dictionary of attributes:
print(root.tag)
# print(root.attrib)
# Children are nested, and we can access specific child nodes by index:
print(root[0][1].text)
# It also has child nodes over which we can iterate:
for child in root:
    print(child.tag, child.attrib)
# For more information: https://docs.python.org/3/library/xml.etree.elementtree.html
JSON as an XML Alternative
• JSON = JavaScript Object Notation
• It’s really language independent
• most programming languages can easily read it and instantiate objects or
some other data structure
• JSON is a light-weight alternative to XML for data-interchange
• Started gaining traction ~2006 and is now widely used
• http://json.org/ has more information
JSON Data – A name and a value
• A name/value pair consists of a field name (in double quotes), followed by
a colon, followed by a value
• A JSON object is an unordered set of name/value pairs
• Begins with { (left brace)
• Ends with } (right brace)
• Each name is followed by : (colon)
• Name/value pairs are separated by , (comma)

{
"employee_id": 1234567,
"name": "Jeff Fox",
"hire_date": "1/1/2013",
"location": "Norwalk, CT",
"consultant": false
}
JSON Data – A name and a value
• In JSON, values must be one of the following data types:
• a string
• a number
• an object (JSON object)
• an array
• a boolean
• null

{
"employee_id": 1234567,
"name": "Jeff Fox",
"hire_date": "1/1/2013",
"location": "Norwalk, CT",
"consultant": false
}
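To see how these value types surface in a program, the sketch below loads a record with Python's standard json module and prints the Python type chosen for each value. The fields skills, manager and end_date are hypothetical additions to the slide's example, included only so that every JSON type appears once.

```python
import json

# "skills", "manager" and "end_date" are hypothetical fields added so the
# record covers an array, a nested object and null.
text = """{
  "employee_id": 1234567,
  "name": "Jeff Fox",
  "skills": ["XML", "JSON"],
  "manager": {"name": "Ann"},
  "consultant": false,
  "end_date": null
}"""

record = json.loads(text)

# Print the Python type that each JSON value was converted to.
for name, value in record.items():
    print(name, type(value).__name__)
```

With this module, a JSON number without a fraction becomes an int, a string a str, an array a list, an object a dict, a boolean a bool, and null becomes None.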
JSON Data – A name and a value
• Strings in JSON must be written in double quotes.
{ "name":"John" }

• Numbers in JSON must be an integer or a floating point.

{ "age":30 }

• Values in JSON can be objects.

{
"employee":{ "name":"John", "age":30, "city":"New York" }
}

• Values in JSON can be arrays.

{
"employees":[ "John", "Anna", "Peter" ]
}
Another example: XML vs JSON
<?xml version="1.0"?>
<employees>
    <employee>
        <firstName>John</firstName> <lastName>Doe</lastName>
    </employee>
    <employee>
        <firstName>Anna</firstName> <lastName>Smith</lastName>
    </employee>
    <employee>
        <firstName>Peter</firstName> <lastName>Jones</lastName>
    </employee>
</employees>

{"employees":[
    { "firstName":"John", "lastName":"Doe" },
    { "firstName":"Anna", "lastName":"Smith" },
    { "firstName":"Peter", "lastName":"Jones" }
]}
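The correspondence between the two versions can be made concrete with a small conversion. This is a sketch under a strong assumption: every employee element contains only flat, text-only sub-elements and no attributes, as in the example above. It is not a general XML-to-JSON converter.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """<employees>
    <employee><firstName>John</firstName> <lastName>Doe</lastName></employee>
    <employee><firstName>Anna</firstName> <lastName>Smith</lastName></employee>
    <employee><firstName>Peter</firstName> <lastName>Jones</lastName></employee>
</employees>"""

root = ET.fromstring(xml_text)

# Each <employee> becomes one object; each sub-element becomes a name/value pair.
employees = [{field.tag: field.text for field in employee} for employee in root]

# The repeated <employee> elements collapse into a single JSON array.
json_text = json.dumps({root.tag: employees}, indent=4)
print(json_text)
```

The repeated employee tags disappear in the JSON form because array position, rather than a tag name, identifies each entry.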
JSON Parsing
import json
from pprint import pprint
from pandas import DataFrame

# Adjacent string literals are joined, so the JSON text can span two lines.
json_string = ('{"first_name": "Guido", "last_name": "Rossum", '
               '"phone": [9098693256, 9097846521]}')
parsed_json = json.loads(json_string)
data = DataFrame(parsed_json)
print(parsed_json['first_name'])
phone = list(parsed_json['phone'])
print(phone)
print(data)

Note: To read an external file, use json.load, and use pprint (the data pretty
printer) to display the content of the JSON file.
with open('data.json') as f:
    data = json.load(f)
pprint(data)
Class Activity 10
• Convert the following bookstore.xml to bookstore.json
<?xml version="1.0"?>
<bookstore>
<book category="sci-fi">
<title lang="en"> 2001</title>
<author>Arthur C. Clarke</author>
<price>$30.0</price>
<year>1968</year>
</book>
<book>
<title lang="rs">Story about a True Man</title>
<author>Boris Polevoy</author>
<price>$20.00</price>
<year>1952</year>
</book>
</bookstore>
XML vs JSON
• JSON is Like XML Because
• Both JSON and XML are "self describing" (human readable)
• Both JSON and XML are hierarchical (values within values)
• Both JSON and XML can be parsed and used by lots of programming languages

• JSON is Unlike XML Because

• JSON doesn't use end tags
• JSON is shorter
• JSON is quicker to read and write
• JSON can use arrays
• JSON has a better fit for OO systems than XML

• The biggest difference is:

• XML has to be parsed with an XML parser. JSON can be parsed by a standard
JavaScript function.
Why JSON?

Exchanging data from a web server to a browser involves the following steps:
Using XML
1. Fetch an XML document from the web server.
2. Use the XML DOM to loop through the document.
3. Extract values and store them in variables.
4. Convert the extracted values to the required types.

Using JSON
1. Fetch a JSON string.
2. Parse the JSON using JavaScript functions.
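The slide describes these steps for JavaScript running in a browser. The sketch below is only a Python analogue of the same contrast, using the standard xml.etree.ElementTree and json modules, with a hypothetical person record given in both formats.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
xml_text = "<person><name>John</name><age>30</age></person>"
json_text = '{"name": "John", "age": 30}'

# XML: walk the tree, extract each value, and convert types by hand,
# because element text is always a string.
root = ET.fromstring(xml_text)
from_xml = {
    "name": root.find("name").text,
    "age": int(root.find("age").text),
}

# JSON: a single parse call yields values that are already typed.
from_json = json.loads(json_text)

print(from_xml == from_json)
```

Both routes end with the same dictionary, but the XML route needed explicit lookups and an int conversion to get there.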
