Lecture 5 - Semi-Structured Data
Attributes vs. Subelements
• Distinction between subelement and attribute
• In the context of documents, attributes are part of markup, while
subelement contents are part of the basic document contents
• In the context of data representation, the difference is unclear and
may be confusing
• Same information can be represented in two ways
<account account_number="A-101"> … </account>
<account>
<account_number>A-101</account_number> …
</account>
• Suggestion: use attributes for identifiers of elements, and use
subelements for contents
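The two representations above can be compared with a minimal Python sketch. It assumes the standard-library xml.etree.ElementTree parser (not named in the lecture) and leaves out the elided account contents, so each document holds only the account number.

```python
import xml.etree.ElementTree as ET

# Attribute form: the account number lives in the start tag.
as_attribute = ET.fromstring('<account account_number="A-101"/>')

# Subelement form: the account number is child content.
as_subelement = ET.fromstring(
    "<account><account_number>A-101</account_number></account>"
)

# Attributes are read with .get(); subelement text with .findtext().
number_from_attribute = as_attribute.get("account_number")
number_from_subelement = as_subelement.findtext("account_number")

print(number_from_attribute, number_from_subelement)
```

Both forms carry the same value; only the access path differs, which is why the choice is largely a matter of convention.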
More on XML Syntax
• Elements without subelements or text content can be abbreviated by
ending the start tag with a /> and deleting the end tag
• <account number="A-101" branch="Perryridge" balance="200" />
• To store string data that may contain tags, without the tags being
interpreted as subelements, use CDATA as below
• <![CDATA[<account> … </account>]]>
Here, <account> and </account> are treated as plain strings, not as tags
CDATA stands for "character data": text that the parser does NOT interpret as markup
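A short Python sketch of both syntax rules, assuming the standard-library xml.etree.ElementTree parser. The <note> wrapper element is hypothetical; it is only there because a CDATA section must sit inside some element.

```python
import xml.etree.ElementTree as ET

# Empty-element tag: no end tag, no text, no children.
empty = ET.fromstring(
    '<account number="A-101" branch="Perryridge" balance="200"/>'
)
empty_child_count = len(empty)
empty_text = empty.text  # None, because there is no content

# CDATA section: the tags inside are kept as literal text.
# <note> is a hypothetical wrapper element.
wrapped = ET.fromstring("<note><![CDATA[<account> … </account>]]></note>")
cdata_text = wrapped.text
cdata_child_count = len(wrapped)

print(empty.attrib)
print(repr(cdata_text), cdata_child_count)
```

The wrapper ends up with zero child elements: the parser hands back the CDATA contents as one string.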
XML Document Schema
{
"employee_id": 1234567,
"name": "Jeff Fox",
"hire_date": "1/1/2013",
"location": "Norwalk, CT",
"consultant": false
}
JSON Data – A name and a value
• In JSON, values must be one of the following data types:
• a string
• a number
• an object (JSON object)
• an array
• a boolean
• null
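As a rough illustration of the six value types, the sketch below parses one member of each kind with Python's json module and reports the Python type each one maps to. The member names (a_string, a_number, and so on) are made up for this example.

```python
import json

# One member per JSON value type; member names are illustrative only.
document = """
{
  "a_string": "Jeff Fox",
  "a_number": 1234567,
  "an_object": {"location": "Norwalk, CT"},
  "an_array": [1, 2, 3],
  "a_boolean": false,
  "a_null": null
}
"""

parsed = json.loads(document)

# Map each member name to the name of the Python type it was parsed into.
python_types = {name: type(value).__name__ for name, value in parsed.items()}

for name, type_name in python_types.items():
    print(f"{name}: {type_name}")
```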
• Strings in JSON must be written in double quotes.
{ "name":"John" }
{"employees":[
{ "firstName":"John", "lastName":"Doe" },
{ "firstName":"Anna", "lastName":"Smith" },
{ "firstName":"Peter", "lastName":"Jones" }
]}
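The double-quote rule and the array example can be checked with a small Python sketch using the standard json module. It assumes only the sample data shown above; the variable names are illustrative.

```python
import json

# Single-quoted strings are valid Python but not valid JSON.
try:
    json.loads("{ 'name':'John' }")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False

employees_json = """
{"employees":[
  { "firstName":"John", "lastName":"Doe" },
  { "firstName":"Anna", "lastName":"Smith" },
  { "firstName":"Peter", "lastName":"Jones" }
]}
"""

# A JSON array of objects becomes a Python list of dicts.
employees = json.loads(employees_json)["employees"]
last_names = [person["lastName"] for person in employees]

print(single_quotes_accepted)
print(last_names)
```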
JSON Parsing
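import json
from pandas import DataFrame  # DataFrame is assumed to come from pandas

# The JSON text must be one valid Python string literal.
json_string = ('{"first_name": "Guido", "last_name": "Rossum", '
               '"phone": [9098693256, 9097846521]}')
parsed_json = json.loads(json_string)   # JSON text -> Python dict
data = DataFrame(parsed_json)           # one row per phone number
print(parsed_json['first_name'])
phone = list(parsed_json['phone'])
print(phone)
print(data)

Note: To read from an external file, use json.load, and use pprint
(pretty print) to display the content of the JSON file.

from pprint import pprint
with open('data.json') as f:            # close the file after reading
    data = json.load(f)
pprint(data)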
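For trying json.load without an existing data.json, here is a self-contained sketch that first writes a file and then reads it back. The temporary directory and the use of json.dump are additions for this example, not part of the lecture code.

```python
import json
import os
import tempfile
from pprint import pformat

record = {
    "first_name": "Guido",
    "last_name": "Rossum",
    "phone": [9098693256, 9097846521],
}

# Write the record to a throwaway file, then read it back with json.load.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, indent=2)
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# pformat returns the same text that pprint would print.
print(pformat(loaded))
```

Note the naming: json.loads and json.dumps work on strings, while json.load and json.dump work on file objects.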
Class Activity 10
• Convert the following bookstore.xml to bookstore.json
<?xml version="1.0"?>
<bookstore>
<book category="sci-fi">
<title lang="en"> 2001</title>
<author>Arthur C. Clarke</author>
<price>$30.0</price>
<year>1968</year>
</book>
<book>
<title lang="rs">Story about a True Man</title>
<author>Boris Polevoy</author>
<price>$20.00</price>
<year>1952</year>
</book>
</bookstore>
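There is no single correct XML-to-JSON mapping, so the sketch below is only one possible convention, not the expected answer. It assumes attributes are stored under "@name" keys, element text under "#text" when attributes are present, and repeated tags are collected in a list. The function name element_to_value and the shortened sample document are made up for this example.

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert an element to a str (plain leaf) or a dict (anything else)."""
    children = list(element)
    text = (element.text or "").strip()
    if not children and not element.attrib:
        return text
    # Assumed convention: attributes become "@name" members.
    value = {"@" + name: attr for name, attr in element.attrib.items()}
    if text:
        value["#text"] = text
    for child in children:
        converted = element_to_value(child)
        if child.tag in value:
            # Repeated tag: collect the values in a list.
            if not isinstance(value[child.tag], list):
                value[child.tag] = [value[child.tag]]
            value[child.tag].append(converted)
        else:
            value[child.tag] = converted
    return value


# Shortened version of bookstore.xml (author and price left out).
sample = """
<bookstore>
  <book category="sci-fi">
    <title lang="en">2001</title>
    <year>1968</year>
  </book>
  <book>
    <title lang="rs">Story about a True Man</title>
    <year>1952</year>
  </book>
</bookstore>
"""

root = ET.fromstring(sample)
converted = {root.tag: element_to_value(root)}
print(json.dumps(converted, indent=2))
```

All values come out as strings here; turning the year or price into JSON numbers is a further design decision.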
XML vs JSON
• JSON is like XML because:
• Both JSON and XML are "self describing" (human readable)
• Both JSON and XML are hierarchical (values within values)
• Both JSON and XML can be parsed and used by lots of programming languages
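To illustrate the last two points, here is a sketch that reads the same nested record from both formats using Python's standard library. The XML encoding of the employee record is an assumption made for this example; the lecture only shows the JSON form.

```python
import json
import xml.etree.ElementTree as ET

# The same record in both formats (the XML layout is assumed).
xml_text = (
    "<employee><name>Jeff Fox</name>"
    "<location>Norwalk, CT</location></employee>"
)
json_text = '{"employee": {"name": "Jeff Fox", "location": "Norwalk, CT"}}'

# XML: navigate the element tree by tag name.
name_from_xml = ET.fromstring(xml_text).findtext("name")

# JSON: navigate nested dicts by key.
name_from_json = json.loads(json_text)["employee"]["name"]

print(name_from_xml, name_from_json)
```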
Using JSON
5. Fetch a JSON string.
6. Parse the JSON using JavaScript functions.