Monday 25 May 2015

Rethinking database schema with RDF and Ontology

When I joined the industry 10 years ago, my first project used a relational database. After that, my next project also used a relational database. And as you may guess, the projects after that used relational databases too. This went on for so long that I almost forgot that a table is just one format for storing data.

I only became interested in other kinds of databases four years ago, when my company slowly moved into Big Data analysis and knowledge management. Over these years, my exposure to RDF and Ontology has given me an urge to revisit and rethink the approach and principles behind building a database.

In the scope of this article, I will focus solely on the role of the database schema. Please do not worry if the terms RDF and Ontology sound unfamiliar; I will do my best to introduce these concepts along the way.

Background

Graph as knowledge database

As everyone knows, creating a relational database starts with defining the database schema. No matter which database we choose, we still need to define tables, columns and foreign keys before inserting data. In other words, we need to decide what the data will look like before we have any of it.

However, this is not a suitable approach for building a knowledge database. Because the system is supposed to store future knowledge rather than only current knowledge, it is impossible to know in advance what the data will look like. Therefore, we turned away from relational databases and looked for other solutions.

There are many NoSQL databases that support schema-less data, but most of them did not fit our requirement because we wanted to store linked data. Hence, a graph database seemed to be the most sensible choice for us.

Resource Description Framework

A round of window shopping in the market left us uneasy, as there was no widely adopted standard for the query language. If we chose a graph database, we might end up writing a vendor-specific implementation, which did not seem like a good strategy to start with.

In an attempt to find a common standard, we came across the Resource Description Framework (RDF), a W3C specification for modelling information about web resources. It seemed to be the best choice for us because RDF comes with a very simple mechanism for describing links between resources. The only side effect is that every resource needs to be identified by a URI.

For example, to describe that Apple produces the iPhone 5, which is sold at 600 USD, we need two triples like the ones below.

<http://example.org/Apple> <http://example.org/produce> <http://example.org/iPhone5>
<http://example.org/iPhone5> <http://example.org/price> '600 USD'@en

We had no real interest in using URIs as resource identifiers, but still needed to do so in order to comply with the standard. However, we cannot blame RDF for this, because it was invented for the web. Its authors had a nice dream in which we could follow the URI of a resource to retrieve it. Obviously, this idea did not quite come true.

Leaving the original idea of RDF to one side, the major benefit for us is the query language SPARQL. For example, to find the price of any Apple phone, the query would be:

select ?phone ?price where {
       <http://example.org/Apple> <http://example.org/produce> ?phone .
       ?phone <http://example.org/price> ?price
}

However, RDF is just a concept. In order to use RDF and SPARQL, we still need a decent API or implementation, preferably in Java. This led us to two popular choices, Sesame and Apache Jena. Both APIs are widely accepted and have several vendor implementations. Even better, some vendors provide implementations for both of them.

Ontology

In the example above, it is easy to see that to write a meaningful query, we need to know what the data looks like. Therefore, we still end up needing some kind of data schema. However, this schema should act more like metadata than a data definition. In particular, the schema does not need to be defined before inserting data.

There are two strategies to address this issue. People started by defining common vocabularies for RDF. Because most of the time we already know the relationships among resources but not the resources themselves, a shared vocabulary is usually enough to help form a SPARQL query. However, given the wide scope of describing the world, no single vocabulary can fully express every domain, so there are many domain-specific vocabularies.
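
As a small illustration of my own (not part of the original example), the widely shared rdfs:label predicate from the RDF Schema vocabulary lets us attach a human-readable name to any resource, so a query can find resources by label without knowing their URIs in advance:

<http://example.org/iPhone5> <http://www.w3.org/2000/01/rdf-schema#label> 'iPhone 5'@en

select ?resource where {
       ?resource <http://www.w3.org/2000/01/rdf-schema#label> 'iPhone 5'@en
}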

While the first approach only tackles the possible relationships among resources, the second approach attempts to describe the resources as well. For example, to describe the resources in the previous example, we can define a class named Phone and a class named Manufacturer. After that, we can specify that a Manufacturer can produce a Phone.

<http://example.org/Apple> <http://example.org/type> <http://example.org/Manufacturer>
<http://example.org/iPhone5> <http://example.org/type> <http://example.org/Phone>
<http://example.org/Manufacturer> <http://example.org/produce> <http://example.org/Phone>

The triples above form an Ontology. Compared with a vocabulary, we found an Ontology to be a more descriptive way of expressing a data schema, because it tells us which kind of relationship is applicable to which kind of resource. Therefore, we will not waste time wondering whether the iPhone 5 produces Apple or Apple produces the iPhone 5.

Ontology plus RDF is a good combination for building a knowledge database. While the repository itself can be an RDF store that accepts any triple, we can build the ontology in a separate space to model the knowledge held in the main repository.
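
As a sketch of what this separation might look like (assuming the store supports SPARQL named graphs and that the ontology is kept in a hypothetical graph named <http://example.org/ontology>), a query can ask the ontology which relationships are applicable to a class:

select ?relation ?range where {
       graph <http://example.org/ontology> {
              <http://example.org/Manufacturer> ?relation ?range
       }
}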

From our point of view, it is better to use the ontology to form queries rather than to validate data, because this slightly decouples data and schema. In practice, it is perfectly acceptable to insert data that the ontology does not yet cover, as long as it does not contradict the ontology.

For example, with the Ontology defined earlier, we should allow inserting knowledge like

<http://example.org/Apple> <http://example.org/produce> <http://example.org/iPod>
<http://example.org/Apple> <http://example.org/produce> <http://example.org/iPad>

but we can reject any knowledge like

<http://example.org/iPhone5> <http://example.org/produce> <http://example.org/Apple>

With this approach, we can import data first and model it later.
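
One simple way such a rejection could be implemented (a sketch only, assuming the type and ontology triples from the earlier example are queryable alongside the data) is to ask whether the predicate actually links the classes of the two resources before accepting a new triple. For the rejected statement above, the following ASK query returns false; resources that do not have a type yet simply skip the check and are modelled later:

ask {
       <http://example.org/iPhone5> <http://example.org/type> ?subjectClass .
       <http://example.org/Apple> <http://example.org/type> ?objectClass .
       ?subjectClass <http://example.org/produce> ?objectClass
}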

Database Schema

A refreshing thought

Comparing our approach to building a knowledge database with relational databases makes me feel that the requirement to define the data schema before inserting data is driven by implementation concerns. If we temporarily set aside practical concerns like data indexing, it is possible to insert data first and define the data schema later.

This also led me to wonder whether it is feasible to have more than one data schema for the same piece of data. The idea may sound awkward at first, but it is quite realistic if we look at the daunting task of modelling the world. For example, if we think of Obama as the US president, we may care about when he started serving and when he will leave office; but if we think of Obama as any other US citizen, then we care more about his date of birth, residential area, social security number, and so on. In this way, the schema serves as a perspective through which we inspect and modify a resource.

So, if I could travel back to the time when people were discussing a common query language for databases, I would suggest adding some features to SQL to enrich it, or introducing a new kind of query language that is less strict:

  • Allow insertion without a table definition. Automatically create the table definition from the parameters of the insert command.
  • Make the id of a record unique per database rather than unique per table. A record can appear in many tables with the same id. An insert command needs to specify which field is the id of the record.
  • Make the data definition more generic, without size constraints (varchar and int instead of varchar(255) or int(11)).
  • The select command must comply with the table definition.
  • Allow a field to be shared between two tables for the same record.

Before wrapping up this article, let us try a quick exercise of building a database that implements these extended features. The underlying storage system could be any schema-less engine, but we will use an RDF store for simplicity.

Requirements
  • Insert Obama into the US citizen table with name, age and gender. The identifier field is name.
  • Insert Obama into the US president table with name, age and elected year. The identifier field is name.
  • Define the US citizen table with fields name and age.
  • Define the US president table with fields name, age and elected year.
  • Selecting records from the US citizen table will only show name and age, as gender is not defined in the table definition.
  • Updating the Obama record in the US president table with a new age will also affect the age in the US citizen table, because it is a shared field.

Implementations

Step 1
  • SQL: insert into citizen(name, gender, age) values ('Barack Obama', 'Male', 53)
  • Triples:
    • <Barack Obama> <type> <citizen>
    • <Barack Obama> <name> 'Barack Obama'
    • <Barack Obama> <gender> 'Male'
    • <Barack Obama> <age> 53
Step 2
  • SQL: insert into president(name, age, elected_year) values ('Barack Obama', 53, 2009)
  • Triples (the name and age triples already exist from step 1, so only the new ones are added):
    • <Barack Obama> <type> <president>
    • <Barack Obama> <elected_year> 2009
Step 3
  • SQL: create table citizen ('name' varchar, 'age' int, primary key ('name') )
  • Triples:
    • <citizen> <field> <name>
    • <citizen> <field> <age>
    • <citizen> <primary_key> <name>
Step 4
  • SQL: create table president ('name' varchar, 'age' int, 'elected_year' int, primary key ('name') )
  • Triples:
    • <president> <field> <name>
    • <president> <field> <age>
    • <president> <field> <elected_year>
    • <president> <primary_key> <name>
Step 5
  • SQL: select * from citizen
  • SPARQL:
    select ?record ?field_name ?field_value where {
           ?record <type> <citizen> .
           <citizen> <field> ?field_name .
           ?record ?field_name ?field_value
    }
Step 6
  • SQL: update president set age=54 where name='Barack Obama'
  • Overwrite <Barack Obama> <age> 53 with <Barack Obama> <age> 54
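
In SPARQL 1.1 Update terms, this overwrite could be written as a single delete/insert (a sketch using the same shorthand identifiers as the triples above; a real store would need proper URIs). Because the age triple belongs to the record rather than to either table, the change is automatically visible through both the citizen and the president perspectives:

delete { <Barack Obama> <age> ?old }
insert { <Barack Obama> <age> 54 }
where  { <Barack Obama> <age> ?old }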

Conclusions

I think some of the ideas above are a bit hard to digest if the audience is not familiar with RDF or Ontology. Still, I hope they raise some more thoughts on bridging the gap between relational databases and knowledge databases.

Tuesday 12 May 2015

First Agile impression

Last year, we had a mass recruitment of Java developers with various levels of experience. Unfortunately, in this part of the world (Asia), Agile has not been very widely adopted. Therefore, we ended up spending extra effort getting new members familiar with Agile and XP.

Over the team-forming process, which lasted about six months, most of the new team members provided positive feedback about working in an Agile environment. In this article, I would like to share one piece of that feedback with you. Enjoy!

It has been a quickfire six months working in an agile environment. This is my first foray into agile software development in nearly a decade of working in software. Hearing and reading about it could not have prepared me for the actual experience of working in this peculiar environment. However, the overwhelming sentiment for me is - what a breath of fresh air!

The morning standups were unsettling, to put it mildly. Suddenly, you stopped having excuses for not having done anything the day before :). They take some getting used to, but slowly I’m learning that it is more a sharing session than a status update. Easier said than done, but I’m getting there.  Not sure what you're going to do today? Tell everyone - there’s always something to do.

I like the inherent openness that Agile brings to the table. Seemingly mundane things like outstanding tasks become more explicit and we are all the better for it. Finish a task at hand? No one's stopping you from going to the board and picking a new one. Stuck with a sticky issue? Bring it up in the standup and more often than not, offers of help can be expected. Oftentimes though, a quick holler is all that will be required.

Whole days dedicated to planning and retrospectives demand concentration and focus and more often than not, creativity. I have found that they mean less disruption during the iteration for the most important work of all - writing good code. Worthwhile, no?

We practice pair programming. For someone new to this, the intensity is unexpected as you try to  align yourself to your pair's thought process in a continuous back-and-forth cycle. Overall, I have found it to be quite draining, but I believe the upside cannot be underestimated. Each pairing session, even with the same colleague, seems to involve a whole new dynamic and the constant adjustment needed can be likened to a skill. Working in such close proximity can be a double-edged sword, though. Some friction is inevitable and I have often experienced a whole range of emotions whilst working through a pairing session. I try to manage these emotions and reflect upon them afterwards to understand why I’d felt the way I’d felt and how I could have done better. It is undeniably rewarding though - I have learnt a great deal about myself and my pairs during those sessions.

Test-driven development is standard practice here. Mastering, or rather, adhering to the red-green-refactor pattern seemed counter-intuitive at first, but it starts to make sense after a while. In my limited experience, writing tests becomes more fluid with practice. Since there is almost never an excuse for not providing test coverage for any code that will see the light of day (much less production code!), buckling down and writing that test will be a good habit to develop and one which I am convinced will prove to be an invaluable skill and an integral part in my journey to become a better developer.

Building software has never been quite so engaging and dare I say it..fun. Having a close knit team definitely helps. Hopefully, we can keep the good momentum and spirit going as we welcome new members into our fold and the workload ramps up. It will not be easy, but nobody said it would be. Whatever the future holds, I await with bated breath. Onwards.


Zhi Liang