In parts 1 and 2, we saw how relational databases need us to break up data and store it in little pieces. We specifically talked about splitting up data into different tables in part 1. In this part, we will see how we can get the different pieces of data from different tables together again and what kind of problems this merging causes.
Recall
In part 1 of this series, we ended up with two tables, which look like the following.
Contact serial number | Own name | Surname |
---|---|---|
1 | Sanjaya | Chowdhury |
2 | Purushottama | Kulkarni |
Contact serial number | Phone type | Phone number |
---|---|---|
1 | Personal | 9422515373 |
1 | Work | 9758085846 |
2 | Personal | 9169362975 |
2 | Backup | 9707818584 |
Rules mandating that only data which cannot be broken up further can be stored in relational databases forced us to make the above two tables. The first table contains a list of contact names and assigns each one an ID. The second table contains the list of phone numbers corresponding to each contact. This ensures that each table serves exactly one purpose, as per relational database rules.
Bringing the data together
It is possible to create a relational table which brings the data from the two tables together. To find out corresponding records, the procedure matches data from columns which are common to both tables. In our case, the ‘contact serial number’ acts as the glue that holds the records in the two tables together. The serial numbers are matched and a entire row of information is built. This procedure is called ‘JOIN’ing tables. Here is how the joined table looks.
Contact serial number | Own name | Surname | Phone type | Phone number |
---|---|---|---|---|
1 | Sanjaya | Chowdhury | Personal | 9422515373 |
1 | Sanjaya | Chowdhury | Work | 9758085846 |
2 | Purushottama | Kulkarni | Personal | 9169362975 |
2 | Purushottama | Kulkarni | Backup | 9707818584 |
What just happened!
You can see the problems of ‘join’ing straightaway. Contact serial number was used as the common ground between the two tables. And it appears more than once in the second table. This means that the data repeats. The number of repetitions is dictated by the table where the same data appears multiple times. But there’s worse!
The problem of combinatorics
Let’s say that we are joining 2 tables, with table 1 having 3 rows with the same data in common field and table 2 having 2 rows with the same data in the common field. The number of repetitions will be a multiplication: 3 x 2. We will see 6 rows with nearly similar data, all being the possible combinations of the rows with the common field. Let’s see an example.
Table 1: Contains a list of customer orders
Customer ID | Order date | Item | Quantity |
---|---|---|---|
C00001 | 1st December 2016 | Farm Fresh Pizza | 2 |
C00001 | 7th December 2016 | Garlic bread sticks | 5 |
C00001 | 15th December 2016 | Choco Lava cake | 2 |
Table 2: Contains a list of all possible customer delivery addresses
Customer ID | Delivery Address |
---|---|
C00001 | Somewhere in Ghatkopar, Mumbai |
C00001 | Somewhere in Vashi, Mumbai |
Joined:
Customer ID | Order date | Item | Quantity | Delivery address |
---|---|---|---|---|
C00001 | 1st December 2016 | Farm Fresh Pizza | 2 | Somewhere in Ghatkopar, Mumbai |
C00001 | 1st December 2016 | Farm Fresh Pizza | 2 | Some in Vashi, Mumbai |
C00001 | 7th December 2016 | Garlic bread sticks | 5 | Somewhere in Ghatkopar, Mumbai |
C00001 | 7th December 2016 | Garlic bread sticks | 5 | Some in Vashi, Mumbai |
C00001 | 15th December 2016 | Choco Lava cake | 2 | Somewhere in Ghatkopar, Mumbai |
C00001 | 15th December 2016 | Choco Lava cake | 2 | Some in Vashi, Mumbai |
Well, all that an innocent data analyst wanted was to be able to generate a report page which contains a list of all the orders by a customer and all the possible delivery addresses where he wants orders delivered. However, after joining the resulting table, which acts like a lesson in mathematical combination, makes no sense!
And also note that this confused set of records was for ONLY ONE customer C00001. Imagine what would happen with the data spanning 1000 customers!
Problem of computation and memory
Joining is an intensive process. It matches rows from multiple tables and creates new bigger rows. This consumes a lot of processor and memory. Many reports, which depend on joining tables, are set up to generated overnight or during weekends. Real-time reporting is nearly impossible, unless you have a super computer.
Problem of fragmented solutions
We saw how the joining process got confused if both the tables have common data appearing multiple times. Due to this, programmers avoid joining tables all together and querying the database multiple times in succession, each time extracting different pieces of data from different tables.Whenever we query a database, each query is held in a queue until the database is ready to get to it. Querying the database multiple times for fragments of data is tedious and can lead to programs parked in the queue for too long, leading to delays. We are better off extracting larger chunks of data with fewer queries, whenever possible, to avoid standing in the queue all day.
The elegance of NoSQL
In part 1, we already saw how tight and neat the table is for non-relational databases. Here is the non-relational example for the contacts database.
Serial Number | Name | Phones |
---|---|---|
1 | Sanjaya Chowdhury | Personal: 9422515373 Work: 9758085846 |
2 | Purushottama Kulkarni | Personal: 9169362975 Backup: 9707818584 |
Now let us look at how we solve the problem of customer orders and delivery addresses.
Customer ID | Orders | Addresses |
---|---|---|
C00001 | ( Date: 1st December 2016: ( Item: Farm Fresh Pizza Quantity: 2 ), Date: 7th December 2016: ( Item: Garlic Bread sticks, Quantity: 5 ), Date: 15th December 2016: ( Item: Choco Lava cake Quantity: 2 ) ) | ( Somewhere in Ghatkopar, Mumbai Somewhere in Vashi, Mumbai ) |
Neither is data repeated, nor will we need to query seperately for seperate fragments of info.
Conclusion
We saw how joins are useful in bringing fragments of data together in relational tables, but how they cause data repetition and confusing results. We also saw how non-relational databases’ flexibility allows us to arrange information in new ways by nesting data in data and how it avoids confusing results and repetitive querying.