Database Classifications and the Main Concepts

Vachagan Mirzoian
23 min read · Mar 9, 2022

Programming is work done by applying a specific algorithm to data, but first that data has to be stored somewhere: in a database. The information is retrieved with queries written in a query language. These languages are made up of 5 logical parts (a short T-SQL illustration follows the list):

1) Data Definition Language, also called Data Description Language (DDL) — CREATE, ALTER, DROP, TRUNCATE,

2) Data Manipulation Language (DML) — INSERT, UPDATE, DELETE,

3) Data Control Language (DCL) — GRANT, REVOKE,

4) Transaction Control Language (TCL) — COMMIT, ROLLBACK

5) Data Query Language (DQL) — SELECT
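For illustration, here is one minimal T-SQL statement from each group; the Demo table, its columns and the SomeUser principal are assumptions made for this sketch, not part of any particular schema.

CREATE TABLE Demo (ID INT PRIMARY KEY, Name NVARCHAR(50));      -- DDL: define a structure
INSERT INTO Demo (ID, Name) VALUES (1, N'First row');           -- DML: change the data
GRANT SELECT ON Demo TO SomeUser;                               -- DCL: manage permissions (SomeUser must already exist)
BEGIN TRANSACTION;                                              -- TCL: control the unit of work
UPDATE Demo SET Name = N'Updated' WHERE ID = 1;
COMMIT;
SELECT ID, Name FROM Demo;                                      -- DQL: read the data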

There are many principles for classifying databases and their management systems. By business activity, one can distinguish cloud technologies and large data warehouses, which exist to meet the needs of the business.

By territorial distribution, we can single out the distributed database, whose data may be physically separated: the servers are peers, the data residing on separate computers is processed by different processors, and the servers are interconnected over the Internet. They may sit in the same data center at the same physical address, but regardless of the geographical location of its sub-elements, the whole database is one logical unit, a cluster. To keep the copies consistent, there must be an intensive connection between the nodes for Replication and Duplication. Distributed databases are Homogeneous or Heterogeneous; in homogeneous systems, all servers use the same DBMS.

There is also the Centralized database, which sits entirely in one place. This option is often used with LAN networks, when the organization occupies a limited space and uses a central server on which the database is located. A Decentralized system has several such centers; unlike in a centralized database, the sub-elements may differ in their rights and capabilities.

In terms of software architecture, a database can be a stand-alone system separate from the client that uses it, or it can be embedded in the program and used as a library, as SQLite is. Another example is SQL Server Compact, which compacts and compresses the stored data so that it fits into less physical memory.

Parallel databases are designed to achieve speed, especially for processing large volumes of information.

Database data can be Structured or Unstructured, and intermediate types are possible. Data that has a Data Model is structured. There are also Semi-structured types.

Depending on whether the data is structured or unstructured, databases also follow models, relational or non-relational, and some databases work with more than one model. The classic example of relational databases is SQL-based systems. Non-relational databases include key-value stores, wide-column stores, document-oriented databases, graph databases, and so on. Search-engine and multi-model databases combine properties of both groups.

The table shows a comparison of the features of the 5 fundamental models.

Relational Database — data is stored in tables; the Cartesian product is the mathematical base that allows rows and columns to be combined and related to each other.

Classic examples are SQL, T-SQL and MySQL. Since a table is a single, independent unit in a SQL database, all rows must have the same columns, so to extend one row you must add a column to the whole table, put the value at its intersection with the required row, and write NULL in the other rows.

We add a whole column and delete a whole column; it is not possible to delete just one specific cell. Column-oriented databases were designed to correct this shortcoming. It can be compared to working with dynamic and static arrays: with dynamic arrays the length changes, while with static arrays the best option is to write NULL or zeros. In relational databases, table normalization is extremely important: data relating to an object is grouped, depending on its type, into separate tables so that only homogeneous data is found in each table. In a non-relational database, the specific object with all its fields is stored in a single document, so mixed data sits in one place.

Object Relational Mappers are used to bridge the data query language and the server-side language. An ORM transforms table relationships into the objects or classes used by most programming languages, letting developers step away from the many intertwined relationships between tables. SQL Server and .NET use Entity Framework, which allows you to work with the database directly from any .NET language, such as C#, while staying away from the database layer. Entity Framework is thus an interface; without it we would have to use ADO.NET, which involves writing SQL-style code.

The table shows the possible concepts and their meaning.

Key-value pairs — the data are entries of an associative array backed by a hash table, or elements of a dictionary; in other words, each row is a self-contained unit of data. Unlike in other databases, empty fields are not filled with placeholders, so memory is not overloaded. It is the simplest form and the main abstraction of non-relational databases. Many systems, such as Redis, operate on an in-memory basis, keeping the data in RAM. It is effective for caching, real-time operations, etc.

Graph — the main information is not the static data but the connections and relationships between the elements of the database; in other words, it works with a relationships-first approach. It is based on directed graphs, so it has vertices (Nodes), which are objects of any type and can therefore contain fields; edges, which show the connections between vertices; and properties, which store additional information. Typical systems are TigerGraph with the GSQL language and Neo4j with Cypher. ArangoDB, for example, works with documents and key-value pairs in addition to graphs. The model lets you avoid the associative entities, joins and lookup tables used in relational databases, which bring extra schemas, connections and a more complex architecture. Since graph databases focus on relationships, this approach is most convenient for storing interconnected data with complex relationships. An example of such a problem is supply chain optimization, which occurs in production chains, when you have to get different parts from different suppliers and assemble the final product. Logistics problems can be improved indefinitely using graphs. Graphs are also used to ensure internet security and to prevent bank card fraud, when many consecutive transactions must be analyzed to find deviations; this may require machine learning elements.

Document-oriented Database — intended for general use, so non-homogeneous data can be stored in the same document, which causes normalization problems. It is the main type of non-relational database and a subset of key-value databases; conceptually it is the diametric opposite of relational databases. Document-based and other non-relational databases use Object Document Mapping, but this is troublesome: unlike relational databases, non-relational databases have no standardization, each database has its own peculiarities, and developing a limited-use ODM is not economically viable. The information is stored in the document in JSON, XML or another format. Within the database, the document is accessible through a unique index or key, as well as through metadata. Examples are MongoDB and RedisJSON.

Wide Column Database — the unit of information is the full column; each column is a separate file and can be stored separately in memory. Retrieving specific data is faster because a column contains values of the same type, which are serialized and deserialized quickly. At first glance a column-based database may look very similar to a relational one, but the columns can be changed easily. It can be represented as a two-dimensional key-value store.

There are different implementations of column-based databases; in each implementation the row or column priority differs, that is, in addition to the column, the row can also play a role, or the DBMS itself can matter in column-oriented databases. In a Super column database, each element (column) can itself be compound and contain more than one sub-element. The picture shows how further virtual sub-tables are created inside the table.

A special case is JSON, which contains an object, which has sub-objects, which in turn have their own sub-objects, and so on.
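As a hedged illustration in T-SQL (SQL Server 2016 and later can query JSON text), the nested fields of such a document can be reached with a path; the document content and column aliases below are invented for the sketch.

DECLARE @doc NVARCHAR(MAX) =
    N'{ "user": { "name": "Ann", "address": { "city": "Yerevan", "zip": "0010" } } }';
SELECT JSON_VALUE(@doc, '$.user.name')         AS UserName,   -- first level
       JSON_VALUE(@doc, '$.user.address.city') AS City;       -- sub-object of a sub-object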

Looking at the appearance of the tree, we see the existence of subfields:

The whole database is one Keyspace.

Obvious examples are time series, such as temperature, seismic activity, biometric data and other periodic measurements. For example, one day of the month is the key for the column, and the seismic-activity readings form the whole row of data. Technologies include the CQL language, Apache HBase and ScyllaDB.

Search Engine Database — designed to work with large amounts of data in almost real time. Well-known technologies are Solr and Elasticsearch. Elasticsearch provides REST APIs and uses schema-less JSON documents. Queries against the endpoints are executed via HTTP verbs such as GET, POST, PUT, PATCH and DELETE. The documents are located on different servers, so queries run in parallel. The database is pre-indexed, allowing data to be retrieved faster.

Multi-model — combines the principles of relational and non-relational databases. An example is Fauna and its API, which is primarily compatible with GraphQL, but many programming languages can use Fauna for their own queries as well. JSON is used relatively infrequently, and replication is not required here.

When implementing database concepts in real life, we have 2 main models, relational and non-relational.

When working with databases, we are dealing with a number of concepts, for example:

1. Database

2. Cloud,

3. Data Warehouse and Data Lake,

4. OLAP and OLTP,

5. Quantitative and Qualitative,

6. Data model, Database model

7. Structured and Unstructured,

8. Replication and Duplication,

9. Functional Dependency, Partial Dependency, Transitive Dependency

10. Key,

11. Normalization, Normal Forms,

12. Partitioning and Sharding,

13. CRUD,

14. Indexing,

15. View,

16. Union, Intersect, Except

17. Transaction,

18. Stored Procedures

19. JOIN

20. Triggers

21. Schema

1. Database — an abstract storage environment, which nonetheless occupies real memory and physical devices; of course, there are specific variations for each situation.

2. Cloud — remote server system and database for business activities.

3. Data Warehouse — a centralized system for business intelligence and scientific analysis. Warehouses are able to process petabytes of information; unlike classical databases, supercomputer-class hardware is used. In principle it stores Structured and Semi-structured Data, and consequently relies on normalization, data purity and uniqueness. It has a three-layer architecture:

1) Bottom layer — consists of the main server, usually a relational database, performing the ETL or ELT process,

2) Middle layer — consists of OLAP, Online analytical processing layer, which speeds up queries,

3) Top layer — is the front end or similar interface.

A warehouse combines information from transactional systems such as CRM (customer relationship management), HRM (human resource management) and ERP (enterprise resource planning), from real-time data-processing databases, and even from flat files and other non-relational sources. It is available for end use as software, as a cloud service or as hardware.

Conventionally, it can be compared with a "Stack", where there is order and regulation.

Data Lake — a general-purpose system for binary large objects such as pictures, videos or audio. Unstructured data can be stored efficiently. In case of poor management it turns into a Data Swamp. It can be compared to the concept of a "Heap".

A comparison of Data Warehouse and Data Lake is given in the table.

4. OLAP or Online Analytical Processing — multidimensional analysis of data stored in a Data Warehouse. Dimensions are location, type of object, time, etc. Because it builds on two-dimensional relational tables, the tables are combined to form multidimensional data from layers. Queries are complicated and touch a large number of rows. They do not change the existing data; most of them only read it.

OLTP or Online Transactional Processing — large-scale transaction processing. It ensures that the same information can be used by many clients and also indexes the data. The queries are simple and touch a few rows. Data changes are fast. Online payment systems are a good example.

5. Quantitative data — by their very existence they already have an objective description; they are described by statistical indicators such as quantity, value, size, etc. They can be discrete or continuous. They correspond to structured data.

Qualitative data — grouped data, grouped according to some attribute. Since such grouping is subjective and can be carried out in many other ways, qualitative data are unstructured or semi-structured, that is, they need further clarification and classification. They correspond to unstructured data.

6. Data Model — obtained by modeling; it gives the data a semantic structure, defines the relationships between items and their similarity to the properties of real-world objects.

Database Model — defines the logical structure of the database. Models are network, document, relational, Entity–attribute–value, etc.

7. Structured Data — quantitative data consisting of numbers and values, presented in the form of tables, with predefined models, stored in Data Warehouses; queries are made in SQL or a similar language. Even raw data that has not yet been processed can acquire structure, for example by being marked up with HTML, and a site can be optimized this way to become more noticeable to a search engine.

Unstructured Data — qualitative data, the initial state of data when a Data Model is missing; it is therefore easy to accumulate large quantities, which requires a lot of memory. Because it can be adapted to different file formats, it can be redefined for specific needs and used efficiently. Use cases are data mining, NLP and predictive analytics; it is stored most efficiently in cloud Data Lakes and handled by NoSQL systems.

To put it briefly, database data can be structured or unstructured, and mixed types are possible as well. Data that has a Data Model is structured: the model gives the data a semantic structure and defines the relationships between items and their similarity to real-world objects. There are also semi-structured types.

8. Replication — special software finds the changes made in different parts of the database and makes all the copies look the same, that is, it either cancels the changes or propagates them to the other parts.

Duplication — one main database is chosen and its content is duplicated; further changes are made only to that main database. It is essentially the same as a backup.

9. Functional Dependency — a constraint written "X → Y", where "X" determines "Y". For example, the vehicle identification number determines the engine power, written "VIN → Engine", because a particular car has only one engine capacity. The notation "Engine → VIN" is incorrect, as there are many cars with the same engine capacity. In the table, a student can enroll in more than one lecture and appear in any number of semesters.

When 2 rows in the table have the same "StudentID", they also have the same "Semester" value; this is the functional dependency "StudentID → Semester". Another example is the dependency {StudentID, Lecture} → {TA, Semester}: the pre-selected attribute set {StudentID, Lecture} determines {TA, Semester}, so {StudentID, Lecture} is a super key.

In the following example, each employee can have only one "Department ID". Since everyone works in exactly one department, the Employee ID also determines the department: {Employee ID → Department ID}.

Employee ID defines the employee name — {Employee ID → Employee Name}. There is also a functional dependency in the table with a non-key attribute — {Department ID → Department Name}. Although there is a functional dependency on {Employee ID → Department ID}, getting “Department ID” using “Employee ID” is not logical.

Partial Dependency — occurs when a non-key column is functionally dependent on only part of a candidate key.

Consider the example.

The primary key consists of the StudentID and ProjectNo columns. The StudentName and ProjectName columns should depend on the whole primary key, but StudentName is determined by StudentID alone, and ProjectName by ProjectNo alone. So we have a partial dependency, which also violates the rules of the 2nd normal form.

Transitive dependency — a type of functional dependency where the 1st column is connected to the 3rd through the 2nd. The table below shows an example:

If we know the book, then we also know the author’s nationality, {Book} → {Author nationality}, because using the book we first find the author, {Book} → {Author}, and then using the author, we find the author’s nationality, {Author → Author Nationality}. Here the non-key attribute defines another non-key attribute.

10. Key — an attribute or column that allows a row to be identified. There are several groups.

1) Super Key — a set of attributes that uniquely identifies each row; it has no minimum size requirement, so it may include other, more compact keys. It defines a functional dependency.

2) Candidate key — like a Super Key but of the minimum possible size. Columns that are part of it are called prime attributes; the others are non-prime attributes. A table with no NULL values has at least one candidate key. The candidate keys are all the minimal combinations that can identify a row.

3) Primary Key — a selection consisting of a minimal set of attributes that uniquely identifies the row; a special case of the candidate key. In SQL Server it can span at most 16 columns and 900 bytes.

4) Foreign key — a set of columns referencing the primary key of another table. The table containing the foreign key is called the child, and the one containing the primary key is called the parent. The value written in the child table either equals a primary key value of the parent table or is NULL. This is called referential integrity.

5) Simple key — a key consisting of only one attribute, a column.

6) Concatenated key — consists of 2 or more columns.

7) Compound key — consists of 2 or more columns, each of which is a simple key in its own right, for example referencing another table's primary key.

8) Composite key — a special case of a primary key, consisting of 2 or more columns, at least 1 of which is not a simple key. Consider the table:

It is not possible to identify a row with only one field, as each column contains duplicate values. By combining the values of the "Team" and "Squad number" columns we create the primary key.
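A minimal T-SQL sketch of such a composite primary key, assuming a hypothetical Player table built from the columns mentioned above:

CREATE TABLE Player (
    Team        NVARCHAR(50)  NOT NULL,
    SquadNumber INT           NOT NULL,
    PlayerName  NVARCHAR(100) NOT NULL,
    CONSTRAINT PK_Player PRIMARY KEY (Team, SquadNumber)  -- neither column is unique on its own
);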

9) Natural key — a unique identifier, which also coincides with the object of the real world, has a business significance.

10) Surrogate key — a unique identifier that is a value in itself; it concerns the logical structure of the data (the Data Model) more than the physical structure of memory (the Storage model). It can act as a primary key in the context of a specific database and is usually an auto-generated value assigned when a record is inserted.

11. Normalization — the process of structuring relational databases using normal forms to reduce data duplication and increase data integrity. Columns and tables are organized so that dependencies follow the database integrity constraints. It is carried out either by synthesis or by decomposition, that is, by building new table hierarchies or breaking existing ones apart. Bringing a table to a normal form can also have a negative effect on speed, make queries harder to write, and so on, and may require denormalization. The concept applies only to relational databases, although non-relational databases can also be improved, just with other concepts and approaches.

Normal Forms — Strategies used for normalization, each subsequent level satisfies the previous level as well.

1st normal form — each cell must hold a single atomic value; repeating groups are not allowed. Instead of storing repeated values in a row with a specific Id, you should write the repeated items as separate rows in a separate table using the same Id.

Consider an example of an incorrect table.

After bringing it to the 1st normal form, the table looks as follows.

2nd normal form — there is no non-prime attribute that is functionally dependent on a proper subset of any candidate key. That is, non-prime columns must not depend on only part of a candidate key. With a single-column primary key, the table is always in the second normal form. Before being brought to the 2nd normal form, the table must already be in the 1st normal form.

The possible primary key in this table is a compound of the "Manufacturer" and "Model" columns. The "Manufacturer country" column is not part of the primary key and cannot be, and it depends only on "Manufacturer". This violates the conditions of the 2nd normal form; to meet the requirements you must have 2 tables.
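A possible decomposition in T-SQL, with assumed table names and types, could look like this:

CREATE TABLE Manufacturer (
    Manufacturer        NVARCHAR(100) PRIMARY KEY,
    ManufacturerCountry NVARCHAR(100)              -- now depends on the whole key
);
CREATE TABLE CarModel (
    Manufacturer NVARCHAR(100) REFERENCES Manufacturer (Manufacturer),
    Model        NVARCHAR(100),
    CONSTRAINT PK_CarModel PRIMARY KEY (Manufacturer, Model)
);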

Table containing Partial dependency.

To bring the 2nd normal form, we divide it into parts.

In the following example also, the second normal form is violated.

The primary key consists of the "PART" and "WAREHOUSE" columns, but the "WAREHOUSE_ADDRESS" column depends only on "WAREHOUSE". As it stands, the same warehouse may have different addresses in different rows. We should ensure that every warehouse always has exactly one address, that is, "WAREHOUSE" should be the primary key of a separate table.

3rd normal form — columns that do not belong to the primary key should not depend on other non-key columns; a non-key column should depend only on the key. This reduces duplication and makes data management and referential integrity easier. The following table is already in the 2nd normal form, but does not satisfy the 3rd.

The Tournament and its Year are enough to identify a row uniquely, so they are the primary key. The problem is that the non-key column "Date of Birth" depends on the primary key through the mediating column "Winner", so the same person could appear with different dates of birth. By dividing the table into parts, we solve the problem: in the new table the "Winner" column is the key, and only one date of birth corresponds to it.
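A sketch of that decomposition in T-SQL, with assumed names and types:

CREATE TABLE Winner (
    Winner      NVARCHAR(100) PRIMARY KEY,
    DateOfBirth DATE                              -- one date of birth per person
);
CREATE TABLE TournamentWinner (
    Tournament NVARCHAR(100),
    TourYear   INT,
    Winner     NVARCHAR(100) REFERENCES Winner (Winner),
    CONSTRAINT PK_TournamentWinner PRIMARY KEY (Tournament, TourYear)
);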

Boyce-Codd or 3.5 Normal Form — a table in the 3rd normal form in which, for every dependency A → B, "A" is a super key, so other columns, such as "B", are determined only by super keys. In other words, the only determinants are keys.

4th Normal Form — the table does not contain 2 or more independent multi-valued facts. In the example there are 2 relationships, "Employee and Skill Proficiency" and "Employee and Language Proficiency". You cannot have "Skill Code and Language Proficiency" or "Language Code and Skill Proficiency" relations; they are not related to each other.

If we fill such a table with data, we will write “Null” values.

To correct, we write the following. There is no need to write “Null”.

5th Normal Form — Compound key does not contain cyclic dependencies.

12. Partitioning — division into independent units. In relational databases there are 2 ways, horizontal and vertical. An example of vertical partitioning is normalization: instead of having 5 columns in 1 table, we have 2 or more tables with fewer than 5 columns each. Horizontal division is called "Sharding". Consider the picture.

Sharding — horizontal partitioning in which the newly obtained pieces keep the base schema and columns. A table or document is fragmented because it is difficult to work with one large chunk, while small chunks can be stored in different databases or on different servers and processed in parallel by different processors. It improves indexing and searching. A Sharding Key can be used to partition transactions: elements with the same key become an inseparable unit, change together and form a logical unit, a "Logical Shard".
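In SQL Server, horizontal partitioning within one database is expressed with a partition function and scheme; the following is a minimal sketch with invented names and boundary values (true sharding across servers is handled at the application or platform level):

CREATE PARTITION FUNCTION pfOrderYear (INT)
    AS RANGE LEFT FOR VALUES (2019, 2020, 2021);   -- 4 ranges of OrderYear
CREATE PARTITION SCHEME psOrderYear
    AS PARTITION pfOrderYear ALL TO ([PRIMARY]);   -- all ranges on the same filegroup here
CREATE TABLE OrdersPartitioned (
    OrderID   INT NOT NULL,
    OrderYear INT NOT NULL
) ON psOrderYear (OrderYear);                      -- rows are routed by OrderYear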

13. CRUD — Basic “Create”, “Read”, “Update” and “Delete” operations to manage the database.

The "CREATE" and "INSERT" commands are used to create a table and insert data; then we read the data.

The table of books contains an ID, title, ISBN, and the names of the authors and the publisher. Not all books have 2 authors and a publisher.
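The original snippet is shown as an image in the source; here is a minimal sketch of what such a table and its data might look like, where all names, types and values are assumptions:

CREATE TABLE Books (
    ID        INT IDENTITY(1,1) PRIMARY KEY,
    Title     NVARCHAR(200) NOT NULL,
    ISBN      VARCHAR(20),
    Author1   NVARCHAR(100),
    Author2   NVARCHAR(100) NULL,     -- not every book has a second author
    Publisher NVARCHAR(100) NULL      -- or a publisher
);
INSERT INTO Books (Title, ISBN, Author1, Author2, Publisher)
VALUES (N'Learn SQL', '111-1111111111', N'A. Author', N'B. Author', N'Some Press'),
       (N'C# Basics', '222-2222222222', N'C. Author', NULL, NULL);
SELECT ID, Title, ISBN, Author1, Author2, Publisher FROM Books;   -- read the data back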

The first part of the following code adds a new column for patronymics to the USER table and changes the default NULL values, then deletes the same column. Next, the rows containing purchases made before 2019 are deleted. In the end, the table and the database are removed with the "DROP" command.
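A sketch of those steps in T-SQL; the USER and Purchases tables, the column names and the database name are assumptions (USER is bracketed because it is a reserved word):

ALTER TABLE [USER] ADD Patronymic NVARCHAR(100) NULL;                 -- add the new column
UPDATE [USER] SET Patronymic = N'Unknown' WHERE Patronymic IS NULL;   -- replace the default NULLs
ALTER TABLE [USER] DROP COLUMN Patronymic;                            -- delete the same column
DELETE FROM Purchases WHERE PurchaseDate < '2019-01-01';              -- purchases made before 2019
DROP TABLE Purchases;                                                 -- drop the table ...
DROP DATABASE ShopDemo;                                               -- ... and the database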

When adding a column, we have the following image.

14. Indexing — designing queries that read data faster by creating additional structures (1). Indexes are close in meaning to keys: just as a key is an attribute that describes the data, an index is an additional attribute, and storing this additional attribute may require extra memory. If we know the index entry for a particular row, we work with that location directly and there is no need to scan the whole table. Let's declare an index.

In the first part of the code we delete all the rows of the table and check by title; in the second part we speed up the execution of the query. It should be noted that the advantage of indexing becomes obvious only on large tables.
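The original code is not reproduced here, but a minimal sketch of declaring an index, reusing the Books table assumed earlier, might be:

SELECT ID, Title FROM Books WHERE Title = N'Learn SQL';      -- scans the table without an index
CREATE NONCLUSTERED INDEX IX_Books_Title ON Books (Title);   -- declare the index on the searched column
SELECT ID, Title FROM Books WHERE Title = N'Learn SQL';      -- the same query can now seek the index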

15. View — a logical table that only displays data and is not physically stored in memory (2). The code creates a "VIEW" that includes C# books with 2 authors, and then modifies it to show all SQL books.
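A sketch of such a view over the Books table assumed earlier (the view name is also an assumption):

CREATE VIEW CSharpBooks AS
    SELECT ID, Title, Author1, Author2
    FROM Books
    WHERE Title LIKE '%C#%' AND Author2 IS NOT NULL;   -- C# books with 2 authors
GO
ALTER VIEW CSharpBooks AS
    SELECT ID, Title, Author1, Author2
    FROM Books
    WHERE Title LIKE '%SQL%';                          -- modified to show all SQL books
GO
SELECT * FROM CSharpBooks;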

The result of the work of the first part of the code.

The “VIEW” after change.

16. UNION, EXCEPT and INTERSECT — operators for creating row combinations from query results (3).
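A brief sketch, assuming two hypothetical tables OnlineCustomer and StoreCustomer that each have a Name column:

SELECT Name FROM OnlineCustomer
UNION
SELECT Name FROM StoreCustomer;      -- every distinct name from both tables
SELECT Name FROM OnlineCustomer
INTERSECT
SELECT Name FROM StoreCustomer;      -- only names present in both
SELECT Name FROM OnlineCustomer
EXCEPT
SELECT Name FROM StoreCustomer;      -- names found only in the first table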

17. Transaction — a context of 1 or more operations, a unit of work (4).

The pseudo-code algorithm is as follows:

1. Start a transaction with BEGIN TRAN,

2. Make queries, data changes,

3. If there is no problem, write Commit to finish,

4. If there is a problem, return to its original position with Rollback.
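A minimal T-SQL sketch of that algorithm, assuming a hypothetical Account table with AccountID and Balance columns:

BEGIN TRY
    BEGIN TRAN;                                                       -- 1. start the transaction
    UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;   -- 2. make the changes
    UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRAN;                                                      -- 3. no problem: commit
END TRY
BEGIN CATCH
    ROLLBACK TRAN;                                                    -- 4. problem: return to the original state
END CATCH;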

There are 3 versions.

1. Explicit transaction — starts directly with “BEGIN TRANSACTION” command, ends with “COMMIT TRANSACTION” or “ROLLBACK TRANSACTION” command.

2. Implicit transaction — works like an explicit transaction with the "COMMIT TRANSACTION" and "ROLLBACK TRANSACTION" commands, but is enabled with the "SET IMPLICIT_TRANSACTIONS ON" command, which starts a "hidden" transaction; from that point the objects touched by the transaction are locked, and work continues using "Commit" and "Rollback".

3. Auto-commit transaction — the default behavior, in which each statement is treated as its own transaction without an explicit or implicit command. Individual "SELECT" and "INSERT" commands are examples. If inconsistent data is entered, "Rollback" occurs automatically and the data is not changed because of the error.

Any transaction must meet 4 conditions called ACID.

1. Atomic — all operations are as inseparable as an atom, either performed together or failing together.

2. Consistent — the result is predictable; both before and after the transaction the database remains in a valid, stable state.

3. Isolated — ensures that instead of acting on the same data at the same time, data usage is sequenced.

4. Durable — the final result does not depend on the execution process. If the transaction has completed and the system then fails, the result must not disappear.

Checkpoint — the mechanism that transfers the data recorded in the transaction log and the pages changed during transactions to durable storage. The log records are written out, the dirty pages from the buffer cache are flushed to disk, and the checkpoint Log Sequence Number is written to the boot page. There are 4 types: Automatic, Indirect, Manual and Internal. A change made by a transaction is not transferred from the buffer pool to disk immediately, but according to the "Checkpoint" principle.

Most DDL queries in T-SQL are transactional (5).

18. Stored Procedures — an object stored in the database; it can accept parameters and can also return values (6).
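A minimal sketch of such a procedure, assuming a hypothetical Product table with ProductID, ProductName and Price columns; the optional parameter is also an assumption:

CREATE PROCEDURE GetProduct
    @MinPrice MONEY = 0            -- optional parameter with a default value
AS
BEGIN
    SELECT ProductID, ProductName, Price
    FROM Product
    WHERE Price >= @MinPrice;      -- the result set is returned to the caller
END;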

To run the procedure, use "EXEC GetProduct".

19. JOIN — a tool for selecting columns from 2 or more tables (7). There are many mathematical models, but in T-SQL there are 5 logical types: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join and Cross Join. The main shape is:

SELECT …
FROM TableA
JOIN TableB
ON …
WHERE …

First we create tables to store the data of the Buyer, Product and Order.
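The original listing is an image in the source; a minimal sketch of the three tables, with assumed names and types, might be (Order is bracketed because it is a reserved word):

CREATE TABLE Buyer (
    BuyerID   INT IDENTITY(1,1) PRIMARY KEY,
    FirstName NVARCHAR(50),
    LastName  NVARCHAR(50),
    City      NVARCHAR(50)
);
CREATE TABLE Product (
    ProductID   INT IDENTITY(1,1) PRIMARY KEY,
    ProductName NVARCHAR(100),
    Price       MONEY
);
CREATE TABLE [Order] (
    OrderID   INT IDENTITY(1,1) PRIMARY KEY,
    BuyerID   INT REFERENCES Buyer (BuyerID),
    ProductID INT REFERENCES Product (ProductID)
);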

After filling the tables with data using the INSERT INTO command, we can view the Buyer, Product and Order tables with the SELECT command.

Using the Order table, we can find out who bought what.

We take the name, surname and city from the Buyer table and the product name from the Product table, but to do so we have to join the Buyer table with the Product and Order tables. The result is ordered alphabetically.
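A sketch of that query against the tables assumed above:

SELECT b.FirstName, b.LastName, b.City, p.ProductName
FROM [Order] AS o
JOIN Buyer   AS b ON b.BuyerID = o.BuyerID       -- combine Order with Buyer
JOIN Product AS p ON p.ProductID = o.ProductID   -- and with Product
ORDER BY b.LastName;                             -- result ordered alphabetically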

20. Triggers — a stored procedure that runs after an event (8). DML triggers fire after specific events such as Insert, Delete and Update. They are compiled in advance and start working automatically. A trigger is declared, the event is defined, and the body is written.
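A sketch of the trigger discussed next, using the [Order] table assumed earlier; the trigger name and the inserted values are assumptions:

CREATE TRIGGER trg_Order_AfterDelete
ON [Order]
AFTER DELETE
AS
BEGIN
    -- any delete on the Order table adds 2 new orders back
    INSERT INTO [Order] (BuyerID, ProductID) VALUES (1, 1), (2, 2);
END;
GO
DELETE FROM [Order] WHERE OrderID > 16;   -- the deletion fires the trigger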

Rows with IDs greater than 16 are deleted from the Order table, but the deletion activates the trigger, which adds 2 new orders. Looking at the ID column, we can verify that rows are first deleted and then added again.

We can also use a JOIN inside the trigger's body.
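A sketch of such an INSTEAD OF trigger on the same assumed [Order] table:

CREATE TRIGGER trg_Order_InsteadOfInsert
ON [Order]
INSTEAD OF INSERT
AS
BEGIN
    PRINT 'JOIN was done instead of INSERT';           -- the attempted INSERT is replaced
    SELECT b.FirstName, b.LastName, p.ProductName
    FROM [Order] AS o
    JOIN Buyer   AS b ON b.BuyerID = o.BuyerID
    JOIN Product AS p ON p.ProductID = o.ProductID;    -- a JOIN result is returned instead
END;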

As soon as we try to do any INSERT, we get the text "JOIN was done instead of INSERT" and the joined table.

21. Schema — the set of attributes of a table or relation and its elements. It is defined in code, for instance with DDL (9), but the term more often means a graphical representation. The star schema is often used during denormalization (10). A schema is the relationship, hierarchy and blueprint of the tables or other units in the database, and it is stored and described in the data dictionary as metadata.

It should be understood that database design is based on scientific research and theoretical concepts; each company can build its own database architecture for its particular business needs.

Resources used

1) Learn SQL Database Programming, Query and manipulate DBs from popular relational DB servers using SQL, Chapter 4

2) SQL Pocket Guide, A Guide to SQL Usage, 4th Ed., Chapter 5

3) T-SQL Querying, Developer Reference, Chapter 3

4) Beginning T-SQL with Microsoft SQL Server 2005 and 2008, Chapter 10

5) Exam Ref 70–761 Querying Data with Transact-SQL, Chapter 3.2

6) Microsoft SQL Server 2019, A Beginner’s Guide, 7th Ed. Chapter 8

7) Beginning T-SQL, A Step-by-Step Approach, 4th Ed. Chapter 5

8) Beginning T-SQL, A Step-by-Step Approach, 4th Ed. Chapter 5, Chapter 13

9) Learn SQL Database Programming, Query and manipulate DBs from popular relational DB servers using SQL, Chapter 1

10) Database Design and Relational Theory, Normal Forms and All That Jazz, 2nd Ed., Chapter 8

Data Lake vs. Data Warehouse — https://www.talend.com/resources/data-lake-vs-data-warehouse/

Data Lake and Data Warehouse — https://www.ibm.com/cloud/blog/cloud-data-lake-vs-data-warehouse-vs-data-mart

Qualitative and Quantitative data — https://www.g2.com/articles/qualitative-vs-quantitative-data

Structured and Unstructured data — https://www.ibm.com/cloud/blog/structured-vs-unstructured-data

Types of transactions — https://www.sqlshack.com/modes-of-transactions-in-sql-server/

Contact me on LinkedIn, Instagram, TikTok and Facebook.
