Sql Data Science Questions Answers Er Diagram

Please discuss the three types of data models in an entity-relationship (ER) model.

The Conceptual Model

This model is used to describe the conceptual structure of the business data strategy in the initial stage of design. It defines entity names and entity relationships but does not focus on technical problems, computer systems, or database management systems.

The Logical Model

This model is used for the implementation of the database management system and it focus on the logical structure of the data. It is a staging model and it can be both user- and system-oriented. In addition to entity names and entity relationships, the model defines attributes, primary keys, and foreign keys in each entity.

The Physical Data Model

This is the final stage of a data model which not only relates to a specific database management system, but also states the operating system, storage strategy, data security, and hardware. This model describes schema details, columns, data types, constraints, triggers, indexes, replicas, and backup strategy. This model carries the actual design blueprint for the database implementation by DBA and database developers.

What is normalization? What is denormalization?

Normalization is the process of reducing data redundancy while maintaining data integrity. This is performed by creating relationships among tables through primary and foreign keys. Normalization procedures include 1NF (first normal form), 2NF, 3NF, etc.

1NF: Eliminate repeating groups of attributes into entities and implement a primary key for each entity. (By definition, each cell is identified by the combination of its unique primary key and a column identifier.)

For example, in this model of university classes, each record entry is uniquely identified by the primary key [Professor_Id]+[Course_Id].

A 1NF model with columns for professor ID, name, sex, type ID/name, and department; course ID and title; term, and campus.

2NF: This includes 1NF, but all non-key attributes (i.e., those other than the primary key columns) must depend on the primary key. If not, separate such attributes into another entity. You are solving the question of whether relationships should be one-to-many or many-to-many at this stage. (See example below.)

To extend our 1NF example to 2NF, we include "verb relationships" but not "adjective relationships." The "Course" column partially depends on the primary key [Professor_Id]+[Course_Id]+[Term]. Therefore, we create the entities "Class" and "Professor-Class" to model the many-to-many relationship.

The same model extended to 2NF, using separate "Professor-Class" and "Class" tables.

3NF: This includes 2NF—plus, all fields (columns) cannot have transitive functional dependency (an indirect relationship between columns) from primary key in the table. In the above 1NF example, if we know the [Course+Term] we learn the [Professor_Name]. When we learn the [Professor_Name] we have the [Dept_Name]. Therefore [Course+Term] has transitive functional dependency on Department.

At this stage, we are creating reference tables with foreign keys—similar to what we did with 2NF, but for "adjective relationships" as well.

Continuing with our example, [Type], [Dept_Id], [Dept_Name] and [Campus] have transitive functional dependency. They do not depend on the primary key [Professor_Id], so we create two reference tables:

Extending the model to 3NF by adding Type and Department tables.

In most cases, the 3NF stage is sufficient for normalization.

Denormalization, on the other hand, is an optimized procedure used on a normalized database to increase query performance for data analytics. It's a process to revert from 3NF (or a higher form) to 1NF. A denormalized model is also sometimes called a dimensional model. It decreases the running time of selected queries by eliminating complex joins, it pre-aggregates detailed data into summary-level data, and it stores data in a separate denormalized 1NF using in-memory and column-stored technology (c.f. the related question on this page.) That is, the original 3NF model remains in place as the original source of data.

You'll find a variation of this strategy within any data warehouse dimensional model design strategy.

Please discuss in-memory analytics and column store databases in the context of data warehouse architecture.

In-memory analytics is a method of caching the entire database in RAM. In the past, RAM has been expensive, so this method was often cost prohibitive. The solution at the time was to use data indexing and store pre-aggregated data.

The slowest part of that process was retrieving data from disk, because limited system memory was not capable of holding an entire database. Today, the price of RAM is more affordable, so the industry now has a much higher adoption rate when it comes to in-memory analytics.

A column store is a modern way of storing and retrieving data in a database. In the past, data was row-oriented and stored horizontally. In order to retrieve the data, it was read horizontally, like reading a book from left to right. So the entire data set would have to be read from the hard drive into RAM before it could be used.

Nowadays, column store (or columnar store) databases use column-oriented storage which stores the data vertically. Thus there are minimal hard drive reads when retrieving data—the database only reads the columns that a query actually contains.

For example, say you have a "Customer Sales" dataset that has 20 columns of all the customers' personal and sales information, with a million rows. You only need the [Customer Name], [Address], and [Purchase Date] columns. Using a column store, the data will only be read from three columns with the least amount of hard disk reads, drastically reducing read times.

Also, column storage improves the likelihood of compression, because the data values within a single field are highly similar to one another—something that compression algorithms depend on.

Apply to Join Toptal's Development Network

and enjoy reliable, steady, remote Freelance Data Modeling Expert Jobs.

Apply as a Freelancer

Please discuss the issues/disadvantages with this dimensional model:

Instead of expanding charge line items horizontally, the model should be split into a FACT_Header and a Fact_Detail table. The charge line item column names should be pivoted into one "Charge_Code" column as a dimension for in-memory/column storage best practices.

Separate tables for Fact_Header (with Shipment_No, Date, Product, FCLLCL, and Port_to_Port columns) and Fact_Detail (with Shipment_No, Charge_Code, and Amount columns).

Please discuss the different types of NoSQL databases.

Document Databases

A document database is mostly used for semi-structured data storage in a self-describing document format like JSON. Data document structures—defined by key-value hierarchies—can either be identical or different. Documents with the same attributes will be grouped into "collections" which is similar to tables in a relational databases.

              {     "id": "IC001",     "Category": "Ice Cream",     "Product Name": "Celebration Cookie Cake",     "Size": 2.5,     "Details": {         "shape": [             { "id": "001", "type": "Round" },             { "id": "002", "type": "Sheet" },             { "id": "003", "type": "Roll" },             { "id": "004", "type": "Dome" }         ],	         "cream flavor": [             { "id": "1001", "type": "Mint" },             { "id": "1002", "type": "Oreo" },             { "id": "1045", "type": "Chocolate" },             { "id": "1021", "type": "Cheese Cake" },             { "id": "1022", "type": "Vanilla" },             { "id": "1033", "type": "Cookie" },             { "id": "1014", "type": "Butter Pecan" }         ]     },     "Image": "X562FCDQX73DS;L9EBWIDKFDKNLUWNJSNA:IU*DWJHIJDSJDKALijnkw8!WEJD#KLJSATEGD..." }

Column Databases

In a column database, data is stored column-oriented instead of row-oriented. This type of database is optimized for I/O during analytical database queries. Some examples of NoSQL column databases include Google's Big Table, Apache Cassandra, and HBase.

Key-value Databases

As with document databases, a key-value data structure is like a dictionary or a map, but in this case it cannot be nested. It uses a unique key—which can be synthetic or hash-algorithm-generated—to point to its own value list. The value can be of different types: string, JSON, basic large object (BLOB), etc.

              {     name: "Jimmy Johnson",     email: "jjohnson@test.com",     street: "13278 Southside Blvd",     city: "Orlando",     state:  "FL",     country: "USA" }

Cache Systems

Similar to key-value pairs, frequently accessed data is stored in memory for fast retrieval and access. Redis and clustered Memcached are examples of this. (While Memcached can be used by RDBMSes, it can only offer memory in one server, whereas NoSQL offers memory farms from all servers in a cluster.)

Graph Databases

In a graph database, objects are represented as nodes and relationships between objects are represented as edges. Both edges and nodes have defined attributes/properties which can be stored as key-value pairs. Examples include Neo4J, InfiniteGraph, OrientDB, Virtuoso, and Allegro.

Note: For more information please see The Definitive Guide to NoSQL Databases on the Toptal Engineering Blog.

What are the advantages of NoSQL databases over relational databases?

Flexibility

A relational database only allows structured data, whereas a NoSQL database offers the flexibility to store structured, semi-structured, or unstructured data.

Dynamic Schema

In a relational data model, the schema needs to be predefined. In the real world, applications evolve and changes are needed. This process is very costly and time-consuming. In recent decades, the data object is more complicated compared to the days of simple text and numbers. For example, geospatial data (polygons made out of GPS coordinates) can be difficult to query efficiently with some RDBMSes.

A dynamic schema offers a solution to add complex data objects easily because you can modify or add new elements as your application grows without a predefined schema.

Large Blobs/Rich Media

In many contexts nowadays, we need to have a more flexible way to store data types like voice files, video files, images, etc. Where once it was considered a best practice to simply store links to a filesystem for data like this, with NoSQL they can be stored directly.

The advantage—aside from avoiding brittle links—is that the database breaks the objects into pieces and can distribute them over a server pool for the ultimate performance gain.

Sharding

Sharding is the process of dividing up and distributing data into smaller databases for faster access to the data. The data is spread across a number of servers without the application being aware. The data and query load are evenly distributed across the servers. That way, if one goes down, it can be replaced effortlessly and with no application disruption.

Compared to a single-server setup, this process reduces big data tasks into small pieces so that it they can be processed at the same time in a distributed server farm.

Replication

NoSQL databases are quite sophisticated when it comes to being self-resolving and self-balancing. They offer failover and recovery options. They do this by distributing the database over many geographical/regional data centers. If one region goes down, it automatically relies on the others. This architecture allows for uninterrupted ("high") availability, because data is replicated across different data centers.

Scaling

Starting with a distributed set of server nodes (all server nodes are peers with no shared resources and can work independently), as the database size increases, additional server nodes can be added on the fly across the data center without interruption or limitation.

In current RDBMS architectures, you can generally only scale up by adding memory and hard drives within the applicable hardware limitations, although native sharding support is in the works in some cases.

How do you handle one-to-many and many-to-many relationships in NoSQL data modeling?

One-to-many

One-to-many relationships in NoSQL are modeled in a document database and the 1:M relationship is represented as a parent-child relationship-embedded entity object within another document data type. For example:

              {     "order_id": "12345",     "order_date": "1/21/2019",     "customer_id": "123456",     "order_details": [         {             "item_id": "2345",             "qty": 2,             "unit_price": 157.75,             "sales_price": 315.5         },         {             "item_id": "2110",             "qty": 1,             "unit_price": 75.25,             "sales_price": 75.25         },         {             "item_id": "1760",             "qty": 3,             "unit_price": 55,             "sales_price": 165         }     ] }

Here, each of the objects under order_details is a child document of a different document data type.

Many-to-many

Many-to-many relationships are modeled using two document collections with embedded entity objects referencing related documents from a third collection with identifiers/keys. The M:N relationships are achieved/maintained at the application level when retrieving data. The following example uses the identifier class_id to resolve M:N relationships between professor and student collections.

The professor collection looks like this:

              {     "Professor_id": "1021",     "Name": "John P Vess",     "Type": "Adjunct",     "class_details": [         {             "class_id": [                 "B10101",                 "B10203",                 "B10112"             ]         }     ] }, ...

And the student collection looks like this:

              {     "Student_id": "201727542",     "Name": "Mary Carson",     "Type": "Under Graduate",     "class_details": [         {             "class_id": [                 "B10101",                 "B10107",                 "B10119"             ]         }     ] }, {     "Student_id": "201821230",     "Name": "Jerry Smith",     "Type": "Under Graduate",     "class_details": [         {             "class_id": [                 "B10101",                 "B10203",                 "M10332",                 "C12001"             ]         }     ] }, ...

Finally, the class collection, referred to by the other two collections, looks like this:

              {     "Class_id": "B10101",     "Class": "Business Principles" }, {     "Class_id": "B10203",     "Class": "Introduction to Marketing" }, ...

What are the possible cardinalities of a relationship in the entity-relationship (ER) model, and what do they mean?

There are three degrees of relationship, known as:

One-to-one (1:1)

Here, one occurrence in an entity can relate to one occurrence or zero occurrences in another entity.

For example: A "tag" table with a "tag number" attribute might have a 1:1 relationship to a "vehicle" table with a unique vehicle identification number (VIN) as its primary key. That is, a given vehicle may or may not have a tag, but it can't have more than one.

Further to this example, there might be motor vehicle tags that are in inventory, but that have not been assigned a vehicle yet.

One-to-many (1:M)

In this case, one occurrence in an entity relates (or can relate) to many occurrences in another entity.

For example: A professor can teach many classes. Therefore, in the "professor" table, each professor record can be associated with many class records.

One professor can be associated with many classes.

The opposite relationship is many-to-one (M:1)—it's simply a matter of semantics, depending on which table you are mentioning first—e.g., the "class" table would have an M:1 relationship to the "professor" table.

Many-to-many (M:N)

With this type of relationship, many occurrences in an entity can relate to many occurrences in another entity.

E.g., a student can take many classes and a class can contain many students. To implement this relationship, you will need an intermediate table linking students to courses. We can call this table the "student-course" table.

So in total you will need three tables: "student," "class," and "student-class." From "student" to "student-class" is 1:M. From "student-class" to "class" is M:1. Finally, "student" to "class" is M:N using the intermediate table.

Students can be associated with many classes and vice-versa; therefore there's an intermediate student-class table to correctly model the M:N relationship between the two entities.

What is the star schema? When is it used?

The star schema is where there are one or more fact tables referencing any number of dimension tables in a star schema. Usually, the fact tables in a star schema are created from the third normal form (3NF) with foreign keys and aggregates (sometimes called "measures.")

It's used when developing data warehouses and dimensional data marts.

The star schema below shows two fact tables, "Fact_Professor" and "Fact_Professor_Class_Detail," and dimension tables include "Dim_Type," "Dim_Department," "Dim_Date," and "Dim_Class". Foreign keys are [Type_Id], [Dept_Id], and [Date_Key]. Measures (not pictured; these are calculated sums, minimums, maximums, averages, counts, and distinct counts) would include the sum of [Credits], the distinct count of [Professor], the distinct count of [Class_Id], and [Hired_Days] which can be calculated the difference between [Hired_Date] and today().

Star schema example, with tables Dim_Type, Dim_Department, Dim_Class, Fact_professor, Dim_Date, and Fact_Professor_Class_Detail.

Please discuss the best strategy to model a set of entities in a NoSQL database.

Beginning with a small-scale NoSQL database, this model can be built in using a dynamic schema. That is, it will allow schema changes and adjustments in real time, rather than having a predefined schema during the development of an application. However, to ensure data consistency and resolve relationships between data entities, use an entity-relationship (ER) model initially to define entities, relations, attributes, primary keys and foreign keys. The ER model—since it's normalized by definition—will ensure data integrity. The model can then be converted/denormalized to a parent-child relationship-embedded entity model. For example:

Start with a 3NF model with "order_id" as the primary key in the entity "Order" and with foreign keys in the entity "Detail":

Order and Detail tables for a typical 3NF model.

Denormalized, the above 3NF Model then looks like:

The 3NF model denormalized into a single table with order_id, order_date, customer_id, detail_id, item_id, qty, unit_price, and sales_total columns.

Finally, you can convert the denormalized form to a NoSQL parent-child relationship-embedded entity model:

              {     "order_id": "12345",     "order_date": "1/21/2019",     "customer_id": "123456",     "order_details": [         {             "item_id": "2345",             "qty": 2,             "unit_price": 157.75,             "sales_price": 315.5         },         {             "item_id": "2110",             "qty": 1,             "unit_price": 75.25,             "sales_price": 75.25         },         {             "item_id": "1760",             "qty": 3,             "unit_price": 55,             "sales_price": 165         }     ] }

There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every "A" candidate worth hiring will be able to answer them all, nor does answering them all guarantee an "A" candidate. At the end of the day, hiring remains an art, a science — and a lot of work.

Submit an interview question

Submitted questions and answers are subject to review and editing, and may or may not be selected for posting, at the sole discretion of Toptal, LLC.

Eva Bojorges Rodriguez

Freelance Data Modeling Expert

Mexico Toptal Member Since October 16, 2014

Eva is a skilled back-end developer and machine learning engineer with experience in scalability issues, system administration, and more. She has a flair for well-structured, readable, and maintainable applications and excellent knowledge of Python, Ruby, and Go. She is a quick learner and has worked in teams of all sizes.

Christopher Karvetski

Freelance Data Modeling Expert

United States Toptal Member Since June 22, 2016

Dr. Karvetski has ten years of experience as a data and decision scientist. He has worked across academia and industry in a variety of team and client settings, and has been recognized as an excellent communicator. He loves working with teams to conceive and deploy novel data science solutions. He has expertise with R, SQL, MATLAB, SAS, and other platforms for data science.

Renee Ahel

Freelance Data Modeling Expert

Croatia Toptal Member Since June 24, 2018

Renee is a data scientist with over 12 years of experience, and five years as a full-stack software engineer. For over 12 years, he has worked in international environments, with English or German as a working language. This includes four years working remotely for German and Austrian client companies and nine months working remotely as a member of the Deutsche Telekom international analytics team.