Introduction to Databases

Research Computing Introduction to Databases Presented by Yun Shen ([email protected]) Before We start Sign In Sheet We prepared sign-in sheet for each one to sign We do this for internal management and quality control So please SIGN IN if you havent done so About Me Join RCS March 2016 long time programmer, dated back in 1987 Proficient in C/C++/Perl Domain knowledge: Network/Communication, Databases, Bioinformatics, System Integration.

Contact: [email protected], 617-638-5851 Main Office: 801 Mass Ave. 4th Floor (Crosstown Building) Share a bit about You Name Experience in Database? If so, which database(s) ? Self rating? Is your work data intensive? Whats your current solution? Account on SCC? Motivation (Expectation) to attend this tutorial Any other questions/fun facts you would like the class to know? Evaluation One last piece of information before we start: DONT FORGET TO GO TO: http://rcs.bu.edu/survey/tutorial_evaluation.html Leave your feedback for this tutorial (both good and bad as long as it is honest are welcome. Thank you)

Research Computing Introduction What is Database Key Concepts Typical Applications and Demo Lastest Trends Research Computing Introduction What is Database Key Concepts Typical Applications and Demo Lastest Trends Research Computing What is Database Three levels to view: Level 1: literal meaning the place where data is stored Database = Data + Base, the actual storage of all the information that are interested

Level 2: Database Management System (DBMS) The software tool package that helps gatekeeper and manage data storage, access and maintenances. It can be either in personal usage scope (MS Access, SQLite) or enterprise level scope (Oracle, MySQL, MS SQL, etc). Level 3: Database Application All the possible applications built upon the data stored in databases (web site, BI application, ERP etc). Research Computing DB at each level examples Level 1: data collection text files in certain format: such as many bioinformatic databases the actual data files of databases that stored through certain DBMS, i.e. MySQL, SQL server, Oracle, Postgresql, etc. Level 2: Database Management (DBMS) SQL Server, Oracle, MySQL, SQLite, MS Access, etc. Level 3: Database Application Web/Mobile/Desktop standalone application - e-commerce, online banking, online registration, etc.

Research Computing Examples at each level Level 1: data collection text files in certain format: such as many bioinformatic databases the actual data files of databases that stored through certain DBMS, i.e. MySQL, SQL server, Oracle, Postgresql, etc. Level 2: Database Management System (DBMS) SQL Server, Oracle, MySQL, SQLite, MS Access, etc. Level 3: Database Application Web/Mobile/Desktop standalone application - e-commerce, online banking, online registration, Wikipedia, etc. Research Computing Database Types Flat Model Navigational databases Hierarchical (tree) database model Network/Graph model

Relational Model Object model Document model Entityattributevalue model Star schema Research Computing Typical Database Application Architecture app DBMS DB Research Computing Data File Structure Demo : Take a look at the following file directories: MySQL -- C:\ProgramData\MySQL\MySQL Server 8.0\Data\ Access -- C:\ARCS_dbtutorial\db\access\ Postgresql -- /project/scv/examples/db/tutorial/data/postgresql/testdb/

Research Computing Data File Structure - MySQL Research Computing Data File Structure - Access Research Computing Data File Structure - PostgreSQL Research Computing ATTENTION !!! NO database files can be accessed directly, but only through the database engine, called DBMS Research Computing Typical Database Application Architecture

app DBMS DB Three Common Acronyms SQL Structured Query Language CRUD Create, Read, Update, Delete ACID Atomicity, Concurrency, Integrity and Durability (transaction) Research Computing Disadvantage of conventional flat file Redundancy - same data may store in many different copies Inconsistency data regarding to same business entity may appear in different forms, for example, state name, phone number, etc. This made it hard to modify data and keep clean track of change; even loss of data Mixture of all data together not clear in logical relationships between the data columns, thus hard to understand and manage once the data structure gets complex Hard to maintain and manage No concurrency support (only one can operate on the file) Research Computing

First Acronym - ACID Atomicity transactions are either all or none (commit/rollback) Consistency only valid data is saved Isolation transactions would not affect each other Durability written data will not be lost Good example : bank transaction Most of challenges of ACID compliance come from multiple users/concurrent using of database Research Computing How Databases solves the problem? Self-describing data collection of related records (meta data, data about data) , detail explanation as below: - Self-describing means: Database not just contains data, but also contains definition of the structure of data, that can be considered Meta data; It includes many related info : table column definition, index and key info, constraints, etc At database application level, database can also store other application related meta data as well, it makes personalization and customization of the application according to user profile much easier to handle. The typical example could be the user preference for those common social media sites or e-commerce sites, etc.

Research Computing Database Content Typical Database User Data: tables to store user data Meta data: keep the structure (schema) of the data, including table name, column name and type and contraints over the column(s) User data Medadata Application metadata Index and other overhead Application meta data: application specific meta data regarding to user settings or functions of the application

Index and other overhead data: used for improving performance and maintenance, such as logs, track, security, etc. Research Computing Terminology and Concept Tables (Relations) One very central concept of relational databases is : Tables (Relations ) Table (formally called relation) is the building block of relational database. It stores data in 2D, with its row reflects one instance of record (tuple), and each of its column reflects one aspect of the attributes of all instances, column may also be called field. For example, A student table may contains (student id, first name, last name, grade, school name, home address, ), and each row may represent one students information, and each column of the table represents one piece of information of all students. And this is called a relation. Research Computing Terminology and Concept Relationships the other very central concepts of relational databases is : Relationships In database, a relationship describes the association between two entities (relations). The relationship between the two can be 1:1, 1:M, or M:N.

Relation Relationship Research Computing Terminolgy and Concept - Primary Key and Foreign Key Primary key: Unique Identifier made of one or more columns to uniquely identify rows in a table. If the primary key contains more than one column, it can be called composite key as well. Foreign Key: is the primary key of another table, which is referenced in the current table. Its the key to establish the relationship between the two tables, and through DBMS, to impose referential integrity. Research Computing Terminolgy and Concept - Surrogate Key Surrogate key is a unique column added to a relation to use as the primary key when lack of natural column serves as primary key, or when composite key needs to be replaced for various reasons. Surrogate key is usually in form of auto increment numeric value, and of no meaning to the user, and thus cd often hidden in the table, or form or other entity for the internal use. Surrogate keys are often used in the place of composite key to add more flexibility to the table.

Research Computing Terminology and Concept E-R model E-R Model: Entity-Relationship data model is the common technique used in database design. It captures the relationships between database tables and represent them in a graphical way. The relationships between two entities can be 1:1, 1:N, or M:N. And it is usually established through foreign key constraint. Examples: 1:1 Employee Locker 1:N Customer Order, Order Order Detail M:N Student Course Research Computing Sample E-R diagram #1 Research Computing Sample E-R diagram #2

Research Computing Sample E-R diagram #3 Research Computing One more concept - Normalization Wikipedia definition: Database normalization, or simply normalization, is the process of restructuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd as an integral part of his relational model. Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints. It is accomplished by applying some formal rules either by a process of synthesis (creating a new database design) or decomposition (improving an existing database design). Research Computing Unnormalized Form (UNF) same attributes can be contained in one row

Example: Students take courses Id Name Courses 1. Jack Mathematics Chemistry 2. Tim Chemistry 3. Ana

Physics Chemistry Id or Name Course1 1. Jack Mathematics 2. Tim Chemistry 3.

Ana Physics Course2 Chemistry Chemistry Research Computing First Normal Form (1NF) each attributes can only have one single value in one row; no duplicated row; no row/column order Example: Students take courses Id Name Course 1.

Jack Mathematics 4. Jack Chemistry 2. Tim Chemistry 3. Ana Physics 5.

Ana Chemistry Research Computing Second Normal Form (2NF) 1NF + no partial dependency (non-key attributes may not depend on any of candidate keys) Example: Students take courses Id Name Course CourseID Department 1. Jack

Mathematics M-1 Math 4. Jack Chemistry C-1 Chemistry 2. Tim Chemistry C-1

Chemistry 3. Ana Physics P-1 Physics 5. Ana Chemistry C-1 Chemistry Research Computing

Second Normal Form (2NF) continue 1NF + no partial dependency (non-key attributes may depend on any of candidate keys) (Course) = Func(CourseID) Id Name CourseID 1. Jack 4. (Department) = Func(CourseID) M-1 Course Course ID Department

Jack C-1 M-1 Mathematics Math 2. Tim C-1 C-1 Chemistry Chemistry 3.

Ana P-1 P-1 Physics Physics 5. Ana C-1 Research Computing Third Normal Form (3NF) 2NF + no transitive dependency (non-key attributes may not depend on each other) Example: Students take courses Id

Name Course CourseID Department Building_no building 1. Jack Mathematics M-1 Math 31 D 4. Jack

Chemistry C-1 Chemistry 22 A 2. Tim Chemistry C-1 Chemistry 22 A

3. Ana Physics P-1 Physics 18 S 5. Ana Chemistry C-1 Chemistry

22 A Research Computing Third Normal Form (3NF) continue very similar to 2NF, but more strict; no functional dependency between non-key attributes at all. Id Name CourseID Building_no 1. Jack M-1 31 4. Jack

C-1 22 2. Tim C-1 22 3. Ana P-1 18 5. Ana

C-1 22 Course Course ID Department M-1 Mathematics Math C-1 Chemistry Chemistry P-1

Physics Physics Building_no Building 18 S 22 A 31 D Research Computing BoyceCodd Normal Form (BCNF) 3NF + non dependency between all candidate keys Example: 3NF, but not BCNF - Today's Court Bookings

Court Start Time End Time Rate Type 1 09:30 10:30 SAVER 1 11:00 12:00 SAVER

1 14:00 15:30 STANDARD 2 10:00 11:30 PREMIUM-B 2 11:30 13:30 PREMIUM-B

2 15:00 16:30 PREMIUM-A Research Computing BoyceCodd Normal Form (BCNF) 3NF + non dependency between all candidate keys Example: convert to BCNF - Today's Court Bookings Today's Bookings Rate Type Rate Type Court Member Flag

Member Flag Court Start Time End Time SAVER 1 Yes Yes 1 09:30 10:30 STANDARD

1 No Yes 1 11:00 12:00 PREMIUM-A 2 Yes No 1 14:00

15:30 PREMIUM-B 2 No No 2 10:00 11:30 No 2 11:30 13:30

Yes 2 15:00 16:30 Research Computing One More Example database anomalies My_Anomly_ex_bank_orig AccountNo Name Address AccountBal AccountRate 100001 Lee, Yuan

11 Main Street 23994.58 4.50 100002 Lee, Yuan 11 Main Street 100.74 0.00 100002 Tian, Fu 12 Main Street 100.74 0.00 100003 Wang, Michele

12 Main Street 2500.8 0.00 100004 Dong, Yuan 14 Main Street 32003.98 4.50 100052 Yuan, Ben 16 Main Street 37.38 0.00 Research Computing

One More Example My_Anomly_ex_bank_orig AccountNo Name Address AccountBal AccountRate 100001 Lee, Yuan 11 Main Street 23994.58 4.50 100002 Lee, Yuan 11 Main Street

100.74 0.00 100002 Tian, Fu 12 Main Street 100.74 0.00 100003 Wang, Michele 12 Main Street 2500.8 0.00 100004 Dong, Yuan 14 Main Street

32003.98 4.50 100052 Yuan, Ben 16 Main Street 37.38 0.00 Research Computing One More Example My_Anomly_ex_bank_orig AccountNo Name Address AccountBal

AccountRate 100001 Lee, Yuan 11 Main Street 23994.58 4.50 100002 Lee, Yuan 11 Main Street 100.74 0.00 100002 Tian, Fu 12 Main Street 100.74

0.00 100003 Wang, Michele 12 Main Street 2500.8 0.00 100004 Dong, Yuan 14 Main Street 32003.98 4.50 100052 Yuan, Ben 16 Main Street 37.38

0.00 Research Computing One More Example My_Anomly_ex_bank_orig AccountNo Name Address AccountBal AccountRate 100001 Lee, Yuan 11 Main Street 23994.58 4.50

100002 Lee, Yuan 11 Main Street 100.74 0.00 100002 Tian, Fu 12 Main Street 100.74 0.00 100003 Wang, Michele 12 Main Street 2500.8 0.00

100004 Dong, Yuan 14 Main Street 32003.98 4.50 100052 Yuan, Ben 16 Main Street 37.38 0.00 Research Computing One More Example My_Anomly_ex_bank_orig AccountNo Name

Address AccountBal AccountRate 100001 Lee, Yuan 11 Main Street 23994.58 4.50 100002 Lee, Yuan 11 Main Street 100.74 0.00 100002 Tian, Fu

12 Main Street 100.74 0.00 100003 Wang, Michele 12 Main Street 2500.8 0.00 100004 Dong, Yuan 14 Main Street 32003.98 4.50 100052 Yuan, Ben

16 Main Street 37.38 0.00 Gu, Zhen (???) Research Computing Advantages of Normalization BCNF+ normalization can eliminate all anomalies : No Redundancy No Inconsistency all changes can only be made at the same place and keep consistent (because of the key constraints), in DB terminology get away with all update anormaly. Normalization is the process of decomposition, so all the business concepts can be modeled with clear logical relationships The entire database system remains consistent over time as the database grows with least redundancy and much durability. Strong support to be ACID compliant Research Computing

Advantages of Normalization No(less) data redundancy means easy management, less storage, etc. No headache caused by data operation anomalies. Good for data integrity and consistency. Research Computing Disadvantages of Normalization Take effort May increase complexity in data structure Data retrieving efficiency may be discounted due to the need of join of multiple tables; So may not be proper in read-intensive data applications Sometimes the constraints may be too strict to be flexible to make some customized change needed. Research Computing Disadvantages of Normalization Hard to deal with complex data structures such as class, objects, rows in a field. Query for comprehensive information can be costly.[6] Due to fixed predesigned structure, it is not flexible in terms of

restructure of data Research Computing Modern applications Today companies like Google, Amazon and Facebook deal with loads of data and storing that data in an efficient manner is always a big task. They use NoSQL database which is based on the principles of unnormalized relational model to deal with storage issue.Some of the examples of NoSQL databases are MongoDB, Apache Cassandra and Redis. These databases are more scalable and easy to query with as they do not involve expensive operations like JOIN. Research Computing Denormalization Trade-off Normalization and denormalization both have advantages and disavantages. The best practice is always a trade off between the two. Denormalization will increase the risk of loss of data integrity and the size of storage, but may gain the simplicity and intuitivity of presenting data. Research Computing

Denormalization - Example Customer (CustomerID, Name, Address, Zip, City, State) CustomerID [PK] Zip [PK] CustomerID [PK] Name City Name Address State Address Zip [FK] Zip City

State Research Computing Denormalization - Example This is the normalized table design CustomerID Name Address Zip 101 John 111 Main St 02132 102 Adam

17 Willow St 02451 103 Grace 333 Burke St 02132 Zip City State 02132 Boston MA

02451 Waltham MA Research Computing Denormalization - Example This is the denormalized table design CustomID Name Address Zip City State 101

John 111 Main St 02132 Boston MA 102 Adam 17 Willow St 02451 Waltham MA 103

Grace 333 Burke St 02132 Boston MA Research Computing Database Operation CRUD four basic operations Create : generate (create/store) a record of data Read : retrieve stored data Update : change (update) certain properties (attributes) of data Delete) : complete remove (delete) the record of data There are more other operations Research Computing Tools to Access Databases

SQL (Structured Query Language) Sublanguage (not full language) Good only at dealing with structured (predefined, organized) data DDL (Data definition Language) - Create/Modify tables/columns DQL (Data Query Language) - Select DML (Data Manipulate Language) - Insert/update/delete DCL (Data Control Language) User Access Control, permissions, roles Research Computing How to Learn SQL Same approach as to learn many other languages, get a language reference first https://www.w3schools.com/sql/sql_syntax.asp (Standard SQL, simple) https:// docs.oracle.com/cd/B28359_01/server.111/b28286/expressions.htm#SQLRF004 (complete, but more complicated, and may add its own flavor) https://dev.mysql.com/doc/refman/5.7/en/sql-syntax.html (MySQL SQL statements) Start from basics (DDL, DML, DQL) Extend to more complicate elements (Subquery/Join/Function/Store Procedure/Index/Programming) Research Computing

Demos and Exercises Microsoft Access Templates E-R schema in Microsoft Access Data import/export via Microsoft Access (may do some real life demo using actual data file from audience) Data import/export via MySQL Online practice: https:// www.tutorialspoint.com/execute_sql_online.php Research Computing Other Topics not covered by this tutorial Database Administration: concurrency, security, backup, recovery, and many more Database Performance tuning: indexing, server configurations Database programming Database Technology Trend: BigData challenge, Data Warehouse, BI system, NoSQL, Cloud, Hadoop/Spark, etc. Research Computing BI Systems Reporting System

Data Mining (Has big overlap with todays ML/AI trend) Data Warehouse/Data Mart ETL (Extract/Transform/Load) Research Computing Big Data 4Vs: Volume how big in storage is need? Variety how diverse is the data ? Veracity can data be verified/trusted? Velocity how fast is the data being generated? Research Computing Big Data Technologies Predictive analytics: to discover, evaluate, optimize, and deploy predictive models by analyzing big data sources to improve business performance or mitigate risk. NoSQL databases: key-value, document, and graph databases. Search and knowledge discovery: tools and technologies to support self-service extraction of information and new insights from large repositories of unstructured and structured data that resides in multiple sources such as file systems, databases, streams, APIs, and other platforms and applications. Stream analytics: software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple disparate live data sources and in any data format. In-memory data fabric: provides low-latency access and processing of large quantities of data by distributing data across the dynamic random access

memory (DRAM), Flash, or SSD of a distributed computer system. Distributed file stores: a computer network where data is stored on more than one node, often in a replicated fashion, for redundancy and performance. Data virtualization: a technology that delivers information from various data sources, including big data sources such as Hadoop and distributed data stores in real-time and near-real time. Data integration: tools for data orchestration across solutions such as Amazon Elastic MapReduce (EMR), Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB. Data preparation: software that eases the burden of sourcing, shaping, cleansing, and sharing diverse and messy data sets to accelerate datas usefulness for analytics. Data quality: products that conduct data cleansing and enrichment on large, high-velocity data sets, using parallel operations on distributed data stores and databases. Evaluation One last piece of information before we start: DONT FORGET TO GO TO: http://rcs.bu.edu/survey/tutorial_evaluation.html Leave your feedback for this tutorial (both good and bad as long as it is honest are welcome. Thank you) Thank you !

Recently Viewed Presentations

  • Ça fait peur aux oiseaux - gfol1.imagileonation.com

    Ça fait peur aux oiseaux - gfol1.imagileonation.com

    Ça fait peur aux oiseaux La Bonne Chanson Répertoire Français Au Québec 1925 - 1955 Ne parlez pas tant Lisandre Quand nous tendons nos filets Les oiseaux vont vous entendre Et s'enfuiront des bosquets Aimez-moi sans me le dire Aimez-moi...
  • Homeostasis and Body Organization

    Homeostasis and Body Organization

    Capillaries - Microscopic vessels that allow blood and body to exchange gas, nutrients, and waste Veins and Venules - Carry blood back to the heart Valves direct flow of blood in veins Change in blood velocity as a function of...
  • New Ambulance Operations Centre Make Ready: Polegate James

    New Ambulance Operations Centre Make Ready: Polegate James

    Currently seeking planning permission Approximate 12 month build time Local Impact MRC home for approx 130 operational A&E staff, approx 26 PTS staff and a make ready team Principle staff shift changeover times approx 0530-1000 and 1730-2000 Ambulance vehicles deploy...
  • JRN 490 Peace Journalism Lesson 1: Introduction to Peace ...

    JRN 490 Peace Journalism Lesson 1: Introduction to Peace ...

    According to Peter du Toit [2000]: "Conflict exists in a relationship when two or more parties believe their aspirations cannot be achieved simultaneously, or perceive a divergence in their values, needs or interests and purposefully employ their power in an...
  • Chapter 8

    Chapter 8

    The company begins 2016 with merchandise inventory of $120,000 on hand. During 2016, additional merchandise was purchased on account at a cost of $600,000. Sales for the year, all on account, totaled $820,000. Lothridge uses a periodic inventory system. A...
  • Infections in diabetes Dr Priscilla Rupali Dept of

    Infections in diabetes Dr Priscilla Rupali Dept of

    Infections in diabetes Dr Priscilla Rupali Dept of Medicine-I &I.D CMCH Vellore Worrying symptoms Foul smelling vaginal discharge Dysuria, painful urination Fever Painful swallowing Changes in bowel habits Warmth or redness at the site of a scrape or cut Or...
  • Mrs. Oberdick's Class

    Mrs. Oberdick's Class

    I went to Broken Arrow Public Schools starting in the 4th grade. It is where I graduated high school. After High School I returned to BA where I taught Sophomore English for 3 years, Junior English for a year and...
  • Title

    Title

    38 mbb/h. Medium-scale experiments on subgame solvingwithin action abstraction. Small Game Exploitability. Large Game Exploitability. Blueprint Strategy. 91.3 mbb / hand. 41.4 mbb / hand. Unsafe Subgame Solving. 5.51 mbb / hand. 397 mbb / hand. Re-solving. 54.1 mbb /...