9/7/2020 Data 620 - Lesson 4: Data Manipulation and SQL File Input/Output - DATA 620 9040 Data Management and Visualization (2208)
https://learn.umgc.edu/d2l/le/content/512831/viewContent/19677149/View

OVERVIEW: The first three weeks of this class introduced you to the general area of data management, and gave you a taste of how to use SQL to talk to your database. Once you have a normalized database and a skilled SQL programmer, there are additional areas to consider. This week, we cover data quality, database design and operations, data governance, and data storage and indexing. We also expand our SQL commands to include those to read data in from outside sources, and to write the results of SQL queries to outside sources: in our case, we use a comma separated values (csv) format.

Data Quality refers to how fit your data is for the purposes you intend. You could have a beautifully normalized relational database with perfectly unique primary keys and flawless foreign key relationships, but if all the customers' names are spelled incorrectly, it will still not be very useful for marketing purposes. This week, we'll cover the major ways things can go wrong with your data: it could be inaccurate, out of date, incomplete, or inconsistent with itself. We'll also cover the major ways things can go wrong with your database itself: you can fail to have referential integrity, or your data could fail to follow business rules. Data quality is such a huge corporate issue that many companies have entire teams of people dedicated to ensuring data quality.

In our study of Database Design and Operations this week, we investigate structural problems with databases.
Because databases are so precisely specified, and because they run on computers, a structural problem can be catastrophic. We have seen in previous weeks how to set up a good database, in which all primary keys are unique and all foreign keys lead to somewhere. This week, we will study integrity constraints, which help ensure that the foreign keys refer to existing destinations, and that if you delete a record, you don't bring down the database.

We then turn our attention away from the normalized relational database schema so often used in operational data stores. We look towards an alternative organization called the Snowflake or Star Schema. Rather than several smaller tables linked with keys, the snowflake schema has one centralized file with various “arms” (or points of the star) attached to it. Each “arm” contains the information you need to query along one dimension. For example, a large flat file could contain the date of a purchase, the purchased item, the shipment destination, and the customer information. One arm of the star could break the dates down into months, quarters, and years. Another arm of the star might break the shipment destination down into states (such as Maryland) or regions (such as the Pacific Northwest). This lets the customer run quick queries for things like “all orders which shipped in Quarter 1 of 2015, to the state of Maryland” or “all orders under […] which shipped in May of any year, to anywhere in the Pacific Northwest.” Queries of this type are included in a method called Dimensional Analysis.
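
To make the two ideas above concrete, here is a minimal sketch in Python using the built-in sqlite3 module. It is only an illustration under stated assumptions: the table names (fact_orders, dim_date, dim_destination), the column names, and the sample rows are all invented for this example and do not come from the course database. The sketch builds one central table with two "arms," shows a foreign key constraint rejecting a row that points to a nonexistent destination, and runs two dimensional queries in the spirit of the ones quoted above. The second query leaves out any price condition, since the sample data has no price column.

```python
import sqlite3

def build_star(conn):
    """Create a tiny star layout: one central table plus two dimension 'arms'."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE dim_date (          -- hypothetical date arm
            date_id INTEGER PRIMARY KEY,
            month   INTEGER NOT NULL,
            quarter INTEGER NOT NULL,
            year    INTEGER NOT NULL
        );
        CREATE TABLE dim_destination (   -- hypothetical destination arm
            dest_id INTEGER PRIMARY KEY,
            state   TEXT NOT NULL,
            region  TEXT NOT NULL
        );
        CREATE TABLE fact_orders (       -- hypothetical central table
            order_id INTEGER PRIMARY KEY,
            item     TEXT NOT NULL,
            date_id  INTEGER NOT NULL REFERENCES dim_date(date_id),
            dest_id  INTEGER NOT NULL REFERENCES dim_destination(dest_id)
        );
    """)

def load_sample(conn):
    """Insert a handful of made-up rows (illustrative data only)."""
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?, ?)", [
        (1, 2, 1, 2015),   # February 2015, Quarter 1
        (2, 5, 2, 2015),   # May 2015, Quarter 2
        (3, 5, 2, 2016),   # May 2016, Quarter 2
    ])
    conn.executemany("INSERT INTO dim_destination VALUES (?, ?, ?)", [
        (1, "Maryland", "Mid-Atlantic"),
        (2, "Washington", "Pacific Northwest"),
        (3, "Oregon", "Pacific Northwest"),
    ])
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", [
        (101, "kettle", 1, 1),
        (102, "lamp", 1, 2),
        (103, "desk", 2, 1),
        (104, "chair", 3, 3),
        (105, "rug", 2, 2),
    ])
    conn.commit()

# "All orders which shipped in Quarter 1 of 2015, to the state of Maryland"
Q1_2015_MARYLAND = """
    SELECT f.order_id
    FROM fact_orders AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    JOIN dim_destination AS s ON s.dest_id = f.dest_id
    WHERE d.quarter = 1 AND d.year = 2015 AND s.state = 'Maryland'
    ORDER BY f.order_id
"""

# "All orders which shipped in May of any year, to anywhere in the Pacific Northwest"
MAY_PACIFIC_NW = """
    SELECT f.order_id
    FROM fact_orders AS f
    JOIN dim_date AS d ON d.date_id = f.date_id
    JOIN dim_destination AS s ON s.dest_id = f.dest_id
    WHERE d.month = 5 AND s.region = 'Pacific Northwest'
    ORDER BY f.order_id
"""

def orphan_is_rejected(conn):
    """Try to insert an order whose dest_id (42) matches no destination row."""
    try:
        conn.execute("INSERT INTO fact_orders VALUES (999, 'ghost', 1, 42)")
    except sqlite3.IntegrityError:
        return True   # the foreign key constraint blocked the bad row
    return False

conn = sqlite3.connect(":memory:")
build_star(conn)
load_sample(conn)
q1_maryland = [row[0] for row in conn.execute(Q1_2015_MARYLAND)]
may_pnw = [row[0] for row in conn.execute(MAY_PACIFIC_NW)]
orphan_rejected = orphan_is_rejected(conn)
print(q1_maryland, may_pnw, orphan_rejected)
```

With these sample rows, the first query returns order 101 and the second returns orders 104 and 105. Notice that each query filters on an arm (quarter, month, state, region) rather than on the central table itself, which is the point of organizing the data along dimensions.
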

Data Governance: Or, the care and feeding of your data
Data is crucial to modern corporate operations, and yet it changes all the time. Just as new building construction needs to be maintained, and periodically given a new roof, most corporate data needs to undergo regular maintenance, and periodically be given a reorganization. Even if nothing went wrong with your data, periodically there are new laws (such as HIPAA) which require new measures to be taken with existing data. Data governance includes the activities a corporation needs to undertake to identify problems with its data, fix them, and deploy the results.

Data Storage and Indexing: How do you find your data quickly?
With our small databases in this class, it's not really necessary to optimize for query speed: a computer can simply look through each one of our few hundred airline records to find the one we want. But what if you have a few million records? Or what if you are Facebook, and are generating more than […] million “likes” every minute [1]? Even if your query can search one million rows a minute, if you are Facebook, you are getting 4 minutes slower in your queries for every minute of time which passes. You started off behind and you are slowing down. One solution is to use a data index (plural: several data indices) to help cut your search time. If you have a list of names sorted alphabetically, and the one you want starts with a “Q,” you don't need to start searching at the “A.” Data indexing is a way to construct a series of maps for your SQL query, so it can make use of alphabetical, numeric, and other sorted orders to accelerate your search.

SQL File Input/Output: How do you read and write to external files?
Up until now, our databases have magically lived within the C