
9/7/2020 Data 620 - Lesson 4: Data Manipulation

9/7/2020 Data 620 - Lesson 4: Data Manipulation and SQL File Input/Output - DATA 620 9040 Data Management and Visualization (2208)
https://learn.umgc.edu/d2l/le/content/512831/viewContent/19677149/View

OVERVIEW:

The first three weeks of this class introduced you to the general area of data management and gave you a taste of how to use SQL to talk to your database. Once you have a normalized database and a skilled SQL programmer, there are additional areas to consider. This week, we cover data quality, database design and operations, data governance, and data storage and indexing. We also expand our SQL commands to include those to read data in from outside sources, and to write the results of SQL queries to outside sources: in our case, we use a comma-separated values (CSV) format.

Data Quality refers to how fit your data is for the purposes you intend. You could have a beautifully normalized relational database with perfectly unique primary keys and flawless foreign key relationships, but if all the customers’ names are spelled incorrectly, it will still not be very useful for marketing purposes. This week, we’ll cover the major ways things can go wrong with your data: it could be inaccurate, out of date, incomplete, or inconsistent with itself. We’ll also cover the major ways things can go wrong with your database itself: you can fail to have referential integrity, or your data could fail to follow business rules. Data quality is such a huge corporate issue that many companies have entire teams of people dedicated to ensuring data quality.

In our study of Database Design and Operations this week, we investigate structural problems with databases. Because databases are so precisely specified, and because they run on computers, a structural problem can be catastrophic. We have seen in previous weeks how to set up a good database, in which all primary keys are unique and all foreign keys lead to somewhere. This week we will study integrity constraints, which help ensure that the foreign keys refer to existing destinations, and that if you delete a record, you don’t bring down the database.
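
The integrity constraints described above can be sketched in a few lines. This is a minimal illustration, not the course's own material: it assumes Python's built-in sqlite3 module as a stand-in for the MySQL setup used in class, and the customers and orders tables are hypothetical. It shows a foreign key rejecting a row that points nowhere, and blocking a delete that would leave an order without its customer.

```python
import sqlite3

# In-memory SQLite database; the course itself uses MySQL, so this is only an analogue.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Hypothetical tables: every order must point at an existing customer.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """CREATE TABLE orders (
           id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL,
           FOREIGN KEY (customer_id) REFERENCES customers (id) ON DELETE RESTRICT
       )"""
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (100, 1)")
conn.commit()


def try_sql(statement):
    """Run one statement; return 'ok', or 'rejected' if a constraint blocks it."""
    try:
        conn.execute(statement)
        conn.commit()
        return "ok"
    except sqlite3.IntegrityError:
        conn.rollback()
        return "rejected"


# A foreign key that leads nowhere: customer 2 does not exist.
orphan_insert = try_sql("INSERT INTO orders (id, customer_id) VALUES (101, 2)")

# Deleting a customer who still has orders would strand order 100.
blocked_delete = try_sql("DELETE FROM customers WHERE id = 1")

print(orphan_insert, blocked_delete)
```

ON DELETE RESTRICT is only one possible policy; a design could instead cascade the delete to the dependent rows. Which is right depends on the business rules the database is meant to follow.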

We then turn our attention away from the normalized relational database schema so often used in operational data stores. We look towards an alternative organization called the Snowflake or Star Schema. Rather than several smaller tables linked with keys, the snowflake schema has one centralized file with various “arms” (or points of the star) attached to it. Each “arm” contains the information you need to query along one dimension. For example, a large flat file could contain the date of a purchase, the purchased item, the shipment destination, and the customer information. One arm of one star could break the dates down into months, quarters, and years. Another arm of the star might break the shipment destination down into states (such as Maryland) or regions (such as the Pacific Northwest). This lets the customer run quick queries for things like “all orders which shipped in Quarter 1 of 2015, to the state of Maryland” or “all orders under [amount] which shipped in May of any year, to anywhere in the Pacific Northwest.” Queries of this type are included in a method called Dimensional Analysis.

Data Governance: Or, the care and feeding of your data

Data is crucial to modern corporate operations, and yet it changes all the time. Just as new building construction needs to be maintained, and periodically given a new roof, most corporate data needs to undergo regular maintenance, and periodically be given a reorganization. Even if nothing went wrong with your data, periodically there are new laws (such as HIPAA) which require new measures to be taken with existing data. Data governance includes the activities a corporation needs to undertake to identify problems with its data, fix them, and deploy the results.

Data Storage and Indexing: How do you find your data quickly?

With our small databases in this class, it’s not really necessary to optimize for query speed: a computer can simply look through each one of our few hundred airline records to find the one we want. But what if you have a few million records? Or what if you are Facebook, and are generating more than 4 million “likes” every minute [1]?
Even if your query can search one million rows a minute, if you are Facebook you are getting 4 minutes slower in your queries for every minute of time which passes. You started off behind and you are slowing down. One solution is to use a data index (plural: several data indices) to help cut your search time. If you have a list of names sorted alphabetically, and the one you want starts with a “Q,” you don’t need to start searching at the “A.” Data indexing is a way to construct a series of maps for your SQL query, so it can make use of alphabetical, numeric, and other sorted orders to accelerate your search.

SQL File Input/Output: How do you read and write to external files?

Up until now, our databases have magically lived within the Cloud Computing Labs, and we have created and destroyed them within the Cloud Computing Lab’s confines. But what happens if you want to, say, download a data file from the Centers for Disease Control and Prevention and run SQL queries on who got the flu? And then what if your query results are so brilliant you’d like to share them with your boss, who will only use Excel? There are commands in SQL you can use to tell it how to read data in from an external flat file, and how to write output to an external flat file. We will show you how those work this week.

Required Readings for the Week:

1. Data Quality
   MySQL 11 - Entity Integrity
   MySQL Referential Integrity
   MySQL Domain Integrity
   MySQL Primary Key (starts with some deer jumping in a yard)
   (We skip the MySQL Auto Increment; you can certainly watch if you want to.)
   MySQL Foreign Key
   Chrysler’s Data Quality Management Case Study
   Data Quality Concepts, Data Quality Tutorial, Data Warehousing Tutorial
   Data Quality Matters, Tech Vision 2018 Trend (Accenture)
2. Database Design and Operations
   Inmon vs. Kimball. This is a brief review of the normalized vs. star schema layouts.
   Star Schema Slides from Professor Majed Al-Ghandour.
   (Link refers to Data 620 Model Classroom 2158, material available in Week 4).
3. Data Governance
   Data Architecture: A Primer for the Data Scientist, by Inmon and Linstedt, Chapter 5
   Data Management Practices Across an Institution: Survey and Report
   Data Governance and Stewardship
4. Data Storage and Indexing
   Database Lesson #7 of 8 Database Indexes
5. SQL File Input/Output
   MySQL Reference Manual, Section 13.2.6 through Section 13.2.9.1. Also see Section 3.3.3, the LOAD command, for alternative ways to load data into a table.

Optional Readings for the Week:

1. Data Quality
2. Database Design and Operations
3. Data Governance
   Managing and Sharing Your Data: Best Practice for Researchers
4. Data Storage and Indexing
5. SQL File Input/Output

[1] http://editorial.designtaxi.com/editorial-images/news-data14082015/big.jpg
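
The read-in and write-out round trip described under SQL File Input/Output can be sketched as follows. In MySQL this is typically done with the LOAD DATA and SELECT ... INTO OUTFILE statements covered in the Reference Manual reading. The sketch below is only an analogue, assuming Python's built-in csv and sqlite3 modules. The file names, the flu table, and the case counts are all made up for illustration.

```python
import csv
import os
import sqlite3
import tempfile

# Hypothetical file names, created in a temporary directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "flu_cases.csv")
out_path = os.path.join(workdir, "flu_by_state.csv")

# Stand-in for a downloaded flat file (made-up numbers, for illustration only).
with open(in_path, "w", newline="") as f:
    csv.writer(f).writerows([
        ["state", "week", "cases"],
        ["MD", 1, 12],
        ["MD", 2, 30],
        ["WA", 1, 7],
    ])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flu (state TEXT, week INTEGER, cases INTEGER)")

# Read the flat file into the table, skipping the header row.
with open(in_path, newline="") as f:
    reader = csv.reader(f)
    next(reader)
    conn.executemany("INSERT INTO flu VALUES (?, ?, ?)", reader)
conn.commit()

# Run a query and write its result out as a CSV file that Excel can open.
cursor = conn.execute(
    "SELECT state, SUM(cases) AS total_cases FROM flu GROUP BY state ORDER BY state"
)
with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # column names as header
    writer.writerows(cursor)

# Read the exported file back to confirm what was written.
with open(out_path, newline="") as f:
    exported = list(csv.reader(f))
print(exported)
```

Writing the column names as the first row is a design choice that makes the file self-describing when it is opened in a spreadsheet.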


