
Make Accumulated Data in Companies Eloquent by SQL Statement Constructors (PDF)

Presented at IEEE BigData 2017, Boston, on Dec 11, 2017,
at the 3rd International Workshop on Methodologies to Improve Big Data Projects.
The author is Toshiyuki Shimono, Digital Garage, Inc.

(This is in PDF rather than MS PowerPoint format, for the sake of a significantly smaller file size.)



  1. 1. Make Accumulated Data in Companies Eloquent by SQL Statement Constructors IEEE BigData 2017 (Boston) Dec. 11 Toshiyuki Shimono Digital Garage, Inc.
  2. 2. Work Contributions 1. For exploring an unknown DB, Ø organized the milestones, Ø conceived the methods. 2. Proposed a beneficial software tool for “Big Data”. Ø No other such tools seem to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16], [Kumar, Alencar 16]. 3. Reduced the labor to understand a DB, Ø by shrinking it from months to a week. Ø “Knowing” latently dominates a data analysis project. A similar slide appears again at the end. 2
  3. 3. I. Background 6 slides 3
  4. 4. Background • Many organizations have accumulated their own business data in recent years. • But their DBs are rarely well designed, so their data is far from fully utilized. 4
  5. 5. The real situation today as of 2017 The accumulated data is big and complex. Which part of it should be taken for analysis? Which tables are needed? Which columns are needed? Where are the meaningful date/user columns? How many bytes will be exported? What do the tables/columns mean? How can the dates/customers be narrowed down? How can damage to the exported data be detected? And the database system is so old that meaningful “pre-analysis” is very difficult! 5
  6. 6. How can data scientists utilize data? So many tables and columns. Difficulties occur in: Ø knowing the meanings, Ø reading the documents, Ø discussing with the clients. What is a good way to utilize the data sleeping in the database? 6 ▲ Many tables, each with many columns!
  7. 7. 1. Understanding the DB 2. Building a new environment for analysis 3. Serendipitous discovery in business by some advanced analysis Ø Are the data effective to analyze? Ø What are the special/error values? Ø How are columns connected across tables? 7
  8. 8. Preliminary Knowledge • A database is a collection of tables. • A table is like an MS Excel sheet, with rows (records) and columns (attributes). • Many databases are handled via SQL, such as MySQL, Oracle, PostgreSQL, and SQL Server. 8
  9. 9. Ø Some columns are connected, sharing the same coding system. Ø Then how can one determine all of the connected columns? • One needs to see the values of each column, but how? 9 (Retouched from https://commons.wikimedia.org/wiki/File:Data_model_in_ER.png)
  10. 10. II. Introductions of the novel software 10 slides 10
  11. 11. • Current software doesn’t cover today’s needs. • New software is necessary. • So I created it for SQL-type DBs. 11
  12. 12. Assumed environment • a CLI (Command Line Interface) • to produce SQL statements, • to store the data, • to process the data; • an SQL-type DB. ▲ Command Line Interface (CLI) ▲ SQL client software, where SQL statements are entered and the SQL output appears. 12
  13. 13. Complement : SQL statements
      create table T (x numeric, y varchar, z date)  ← make a table.
      insert into T values (2, 'abc', '2017-12-11')  ← add a record.
      select * from T  ← output all the records of T.
      select count(*) from T  ← the number of records in T. 13
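These statements can be tried end to end; below is a minimal sketch using Python's built-in sqlite3 module (the table and values are the ones from this slide; any SQL-type DB behaves similarly):

```python
import sqlite3

# In-memory database; table T mirrors the slide's example.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (x numeric, y varchar, z date)")  # make a table
cur.execute("insert into T values (2, 'abc', '2017-12-11')")  # add a record
rows = cur.execute("select * from T").fetchall()              # all records of T
count = cur.execute("select count(*) from T").fetchone()[0]   # number of records
print(rows)   # [(2, 'abc', '2017-12-11')]
print(count)  # 1
```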
  14. 14. To get a “listed” result by SQL : 1. Prepare the SQL statement. 2. The table returned by the SQL statement. This output is very useful, but the SQL statement is too long to enter manually. So an SQL statement generator is desirable! 14
  15. 15. Time flies; time info is helpful. Process-time and date-time information are attached. 15 A combination of (20) SQL statements yielded by the same command (using an option).
  16. 16. 16 SQL statements are yielded in the CLI environment.
  17. 17. 20 commands for many functions Each command receives either : (1) table names or (2) column names with their table names. Each command accepts option switches : --help shows the online manual; -a, -b, -c, ..., -z give various minor functions. 17
  18. 18. The program functions 18
      Program name : what the produced SQL statement(s) do.
      serverInfo : SQL DB system version information.
      tableLines : counting the records of each table.
      tableColumns : column information of all tables.
      sampleRows : random sampling of rows.
      minMax : taking the min/max of each column; also taking the 4 values.
      mostFreq/FewId : taking the most/least frequent values of each column.
      distinctCount : counting the distinct values of each column.
      hasChar/nullCount : counting the values with a specific character, or null values.
      byteTable/byteCol : computing or estimating the byte size of each table or column.
      vennTwo : calculating how sets of values overlap.
      newTable : creating a table with ease.
      hashSum : summing numerically mapped SHA-1 values to compare tables.
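To illustrate the idea of a statement constructor (a sketch only, not the actual tool's code; the function name and output format here are assumptions), a "tableLines"-style generator can be written in a few lines of Python:

```python
def table_lines_sql(tables):
    """Sketch of a 'tableLines'-style constructor: one counting
    statement per table (the real tool's format may differ)."""
    for i, t in enumerate(tables, 1):
        # A sequence number and the table name keep the combined
        # output of many statements readable.
        yield f"select {i} as seq, '{t}' as tbl, count(*) as n from {t};"

stmts = list(table_lines_sql(["customers", "orders"]))
print("\n".join(stmts))
```

Feeding the generated statements to an SQL client one by one yields a row-count table for the whole DB.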
  19. 19. SQL generators Demo (PowerPoint animation) 19
  20. 20. GitHub page (program repository) 20 Find the webpage: github.com/tulamili Both English and Japanese skills are necessary to use it. Sorry!
  21. 21. To improve the UI on the CLI : • I have been creating commands one by one, with the policy of naming each program with 2 English words. • To keep the namespace of Unix/Linux command names clean, the UI should be altered. • The next step would be the style of using a command argument to specify the function. 21 (Skip this page unless time allows.)
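The "command argument specifies the function" style mentioned above is the familiar git-style subcommand interface; a minimal sketch with Python's argparse (the command and subcommand names here are illustrative, not the tool's actual names):

```python
import argparse

# One entry command; the function is chosen by a subcommand argument.
parser = argparse.ArgumentParser(prog="sqlgen")
sub = parser.add_subparsers(dest="func", required=True)
sub.add_parser("tablelines").add_argument("tables", nargs="+")
sub.add_parser("minmax").add_argument("columns", nargs="+")

args = parser.parse_args(["tablelines", "customers", "orders"])
print(args.func, args.tables)  # tablelines ['customers', 'orders']
```

This keeps only one name in the system namespace while exposing all 20 functions.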
  22. 22. III. Tricky functions to see the values of DB 11 slides 22
  23. 23. What is the most concise way to see the values? Ø Column names don’t usually tell whether the values are : Ø substance names (man, woman, Japan, USA, ...) or Ø coded values (1, 2, JP, US, ...). Ø The column relations across tables are hard to see. Ø Knowing the special/error values is a craft. Seeing the concrete values is a must. 23
  24. 24. Idea to get 4 values from each column 24 What is a good, simple method to get some typical values from a column, whatever its data type (text, number, date)? (1) Color the values whose first character is the minimum character. (2) From the colored values, extract the minimum and the maximum. (3) From the uncolored values, extract the minimum and the maximum. (4) Those 4 values *would* tell the column’s characteristics well :)
  25. 25. The Venn diagram and SQL statements
      All the values from a column C of a table T split into two sets:
      • those whose first character is the minimum:
        select C from T where left(C,1) = (select min(left(C,1)) from T)
        its minimum v11 and maximum v12:
        select min(C), max(C) from T where left(C,1) = (select min(left(C,1)) from T)
      • all the others:
        select C from T where left(C,1) != (select min(left(C,1)) from T)
        its minimum v21 and maximum v22:
        select min(C), max(C) from T where left(C,1) != (select min(left(C,1)) from T)
      25 (Skip this page unless time allows.)
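The statements above can be run as-is; here is a sketch against sqlite3, which lacks left(), so substr(C,1,1) plays that role (the sample values are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (C text)")
cur.executemany("insert into T values (?)",
                [("ZZZ",), ("JP",), ("US",), ("001",), ("099",)])

# sqlite has no left(); substr(C,1,1) is the equivalent.
first_min = "(select min(substr(C,1,1)) from T)"
v11, v12 = cur.execute(
    f"select min(C), max(C) from T where substr(C,1,1) = {first_min}").fetchone()
v21, v22 = cur.execute(
    f"select min(C), max(C) from T where substr(C,1,1) != {first_min}").fetchone()
print(v11, v12, v21, v22)  # 001 099 JP ZZZ
```

Note how the special value "ZZZ" and the real codes both survive among the 4 values, while a plain min/max over the whole column would have shown only "001" and "ZZZ".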
  26. 26. Are the 4 values enough to see a column? 26 (Skip this page unless time allows.) • 2 values (e.g. min/max) would not work :( • 4 values can occasionally mislead, but so far they actually work well, as shown later. • How about 5, 6, or more values? • The min/max from the 3rd set can be added. • Indeed good for seeing various/lengthy text values :) • But it becomes less simple, requiring complex SQL. • And much computation time, as I once tried :(
  27. 27. Applied to the whole columns 27 (Figure: the 4 values for every column, annotated with time, user, and weight dimensions, in non-numeric vs. numeric order.)
  28. 28. What if only 2 values? (select min(C), max(C) from T) 28 (Skip this page unless time allows.) It hid the meaningful minimum 2014-07-07; hid the meaningful minimum “-9990”; hid the meaningful minimum “-5”; hid the existence of “00000”. Only 1 country code can be seen due to the existence of the special value “ZZZ”. This table only assures that at least 2 distinct values exist for each original column. (3 or 4 values instead of 2 would be desirable.)
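The masking effect is easy to reproduce; a sketch with sqlite3 and invented country codes including a special value "ZZZ":

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (country text)")
cur.executemany("insert into T values (?)",
                [("JPN",), ("USA",), ("ZZZ",)])  # 'ZZZ' is a special value

# Plain min/max: 'ZZZ' swallows the maximum, so only one
# real country code ('JPN') remains visible.
lo, hi = cur.execute("select min(country), max(country) from T").fetchone()
print(lo, hi)  # JPN ZZZ
```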
  29. 29. Complement: SQL statements. 29
  30. 30. Relations found sharing same codes 30
  31. 31. Special/anomalous values 31 Cf. random sampling
  32. 32. How about the random sampling? 1. SQL statement building 2. The SQL output (part of the results) 32
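A random-sampling statement like the one built here can be sketched as follows (the syntax is DB-dependent: sqlite and PostgreSQL use random(), MySQL uses rand(); the table is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (id integer)")
cur.executemany("insert into T values (?)", [(i,) for i in range(100)])

# Shuffle the rows and keep the first 5: a simple random sample.
sample = cur.execute("select id from T order by random() limit 5").fetchall()
print(sample)  # 5 rows drawn at random
```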
  33. 33. Randomness helps to see column relevance • “Age” and “marriage” are highlighted in yellow. • Probably, 1 means married and 2 means unmarried. 33
  34. 34. Estimating how 2 tables differ 34 (Skip this page unless time allows.) Assume you can access both “the running DB” and its exported data. Exporting may take a lot of time, so the data changes over time. Then how can you estimate the number of records that differ between the two? Establishing such a method is required. I tried using the SHA-1 function to see the difference. The difference in line counts was only 6 lines, but the tables actually differ by somewhere in [83, 441] records at 95% confidence. Each of the 3 values on a row can be regarded as a Gaussian variable whose variance is determined by the number of records of each table: T1, T2, and (T1-T2) U (T2-T1), respectively. By changing the conditions, 12 repeated measurements were obtained. The record numbers can then be estimated via the population-variance estimation shown in the lower part of the table.
  35. 35. Continued from the previous slide. 35 (Skip this page unless time allows.) If you sum up N i.i.d. variables from a distribution with mean zero and variance one, the sum obeys a distribution well approximated by the normal distribution with mean zero and variance N. If you obtain such sums in K repetitions, how can the number N be estimated backward? It can be estimated from the total of the squares of those K values, divided by the 2.5%-tile and 97.5%-tile points of the chi-square distribution with K degrees of freedom. And an easy way to obtain a variable with mean zero and variance one is to transform the SHA-1 value into [ -sqrt(3), sqrt(3) ].
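The transform at the end can be sketched directly (a sketch only; the actual hashSum command's details are not shown in the slides): map each SHA-1 digest to a uniform value on [-sqrt(3), sqrt(3)], which has mean zero and variance one, so a sum over N rows has variance close to N.

```python
import hashlib
import math

SQRT3 = math.sqrt(3)

def sha1_to_unit_var(s: str) -> float:
    """Map a string's SHA-1 digest to [-sqrt(3), sqrt(3)].
    Uniform on that interval, hence mean 0 and variance 1."""
    h = int(hashlib.sha1(s.encode()).hexdigest(), 16)
    u = h / (2**160 - 1)           # uniform in [0, 1]
    return (2 * u - 1) * SQRT3     # stretch to [-sqrt(3), sqrt(3)]

vals = [sha1_to_unit_var(f"row-{i}") for i in range(10000)]
mean = sum(vals) / len(vals)
var = sum(v * v for v in vals) / len(vals)
print(round(mean, 3), round(var, 3))  # close to 0 and 1
```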
  36. 36. IV. Summary 3 slides 36
  37. 37. Steps to know a DB toward analysis : 1. Knowing the tables (row numbers, table comparison). 2. Knowing the columns individually (the 4-value taking, random sampling). 3. Knowing the column connections/relations (determining all the same-code sharing). 4. Knowing how (row-wise) special conditions occur (random sampling from the special lines). Those above should be fulfilled before going beyond. 37
  38. 38. (Flow diagram) DB → SQL cmd generator → generated SQL cmd → extracted info (table info, column info, short-cutting operations) → findings before the main analysis: concrete values (✓ value formats, ✓ special/err values, ✓ columns’ relations, Ø meanings) → simpler table(s) (by column selecting, time/date narrowing, customer narrowing) → Ø visualization, Ø math methods → business value by the main analysis: big discovery from data + big business values
  39. 39. An application example. 1. You may have a lot of tables. 2. You understand each of their columns by : • seeing some of the concrete values, • seeing the special and anomalous values, • determining all of the same-code-sharing columns. 3. Thereafter, you can : 1. narrow down to modest-sized tables, 2. easily handle the data for visualizations, 3. summarize the data you need into one table that can be handled by many mathematical methods. 39
  40. 40. 40
  41. 41. Contributions (summary) 1. For exploring an unknown DB, Ø organized the milestones, Ø conceived the methods. 2. Proposed a beneficial software tool for Big Data. Ø No other such tools seem to exist except [Shimono 16], based on the surveys [Saltz, Shamshurin 16], [Kumar, Alencar 16]. 3. Reduced the labor to understand a DB, Ø by shrinking it from months to only a week. Ø “Knowing” latently dominates a data analysis project. 41
  42. 42. V. Extra Slides 9 slides 42
  43. 43. We must “understand DB contents” before any analysis. Reasons : 1. Effectiveness check for the analysis purpose. 2. Seeing typical/special/anomalous values. 3. Handling relations among columns. 4. Rebuilding another DB environment. 43
  44. 44. We focus on DB understanding. 1. Understanding the DB. 2. Building a new environment for analysis. 3. Business-related calculation (monthly sales, ...) and advanced analysis employing math-related methods. Reasons why we must understand the DB : 1. Effectiveness check for the analysis purpose. 2. Seeing typical/special/anomalous values. 3. Handling relations among columns. 4. Rebuilding another DB environment. ※ Note: Preprocessing exists everywhere, but we do not touch on this explicitly. 44
  45. 45. Squirrel SQL (since 2001) 45
  46. 46. Detail: Line Number Listing Example The command “lineNumber” can yield various types of SQL statements by utilizing command options such as -n and -t. The SQL output is designed to contain : 1) a sequence number, 2) table (and column) names, 3) the process time in seconds, 4) the time of the calculations. 46
  47. 47. ✓ A DB has tables, which have columns, which have values. Ø One needs to determine the column connections across tables. • One needs to see the values of each column, but how? 47 © 2013 Microsoft Corporation. All rights reserved.
  48. 48. How to see values in each column. 48
  49. 49. To output an “integrated” table by SQL : 1. A new command yields an SQL statement (output by “newCmd < tables.txt”). 2. The table returned by the query of that SQL statement. This output is very useful, but the SQL statement is too long to enter manually. Thus an SQL statement generator is desirable! 49
  50. 50. Deciphering so many columns at once. 50
  51. 51. Are the 4 values enough to see a column? 51 (Skip this page unless time allows.) • Only 2 values (e.g. min/max) would not work :( • Only 4 values may possibly mislead :( • Aligning more than 4 values : • the min/max from some (the third) set can be added; • indeed good for seeing various/lengthy text values :) • but much computation time, as I once tried :( • SQL may really need “second_min” and “second_max”. • Misc. : • null-value care is desirable; • the frequency numbers may be desirable; • the value-length information is helpful.
  52. 52. 52
  53. 53. Random Sampling, also weighting 53
  54. 54. Remaining issues, before building a new DB environment for analysis. Combinatorial explosion can occur in the calculations to reduce the redundancy of columns/rows. • Grasping all the redundant columns through knowing the relations inside a table: 1. two or more columns have the same values on every row; 2. the values of a column can be determined from other columns’ values. • Grasping all the redundant rows. • How can one know the condition under which a column has a null, special, anomalous, or rare value, when the other columns’ values seem to give a clue? 54
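The first kind of redundancy (two columns identical on every row) reduces to one counting query per column pair, which is why checking all pairs explodes combinatorially; a sketch with sqlite3 and an invented table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T (a int, b int, c int)")
cur.executemany("insert into T values (?,?,?)",
                [(1, 1, 10), (2, 2, 20), (3, 3, 30)])

# Columns a and b are redundant iff no row has a differing pair;
# 'is not' is sqlite's null-safe inequality.
mismatch = cur.execute("select count(*) from T where a is not b").fetchone()[0]
print(mismatch)  # 0: column b duplicates column a
```

For a table with n columns, n(n-1)/2 such queries are needed, and dependency checks (a column determined by others) grow even faster.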
  55. 55. 55
  56. 56. (Flow diagram, repeated from slide 38) DB → SQL cmd generator → generated SQL cmd → extracted info (table info, column info, short-cutting operations) → findings before the main analysis: concrete values (✓ value formats, ✓ special/err values, ✓ columns’ relations, Ø meanings) → simpler table(s) (by column selecting, time/date narrowing, customer narrowing) → Ø visualization, Ø math methods → business value by the main analysis: big discovery from data + big business values
