Despite the existence of data analysis tools such as R, SQL, Excel, and others, they are still insufficient to cope with today's big data analysis needs.
The author proposes a CUI (Character User Interface) toolset with dozens of functions to neatly handle tabular data in TSV (Tab Separated Values) files.
It implements many basic and useful functions that have not been implemented in existing software; each function borrows ideas from the Unix philosophy and covers the most frequent pre-analysis tasks during the initial exploratory stage of data analysis projects.
It also greatly speeds up basic analysis tasks, such as drawing cross tables and Venn diagrams, whereas existing software inevitably requires rather complicated programming and debugging even for these basic tasks.
Here, tabular data mainly means TSV (Tab-Separated Values) files, as well as other CSV (Comma-Separated Values)-type files, all of which are widely used for storing data and suitable for data analysis.
Proceeding without knowing the above may cause tedious backtracking.
A Hacking Toolset
for Big Tabular Files
IEEE BigData 2016, Washington DC, Dec 5.
1. First of all..
What is my software like?
• Handles tabular data files (esp. TSV)
in a neat and advanced style.
• A set of more than 50 CUI functions.
• A supplement to other analysis software,
including Excel, SQL, R, Python, etc.
• Fast enough compared with existing software.
As a result,
• Data analysis projects become
super swift at
initial deciphering and pre-processing,
thanks to 50+ useful functions.
• Each function, being CUI,
has several command-line options,
often handles the heading line well,
often utilizes color properly,
and often handles the Ctrl+C signal wisely.
Coloring is used a lot.
For example, periodic coloring lets you
easily read the big numbers of this big data age.
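That periodic coloring can be sketched in a few lines of Python; the following is a hypothetical re-implementation with ANSI escape codes (the actual toolset's color scheme may differ):

```python
# Alternate ANSI colors every three digits so long numbers become
# readable at a glance (hypothetical sketch of "periodic coloring").

def color_number(s: str) -> str:
    """Color a string of digits in groups of three from the right."""
    CYAN, YELLOW, RESET = "\033[36m", "\033[33m", "\033[0m"
    groups = []
    while s:                      # split into 3-digit groups from the right
        groups.append(s[-3:])
        s = s[:-3]
    groups.reverse()
    return "".join((CYAN if i % 2 == 0 else YELLOW) + g + RESET
                   for i, g in enumerate(groups))

print(color_number("1234567890"))  # digit groups 1|234|567|890 in alternating colors
```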
As of 2016-08-18.
Why did I create it?
• Reducing the labor of data pre-processing
pays off a lot!
• You have given up many analyses and hypothesis
verifications just because they take time.
• But this situation would change dramatically!
• The author noticed that:
• Debugging/testing of programs takes a lot of time.
• Thus, if such unnoticed routines are neatly programmed and
prepared beforehand, the change would happen.
• So I designed a kind of philosophy to implement those
routines, and I am programming them.
• And actually, swift data analysis is in high demand.
Design philosophy
• Each command works independently.
• The name of a command is a combination
of at most 2 English words, with some exceptions.
• Swiftness is a key factor in many aspects:
• To recall the existence and the name of a command.
• Ease of understanding how a program works.
Comparison with other software
• Unix/Linux :
• Not mature enough for tabular files.
• Note: the Unix philosophy is worth utilizing.
• Excel :
• Problems exist in transparency, in reproducibility,
and in coordination with other software.
• SQL :
• Requires elaborate table design.
• R, Python :
• Requires loading time when handling many large files.
2. Provided Functions.
Examples of functions.
• Venn Diagram (complex condition cardinalities)
• Cross-table (2-way contingency table)
• Column extraction (simpler than AWK)
• Table looking-up (cf. SQL join, Excel vlookup)
• Key-Value handling (e.g. sameness check for 2 KV-tables)
• Random sampling, Shuffling (line-wise)
• Quantiles, Integrated histogram
Other useful functions
• Putting colors on character strings.
• Self-manual, also with a shorter version.
• Heading-line handling (omitting, utilizing).
• Showing the ongoing result on Ctrl+C.
• Decimal floating-point calculation where appropriate.
3. Function Examples
colsummary : grasping what each column is.
1st (white) : column number
2nd (green) : how many distinct values in the column
3rd (blue) : numerical average (optionally can be omitted)
4th (yellow) : column name
5th (white) : value range ( min .. max )
6th (gray) : most frequent values ( quantity can be specified. )
7th (green) : frequency ( highest .. lowest, x means multiplicity )
When you get new data, deciphering the data starts.
Extracting the information above is usually
enough to go on to the next analysis step.
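A minimal sketch of the kind of report colsummary produces (distinct counts, numeric averages where applicable, value ranges, and most frequent values), assuming a TSV with a heading line. Illustrative only; the real command adds coloring and many options:

```python
# Sketch of a colsummary-style per-column report over a TSV string.
import csv, io
from collections import Counter

def colsummary(tsv_text: str):
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    report = []
    for i, name in enumerate(header):
        col = [r[i] for r in body]
        cnt = Counter(col)
        try:                       # numeric average only if all values parse
            avg = sum(map(float, col)) / len(col)
        except ValueError:
            avg = None
        report.append({
            "column": i + 1,
            "name": name,
            "distinct": len(cnt),
            "average": avg,
            "range": (min(col), max(col)),
            "top": cnt.most_common(3),   # most frequent values
        })
    return report

sample = "city\tpop\nA\t10\nB\t20\nA\t10\n"
for row in colsummary(sample):
    print(row)
```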
(Example output: the numbers of medals Japan won in past Olympic Games.
The annotations point at the number of distinct values in each column, and
the frequencies of the most and the fewest values, where the multiplicity of
each frequency appears after "x". The biggest hurdle at data deciphering
quickly goes away with this command.)
Freq : how many lines equal each string?
Functionally equivalent to Unix/Linux “ sort | uniq -c ” ,
but much faster in computation.
Many sub-functions, such as specifying the output order,
can be added, and the author has already implemented them.
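Conceptually, freq does what the following Python sketch does; the output-order option here is an illustrative stand-in for the command's real switches:

```python
# Sketch of a freq-like counter, equivalent to `sort | uniq -c`.
from collections import Counter

def freq(lines, order="count"):
    cnt = Counter(lines)
    if order == "count":            # most frequent first
        return cnt.most_common()
    return sorted(cnt.items())      # alphabetical by value

data = ["apple", "pen", "apple", "apple", "pen", "pineapple"]
print(freq(data))
# [('apple', 3), ('pen', 2), ('pineapple', 1)]
```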
2-way contingency table
• Given a 2-column table, counting the number of lines
that consist of (ui, vj) and tabulating them often occurs.
• You can do it with the Excel “Pivot Table” function.
• Note that it is very hard to perform with a single SQL query.
2-way contingency table (crosstable)
• Almost impossible with a single SQL query.
• The Excel pivot table is actually tedious/error-prone.
crosstable (2-way contingency table) :
provides the cross-table from a
2-column table.
(Add blue color on “0”) (Extract the 3rd and 4th columns)
You may draw many cross-tables from one table of data.
The crosstable command provides cross-tables very quickly.
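The counting behind a cross-table can be sketched as follows; this is an illustrative re-implementation, not the crosstable command itself:

```python
# Build a 2-way contingency table from a 2-column table of (u, v) pairs.
from collections import Counter

def crosstable(pairs):
    cnt = Counter(pairs)                       # count each (u, v) pair
    rows = sorted({u for u, _ in pairs})
    cols = sorted({v for _, v in pairs})
    table = [[""] + cols]                      # header row
    for u in rows:
        table.append([u] + [cnt[(u, v)] for v in cols])
    return table

pairs = [("tokyo", "rain"), ("tokyo", "sun"),
         ("osaka", "rain"), ("tokyo", "rain")]
for row in crosstable(pairs):
    print(*row, sep="\t")
```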
cols : extracting columns
• Easier than AWK and Unix cut.
cols -t 2 ⇒ moves the 2nd column to the rightmost.
cols -h 3 ⇒ moves the 3rd column to the leftmost.
cols -p 5,9..7 ⇒ shows the 5th, 9th, 8th, and 7th columns.
cols -d 6..9 ⇒ shows all except the 6th, 7th, 8th, and 9th columns.
-d stands for deleting, -p for printing,
-h for head, -t for tail.
“cols” – column extraction.
The existing commands are not enough:
cut : cannot realize the intention of “cut -f 2,1” (it never reorders columns).
awk : a one-line program is easy to write but insufficient.
“cols” has many functions :
Range specification : “1..5” -> 1,2,3,4,5 ; “5..1” -> 5,4,3,2,1
Negative specification : -N means the N-th column from the rightmost.
Deleting columns : -d 4 -> omits the 4th column.
Head/tail : -h 3 -> moves the 3rd column to the leftmost.
You can also specify columns by the names given in the 1st line.
Fast enough. ( awk > cols > cut )
The author also prepared an awk command-sentence generator,
because awk is roughly twice as fast as “cols”.
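The column-specification rules above (comma lists, backward ranges like "9..7", negative indices from the rightmost column) can be sketched in Python; a hypothetical re-implementation, not the actual cols code:

```python
# Parse a cols-style column spec and extract those columns from a row.

def parse_spec(spec: str, ncols: int):
    idx = []
    for part in spec.split(","):
        if ".." in part:
            a, b = (int(x) for x in part.split(".."))
            step = 1 if a <= b else -1          # ranges may run backwards
            nums = range(a, b + step, step)
        else:
            nums = [int(part)]
        for n in nums:
            idx.append(n if n > 0 else ncols + 1 + n)  # -1 -> last column
    return idx

def cols_print(row, spec):
    return [row[i - 1] for i in parse_spec(spec, len(row))]

row = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
print(cols_print(row, "5,9..7"))   # ['e', 'i', 'h', 'g']
print(cols_print(row, "-1"))       # ['i']
```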
Determining the cardinality of all the regions
is important before the main analysis.
venn4 (Cardinalities for complex conditions)
• The cardinalities of the data sets concerned are important.
• As a technique, determining the cardinalities of the records of your data files at
the beginning greatly helps the following tasks, because what one of
your data files is can easily be told from its number of records.
• What venn4 does is actually difficult to do by hand. Try it if you doubt!
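What venn4 computes can be illustrated with Python sets: the cardinality of every membership pattern (every Venn-diagram region) over the given sets. A sketch for two sets; the real command, per its name, presumably handles up to four files:

```python
# Count the keys falling in each Venn region (membership pattern).
from itertools import product

def venn_cardinalities(named_sets):
    names = list(named_sets)
    universe = set().union(*named_sets.values())
    regions = {}
    for pattern in product([False, True], repeat=len(names)):
        if not any(pattern):
            continue                       # skip the "in no set" region
        region = universe.copy()
        for name, inside in zip(names, pattern):
            region &= named_sets[name] if inside else universe - named_sets[name]
        label = "".join("+" if b else "-" for b in pattern)
        regions[label] = len(region)
    return regions

A = {1, 2, 3}
B = {2, 3, 4}
print(venn_cardinalities({"A": A, "B": B}))
# {'-+': 1, '+-': 1, '++': 2}
```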
Comparison with other packages

                Excel   R      Pandas   SQL       Hadoop    Unix   Our soft
Ease of use     ◎      ○      △       △        ×        ○     ○
Transparency    ×      ○      ○       depends   depends   ○     ○

(The remaining rows of the table, covering statistics/matrix functions,
large-size data handling, and what one needs to know, were garbled in extraction.)

◎: very good ; ○: good ; △: moderate ; ×: bad.
How you will use it.
1. Get the data or get the access to data.
2. Transform into TSV if it is necessary.
3. Manipulate it with operations such as
extracting the columns/lines that you like, or randomly sampling.
4. Accumulate the findings,
e.g. copy/paste into Excel or PowerPoint.
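Steps 2 and 3 above can be sketched in Python; file contents and column choices are illustrative:

```python
# Convert CSV text to TSV, then extract two columns and sample lines.
import csv, io, random

def csv_to_tsv(csv_text: str) -> str:
    rows = csv.reader(io.StringIO(csv_text))
    return "\n".join("\t".join(r) for r in rows)

def sample_columns(tsv_text: str, col_idx, k, seed=0):
    header, *body = [l.split("\t") for l in tsv_text.splitlines()]
    random.seed(seed)                      # deterministic for the demo
    rows = random.sample(body, k)          # random line sampling
    return [[r[i] for i in col_idx] for r in rows]

tsv = csv_to_tsv("id,city,pop\n1,A,10\n2,B,20\n3,C,30\n")
print(sample_columns(tsv, [1, 2], 2))      # 2 random (city, pop) pairs
```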
Grasping tabular data
• By taking a look with an editor.
• By taking a look with the Unix less command.
• By reading the description documents.
• Extracting lines 1, 2, 4, 8, 16, 32, .. or 1, 10, 100, ..
• Randomly sampling lines with some low probability.
Anyway, whether the data is suitable enough
for some specific operation
is fairly difficult to tell. It requires a lot of checking.
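The "extract lines 1, 2, 4, 8, 16, .." trick above can be sketched as follows: keep only the lines whose 1-based number is a power of two, so a quick glance covers the whole file at logarithmic cost.

```python
# Keep lines 1, 2, 4, 8, 16, ... for a logarithmic skim of a big file.

def exponential_sample(lines):
    out, target = [], 1
    for i, line in enumerate(lines, start=1):
        if i == target:
            out.append(line)
            target *= 2        # next line to keep: double the position
    return out

print(exponential_sample([f"line{i}" for i in range(1, 20)]))
# ['line1', 'line2', 'line4', 'line8', 'line16']
```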
Prepared function groups.
• Transforming formats
• Grasping what each given tabular data file is like
• Column-specific operations
• Treating a table as a 2-column key + value table
• Combining tables, such as looking up
• Normalizing functions of relational databases
• Seeing/comparing the distribution(s) of values
Interface design policy
• Each command name consists of 1 or 2 words.
• Option switches are utilized a lot.
-a -b .. -z ← Various minor additional functions.
-A -B .. -Z ← Great mode change in functions.
-= ← Treats the beginning line
properly as a heading line, i.e., a list of column names.
-~ ← Toggles some small behaviors.
• Swiftness is the important factor :
To recall the existence and the names of the functions.
To understand what will happen, for users.
As speedy as other Unix commands.
Ex-2) silhouette (integrated histogram)
1. Suppose you are interested in the distribution
of the ratio of the follower number to the
following number of each Twitter account.
2. To properly understand the ratio distribution,
stratifying according to the number of
followers of each account is also considered :
≧5000, ≧500 from the rest, ≧50 ditto, and the others.
3. Given the data, the silhouette command
produces the PDF file shown on the right.
4. You can read that, with 48% probability, an
account with ≧5000 followers has followers
numbering more than 2 times its following number.
(On average, 95% of Twitter accounts do not.)
The command name comes
from the image of
people of various heights
aligned in the order
of height; see the curve
connecting their heads.
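The silhouette curve can be computed numerically: sort the values ("people aligned by height") and read the value at each percentile. An illustrative computation only; the real command renders a PDF plot:

```python
# Read sorted values at given percentiles: the integrated-histogram curve.

def silhouette_points(values, percents=(0, 25, 50, 75, 100)):
    xs = sorted(values)                     # align "people" by height
    points = {}
    for p in percents:
        # nearest index for percentile p over len(xs) sorted values
        i = min(len(xs) - 1, int(p / 100 * (len(xs) - 1) + 0.5))
        points[p] = xs[i]
    return points

ratios = [0.1, 0.5, 1.0, 2.0, 3.0, 8.0, 0.2, 4.0, 0.8]
print(silhouette_points(ratios))
# {0: 0.1, 25: 0.5, 50: 1.0, 75: 3.0, 100: 8.0}
```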
Features and summaries
Useful functions for analyses in TSV,
especially in deciphering and pre-processing.
Handles a heading line properly if it exists.
Speed comparable to existing software.
Can be used on any machine with Perl installed.
The ongoing result is displayed on Ctrl+C.
The software will appear on GitHub.
(Figure: cumulative histograms over millions of Twitter accounts.
Green : following count ; Blue : follower count.)