Who Wants To Be a Munger
1. So Who Wants to Be a Munger?
Dana
LSRC 2009
Friday, August 28, 2009
2. Who am I?
• Dana
• 8 years in corporate world
• Responsible for munging a massive
amount of data every day
• Now develop Rails Applications for a
living
3. Why is this important?
• We live in a data-driven society
• Companies feed on reports
• Clients have data and want ways to display it
• Important to know what data you have and what needs to happen with it
• The more you know about the final output, the easier you can manipulate the data
5. The Rule of 3
In - Munge - Out
• Read data into some construct
• anything that understands each()
• Transform the data
• Output transformed data
• some format that understands
puts()
7. A Basic Munging Script
open("new_numbers.txt", "w") do |f|     # The output file
  File.foreach("numbers.txt") do |n|    # The input file
    n.capitalize!                       # The transformation
    f.puts n
  end
end

numbers.txt    new_numbers.txt
one            One
two            Two
three          Three
four           Four
five           Five
8. Simplify
• Don’t confuse reading with munging
• May have to read various files for the same output
• Use Ruby’s each() and puts() methods to your advantage

# input:  pass some object to munge
# output: pass another object as output
def munge(input, output)
  input.each do |record|
    record.capitalize!
    output.puts record
  end
end
9. Why this is better
# Munge an in-memory list to the screen
names = %w[dana james sarah storm gypsy]
stream = $stdout
munge(names, stream)

# Munge one file into another with the same method
numbers = open("numbers.txt")
stream = open("new_numbers.txt", "w")
munge(numbers, stream)
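As a rough sketch of why this flexibility matters, the same method can also be pointed at in-memory objects, which is handy when experimenting without files. This assumes a non-mutating variant of the slide's munge (capitalize instead of capitalize!) and uses StringIO as a stand-in for real files; neither choice comes from the talk.

```ruby
require "stringio"

# Non-mutating variant of the slide's munge, repeated so the sketch runs alone.
def munge(input, output)
  input.each do |record|
    output.puts record.capitalize
  end
end

names   = %w[dana james]              # an Array responds to each()
numbers = StringIO.new("one\ntwo\n")  # so does an IO-like object
stream  = StringIO.new                # anything with puts() can be the output

munge(names, stream)
munge(numbers, stream)

print stream.string
```

Both inputs end up in the same output stream, one capitalized record per line.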
10. each() and puts()
# Anything with each() can be an input
class Rubyist
  def each
    yield "i"
    yield "love"
    yield "ruby"
  end
end

# Anything with puts() can be an output
class Speaker
  def puts(words)
    `say #{words}`
  end
end
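To see the two classes cooperate without relying on the say command, here is a minimal sketch that swaps Speaker for a hypothetical Recorder class that simply collects what it is given. Recorder is an illustration only and is not part of the talk; munge is again a non-mutating variant of the slide 8 method.

```ruby
class Rubyist
  def each
    yield "i"
    yield "love"
    yield "ruby"
  end
end

# Hypothetical stand-in for Speaker: remembers the words instead of speaking them.
class Recorder
  attr_reader :lines

  def initialize
    @lines = []
  end

  def puts(words)
    @lines << words
  end
end

# Non-mutating variant of the munge method from slide 8.
def munge(input, output)
  input.each do |record|
    output.puts record.capitalize
  end
end

recorder = Recorder.new
munge(Rubyist.new, recorder)
p recorder.lines
```

Neither class inherits from anything special; responding to each() or puts() is all munge asks for.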
11. Reaching ultimate munging power
class Munger
  def initialize(input, output)
    @input = input
    @output = output
  end

  # Yields each record to the block and writes whatever the block returns.
  # Returning nil from the block drops the record.
  def munge
    @input.each do |record|
      munged = yield(record)
      @output.puts munged unless munged.nil?
    end
  end
end

m = Munger.new(open("numbers.txt"),
               open("new_numbers.txt", "w"))
m.munge do |n|
  n.strip!
  if n =~ /\At/i
    n.reverse
  elsif n == "four"
    nil
  else
    n.capitalize
  end
end
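The sketch below runs the same Munger against in-memory data so the effect of each branch is visible. It assumes StringIO objects in place of the two files and uses strip rather than strip! inside the block; both are choices made for this illustration.

```ruby
require "stringio"

class Munger
  def initialize(input, output)
    @input = input
    @output = output
  end

  def munge
    @input.each do |record|
      munged = yield(record)
      @output.puts munged unless munged.nil?
    end
  end
end

input  = StringIO.new("one\ntwo\nthree\nfour\nfive\n")
output = StringIO.new

Munger.new(input, output).munge do |n|
  n = n.strip
  if n =~ /\At/i
    n.reverse          # records starting with "t" are reversed
  elsif n == "four"
    nil                # returning nil skips the record entirely
  else
    n.capitalize
  end
end

print output.string
```

The block decides the transformation, so the class itself never needs to change from one report to the next.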
12. Data
• Different kinds of data
• Structured - record oriented data
• Unstructured
• Most difficult to work with
• Vast majority of data reading is
pattern matching
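As a small illustration of reading by pattern matching, the sketch below pulls three fields out of a label line with one regular expression. The pattern, the parse_label helper, and the field names are assumptions made for this example; the sample lines mirror the style of the report shown later in the deck.

```ruby
# A word, then a numeric code, then free text to the end of the line.
LABEL_LINE = /\A(\w+)\s+(\d+)\s+(.+)\z/

# Hypothetical helper: returns a hash of fields, or nil when the line
# does not look like a label line.
def parse_label(line)
  match = LABEL_LINE.match(line.strip)
  return nil unless match

  { kind: match[1], code: match[2], name: match[3] }
end

p parse_label("Salesperson 22 BILL PRICE")
p parse_label("Customer 1014 KECK'S MEAT & FOODSERVICE")
p parse_label("--------------- ----------")   # not a label line
```

Lines that do not match return nil, which makes it easy to skip rulers and other noise.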
14.
require "munger"

class RossReader
  def initialize(file)
    @file = file
  end

  def each
    open(@file) do |report|
      report.each do |line|
        # Stop at the grand totals; skip blank lines, rulers, and (sub)totals
        break if line =~ /\AReport Totals/
        next if line =~ /\A\s+\z/ or
                line =~ /\A\s+-/ or
                line =~ /\b(sub)?totals\b/i
        yield line
      end # report.each
    end # open
  end # def
end

report = Munger.new(RossReader.new("sample_report.txt"),
                    open("ross_writer.txt", "w"))
report.munge do |n|
  n
end
16. Ugly Headers
SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 6
Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
Salesperson 22 BILL PRICE
Customer 1014 KECK'S MEAT & FOODSERVICE
SA Sort Code 4.42 PORK RIBS
SAA_R_009 26-Mar-2009 15:26 1: BOB's BILLARD HALL Page 7
Part Code Description Qty Period QTY LastYr QTY Var Lbs Period Lbs LastYr Lbs Var
--------------- ------------------------ ---------- ---------- ------- ---------- ---------- -------
17. unpack()
• Designed for breaking up binary data
• Very handy for this kind of fixed-width work
• unpack() takes in a format string
• You describe what the data looks like
• “a” means ASCII character
• “x” means skip

"cookies and cream".unpack("a7xa3xa5")
# => ["cookies", "and", "cream"]

"--- --- -----".split.
  map { |d| "a#{d.length}" }.join("x")
# => "a3xa3xa5"
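Putting the two snippets together, a ruler line of dashes can describe how to cut up a data row. The sketch below uses made-up column widths and values, and builds the row with ljust and rjust so the padding is explicit.

```ruby
# A ruler like the dashed line under the report's column names (widths invented).
ruler = "---------- ----- --------"

# Each run of dashes becomes "aN"; the single space between columns becomes "x".
format = ruler.split.map { |dashes| "a#{dashes.length}" }.join("x")

# A sample fixed-width row laid out to match the ruler (values are hypothetical).
row = "PORK RIBS".ljust(10) + " " + "12".rjust(5) + " " + "340.50".rjust(8)

fields = row.unpack(format).map { |f| f.strip }

p format
p fields
```

Because the format is derived from the ruler, a report whose column widths change needs no code changes.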
18.
def initialize(file)
  @file = file
  @headers = nil
  @format = nil
end

def each
  open(@file) do |report|
    # The first four lines of the report are the header block
    parse_header(Array.new(4) { report.gets })
    report.each do |line|
      ...
    end # report.each
  end # open
end # def

def parse_header(headers)
  # The ruler line gives the column widths; the line above it gives the names
  @format = headers[3].split.map { |col| "a#{col.size}" }.join("x")
  @headers = headers[2].unpack(@format).map { |f| f.strip }
end
19.
def initialize(file)
  @file = file
  @in_header = false
  @headers = nil
  @format = nil
end

def each
  open(@file) do |report|
    parse_header(Array.new(4) { report.gets })
    report.each do |line|
      if line =~ /\ASAA_R/
        # A repeated page header starts here
        @in_header = true
      elsif @in_header
        # The ruler line is the last line of the header
        @in_header = false if line =~ /\A-/
      else
        ...
      end
    end # report.each
  end # open
end # def
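The header-skipping logic is a two-state flag, and it can be tried in isolation. The sketch below assumes a tiny made-up report held in a StringIO and leaves out parse_header, so the first page header is skipped by the same flag as the later ones.

```ruby
require "stringio"

# A tiny invented report with the page header repeated on each "page".
report = StringIO.new(<<~REPORT)
  SAA_R_009 page 6
  Part Code   Qty
  ----------- ---
  PORK RIBS    12
  SAA_R_009 page 7
  Part Code   Qty
  ----------- ---
  BEEF RIBS     7
REPORT

in_header  = false
data_lines = []

report.each do |line|
  if line =~ /\ASAA_R/
    in_header = true                       # a page header begins
  elsif in_header
    in_header = false if line =~ /\A-/     # the ruler ends the header
  else
    data_lines << line.chomp               # everything else is data
  end
end

puts data_lines
```

Only the two data rows survive; both copies of the header are dropped.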
21. assoc()
• lookup method
• call it on an array of arrays
• pass in the data you want to look up
• walks through the outer array and returns the inner array that starts with the argument
• slower than a hash - don’t use on LARGE amounts of data
• assoc() becomes a poor man’s ordered hash

names = [["James", "Gray"], ["Dana", "Gray"]]
p names.assoc("James")
# => ["James", "Gray"]
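To illustrate the "poor man's ordered hash" idea, the sketch below updates a pair in place when the key already exists and appends it otherwise, so insertion order is preserved. The remember helper and the second salesperson value are hypothetical.

```ruby
# Hypothetical helper: insert or update a [key, value] pair, keeping order.
def remember(pairs, key, value)
  if (pair = pairs.assoc(key))
    pair[-1] = value        # key seen before: overwrite the value in place
  else
    pairs << [key, value]   # new key: append at the end
  end
end

categories = []
remember(categories, "Salesperson", "22 BILL PRICE")
remember(categories, "Customer", "1014 KECK'S MEAT & FOODSERVICE")
remember(categories, "Salesperson", "23 JANE DOE")   # invented replacement value

p categories
```

"Salesperson" keeps its first position even though its value was replaced, which is the behavior an ordered hash would give.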
22.
def initialize(file)
  ...
  @categories = []
end

def each
  open(@file) do |report|
    ...
        if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
          # A category line (Salesperson, Customer, ...): remember its latest value
          if cat = @categories.assoc($1)
            cat[-1] = $2
          else
            @categories << [$1, $2]
          end
        else
          # A data line: pair each header with its field, then add the categories
          yield @headers.zip(line.unpack(@format).map { |f| f.strip }) + @categories
        end
      end
    end # report.each
  end # open
end # def
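To make the shape of a yielded record concrete, the sketch below runs the category pattern and the zip step on their own. The headers, format string, and data values are invented for the example; only the regular expression and the zip expression follow the slide.

```ruby
CATEGORY_LINE = /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/

headers    = ["Part Code", "Description", "Qty Period"]   # invented column names
format     = "a10xa12xa10"                                 # invented column widths
categories = []

# An indented category line, as it would come from the report (newline included).
category_line = "  Salesperson     22 BILL PRICE   \n"
if (match = CATEGORY_LINE.match(category_line))
  categories << [match[1], match[2]]
end

# A fixed-width data line laid out to match the format above (values are made up).
data_line = "4.42-01".ljust(10) + " " + "PORK RIBS".ljust(12) + " " + "125".rjust(10)

record = headers.zip(data_line.unpack(format).map { |f| f.strip }) + categories
p record
```

Each record is an array of [name, value] pairs: the column fields first, followed by whichever categories are currently in effect.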
24.
class RossReader
  def initialize(file)
    @file = file
    @in_header = false
    @headers = nil
    @format = nil
    @categories = []
  end

  def parse_header(headers)
    @format = headers[3].split.map { |col| "a#{col.size}" }.join("x")
    @headers = headers[2].unpack(@format).map { |f| f.strip }
  end

  def each
    open(@file) do |report|
      parse_header(Array.new(4) { report.gets })
      report.each do |line|
        if line =~ /\ASAA_R/
          @in_header = true
        elsif @in_header
          @in_header = false if line =~ /\A-/
        else
          break if line =~ /\AReport Totals/
          next if line =~ /\A\s+\z/ or
                  line =~ /\A\s+-/ or
                  line =~ /\b(sub)?totals\b/i
          if line =~ /\A\s+(\w[\w\s]+?)\s+(\d.+?)\s+\z/
            if cat = @categories.assoc($1)
              cat[-1] = $2
            else
              @categories << [$1, $2]
            end
          else
            yield @headers.zip(line.unpack(@format).map { |f| f.strip }) + @categories
          end
        end
      end # report.each
    end # open
  end # def
end
28.
require "munger"
require "ross_reader"
require "csv_writer"

report = Munger.new(RossReader.new(ARGV.shift), CSVWriter.new)
report.munge do |record|
  record.each do |field|
    # Drop thousands separators: "1,234" becomes "1234"
    if field.last =~ /\A(?:\d+,)+\d+k?\z/
      field.last.delete!(",")
    end
    # Expand a trailing "k": "12k" becomes "12000"
    field.last.sub!(/\A\d+k\z/) { |num| num.to_i * 1000 }
  end
  record
end
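The csv_writer file is not shown in the slides, so the following is only a guess at what such a writer might look like: a hypothetical CSVWriter whose puts() takes an array of [name, value] pairs, writes a header row once, and then one row per record. The quoting is hand-rolled to keep the sketch dependency-free (Ruby's csv library would be the usual choice), and clean_number is a hypothetical non-mutating version of the number cleanup above.

```ruby
require "stringio"

# Hypothetical writer; the talk's actual csv_writer.rb is not shown.
class CSVWriter
  def initialize(io = $stdout)
    @io = io
    @wrote_header = false
  end

  # record is an array of [name, value] pairs.
  def puts(record)
    unless @wrote_header
      @io.puts to_csv(record.map { |name, _| name })
      @wrote_header = true
    end
    @io.puts to_csv(record.map { |_, value| value })
  end

  private

  # Quote a field only when it contains a comma, a quote, or a newline.
  def to_csv(fields)
    fields.map { |field|
      text = field.to_s
      text =~ /[",\n]/ ? '"' + text.gsub('"', '""') + '"' : text
    }.join(",")
  end
end

# Non-mutating version of the cleanup on the slide.
def clean_number(text)
  cleaned = text =~ /\A(?:\d+,)+\d+k?\z/ ? text.delete(",") : text
  cleaned.sub(/\A\d+k\z/) { |num| num.to_i * 1000 }
end

out    = StringIO.new
writer = CSVWriter.new(out)
record = [["Part Code", "4.42"], ["Description", "RIBS, PORK"],
          ["Qty Period", "1,234"], ["Lbs Period", "12k"]]   # invented sample values

writer.puts(record.map { |name, value| [name, clean_number(value)] })
print out.string
```

Because the writer only needs puts(), it slots into Munger as the output object exactly like a file or $stdout.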
29. So what can I do with all this?
• Output your data into a spreadsheet
such as Excel
• Open the data in your text editor
• Import the data into a database
• Let’s see it in action