This document discusses techniques for visualizing high-dimensional data, including t-Distributed Stochastic Neighbor Embedding (t-SNE). It describes how t-SNE was used to visualize molecular data with thousands of features in a winning entry to a Kaggle and Merck visualization challenge, with the map colored by activity and by time. It also discusses the limitations of a single map and introduces multiple maps t-SNE, which gives each object a point in several maps to better model relationships between different concepts.
3. Data visualization
• What can we do to visualize Big Data that has lots of variables?
• Make a scatter plot in which each point corresponds to a measurement
• Arrange the points such that nearby points model similar measurements
• How do we determine the locations of the points in the map?
• Techniques for dimension reduction, multidimensional scaling, or embedding
5. Embedding
• The input of an embedding algorithm is:
• Collection of high-dimensional data points or...
• Collection of pairwise (dis)similarities (a distance table)
• The output of an embedding algorithm is:
• Collection of low-dimensional data points (a map)
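The second kind of input, a distance table, can be derived from the first. A minimal sketch, assuming Euclidean distance as the dissimilarity (the slides do not prescribe one); the function name and the three sample measurements are made up for illustration:

```python
import math

def distance_table(points):
    """Return the symmetric table of pairwise Euclidean distances."""
    n = len(points)
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            table[i][j] = table[j][i] = d
    return table

# Three hypothetical measurements with four variables each.
points = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [3.0, 4.0, 12.0, 0.0],
]
table = distance_table(points)
for row in table:
    print(row)
```

An embedding algorithm that accepts a distance table never needs the original coordinates, which is why data that only exists as (dis)similarities can also be mapped.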
6. Principal components analysis
• Principal Components Analysis maps the data onto a linear subspace, such that
the variance of the projected data is maximized:
$w^\top x$
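To make the projection w^T x concrete, here is a minimal sketch that finds the first principal direction w by power iteration on the covariance matrix. Power iteration is a choice made for this sketch because it needs no libraries; it is not taken from the slides. The function name, the starting vector, and the toy data are assumptions.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (mean, w), where w is a unit vector of maximal projected variance."""
    n, d = len(data), len(data[0])
    mean = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in data]
    # Covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    # Arbitrary asymmetric start; power iteration fails if the start is
    # exactly orthogonal to the leading direction.
    w = [1.0 / (k + 1) for k in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w

# Toy data lying exactly on the line y = x.
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
mean, w = first_principal_component(data)

# Project the centered data onto w and measure the variance.
projections = [sum(w[k] * (row[k] - mean[k]) for k in range(len(w)))
               for row in data]
variance = sum(p * p for p in projections) / len(projections)
print(w, variance)
```

Because the toy data lies on a line, w points along that line and the projection keeps all of the variance; real data would lose whatever variance lies outside the chosen subspace.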
14. Scaling up t-SNE
• Interpret evaluating the t-SNE gradient as simulating an N-body system
• Use a Barnes-Hut algorithm to approximate the t-SNE gradient in O(N log N)
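The idea that makes a Barnes-Hut approximation cheap is that a group of distant points can be replaced by one summary at their center of mass. The sketch below illustrates only that summary step for a sum of Student-t kernel values (1 + squared distance)^-1; it is not a full Barnes-Hut implementation. The tree construction and the rule for deciding when a cell is far enough away are omitted, and all names and coordinates are illustrative assumptions.

```python
def kernel(a, b):
    """Student-t kernel (1 + squared distance)^-1 between two map points."""
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_sum(point, cell_points):
    """Sum the kernel over every point in the cell: one term per point."""
    return sum(kernel(point, p) for p in cell_points)

def summarized_sum(point, cell_points):
    """Approximate the same sum with one kernel evaluation at the cell's
    center of mass, scaled by the number of points in the cell."""
    n = len(cell_points)
    center = [sum(p[k] for p in cell_points) / n for k in range(len(point))]
    return n * kernel(point, center)

# A query point and a hypothetical cell of points far away from it.
point = [0.0, 0.0]
far_cell = [[10.0, 10.0], [10.2, 9.9], [9.8, 10.1], [10.1, 10.2]]

exact = exact_sum(point, far_cell)
approx = summarized_sum(point, far_cell)
relative_error = abs(exact - approx) / exact
print(exact, approx, relative_error)
```

The summary is accurate here because the cell is small compared with its distance from the query point; for nearby cells a tree-based method would descend into the cell's children instead.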
15. Scaling up t-SNE
• Scale up t-SNE to large data sets (MNIST, N = 70K; T = 10m):
[Figure: t-SNE map of MNIST, with a legend for the digit classes 0 to 9]
van der Maaten, 2013
16. Scaling up t-SNE
• Even to data sets with millions of data points (TIMIT, N = 1.1M; T = 3h 40m):
17. So how did you win 2000 bucks in an hour?
• Kaggle and Merck hosted a molecular activity visualization challenge:
• Features derived from molecules’ chemical structure
• Each molecule also has an activity value
• The data distribution somehow changes over time
• Visualize features using t-SNE, and color according to activity and time
18. Merck visualization (1)
[Figure: two maps of data set #8, one colored by activity and one colored by time]
19. Merck visualization (2)
[Figure: two maps of data set #8, one colored by activity and one colored by time]
21. Limitations of using a single map
• Suppose we are visualizing words based on association data, or authors
based on co-authorships, or Enron emails, or scale-free networks, etc.
• How can we model the words “river”, “bank”, and “bailout” in a single map?
RIVER
BANK
BAILOUT
22. Multiple maps t-SNE
• Construct multiple maps, and give each object a point in each map
• Assign an importance weight to each point
• Define the similarity between two points under the multiple maps model as a
weighted sum over the similarities in the individual maps
Map 1: RIVER (weight 1), BANK (weight ½)
Map 2: BAILOUT (weight 1), BANK (weight ½)
van der Maaten & Hinton, MLJ 2012
23. Multiple maps t-SNE
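• Definition of similarity under multiple maps model:
$$q_{j|i} = \frac{\sum_{m} \pi_i^{(m)} \pi_j^{(m)} \left(1 + \|y_i^{(m)} - y_j^{(m)}\|^2\right)^{-1}}{\sum_{m'} \sum_{k \neq i} \pi_i^{(m')} \pi_k^{(m')} \left(1 + \|y_i^{(m')} - y_k^{(m')}\|^2\right)^{-1}}$$
• Herein, we define the importance weights as:
$$\pi_i^{(m)} = \frac{\exp(w_i^{(m)})}{\sum_{m'} \exp(w_i^{(m')})}$$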
• All map coordinates and importance weights are learned jointly
van der Maaten & Hinton, 2012
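The two definitions on this slide can be evaluated directly for a tiny example. A minimal sketch, assuming the importance weights are a softmax over maps of unconstrained parameters w and that map points are 2-D; the coordinates and parameter values for RIVER, BANK, and BAILOUT are made up, and nothing here is learned as it would be in the actual method.

```python
import math

def importance_weights(w):
    """Softmax over maps: w[m][i] -> pi[m][i], summing to 1 over m for each i."""
    n_maps, n = len(w), len(w[0])
    pi = [[0.0] * n for _ in range(n_maps)]
    for i in range(n):
        total = sum(math.exp(w[m][i]) for m in range(n_maps))
        for m in range(n_maps):
            pi[m][i] = math.exp(w[m][i]) / total
    return pi

def kernel(a, b):
    """Student-t kernel (1 + squared distance)^-1 between two map points."""
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(a, b)))

def conditional_similarity(maps, pi, i, j):
    """q_{j|i}: weighted sum over maps, normalized over all k != i."""
    def weighted(k):
        return sum(pi[m][i] * pi[m][k] * kernel(maps[m][i], maps[m][k])
                   for m in range(len(maps)))
    n = len(maps[0])
    return weighted(j) / sum(weighted(k) for k in range(n) if k != i)

words = ["RIVER", "BANK", "BAILOUT"]
# Hypothetical coordinates: maps[m][i] is the point of word i in map m.
maps = [
    [[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]],  # map 1: RIVER near BANK
    [[5.0, 5.0], [0.0, 0.0], [0.5, 0.0]],  # map 2: BANK near BAILOUT
]
# Hypothetical parameters: RIVER lives in map 1, BAILOUT in map 2,
# and BANK is split evenly between both maps.
w = [
    [5.0, 0.0, -5.0],
    [-5.0, 0.0, 5.0],
]
pi = importance_weights(w)
q = [[conditional_similarity(maps, pi, i, j) if i != j else 0.0
      for j in range(len(words))] for i in range(len(words))]
for word, row in zip(words, q):
    print(word, [round(v, 4) for v in row])
```

In this toy setup BANK is about equally similar to RIVER and to BAILOUT, while RIVER and BAILOUT stay dissimilar to each other, which is the kind of non-transitive structure that a single map struggles to represent.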
24. [Figure: map of words in which nearby words are related; visible groups include places (SITE, LOCATION, AREA), sports (FOOTBALL, BASEBALL, TEAM), games (POKER, CHESS, CARDS), clothing (SHIRT, JACKET, JEANS), age and appearance (OLD, YOUTH, BEAUTIFUL), and government (CONGRESS, SENATE, PRESIDENT)]
25. [Figure: map of words in which nearby words are related; visible groups include age and appearance (OLD, YOUTH, BEAUTIFUL), government (CONGRESS, SENATE, PRESIDENT), royalty (KING, QUEEN, THRONE), doors and locks (DOOR, KEY, HINGE), and distance (FAR, DISTANT, AWAY)]
26. I want to give this stuff a try!
• Type “t-SNE” into Google, and click the first link
• You’ll find papers, examples, and implementations (in Matlab, Python, R, and C++)
• You can also drop me a line: lvdmaaten@gmail.com