SlideShare a Scribd company logo
1 of 81
Download to read offline
A importância dos dados em sua
arquitetura… uma visão muito além
do SQL Server

Alexandre Porcelli - @porcelli
Gleicon Moraes - @gleicon
Alexandre Porcelli
                                                                            Creator &
                                                                             Dictator




Alexandre Porcelli
Writer


                     Alexandre Porcelli
                     Organizer

                                          Alexandre Porcelli
                                          Commiter / Parser Developer


                                                                          Alexandre Porcelli
                                                                          Core Developer / API Designer
Gleicon Moraes

http://zenmachine.wordpress.com
     http://github.com/gleicon
              @gleicon
existe um mundo
além do...
ou do...
além, até mesmo
dos...
inclusive do...
próximo do
mundo sombrio
do...
nosql
uma
nova
escola
contexto
falta de capital
big data
história...
modelos
  • Hierarchical (IMS): late 1960’s and 1970’s
  • Directed graph (CODASYL): 1970’s
  • Relational: 1970’s and early 1980’s
  • Entity-Relationship: 1970’s
  • Extended Relational: 1980’s
  • Semantic: late 1970’s and 1980’s
  • Object-oriented: late 1980’s and early 1990’s
  • Object-relational: late 1980’s and early 1990’s
  • Semi-structured (XML): late 1990’s to late 2000’s
  • The next big thing: ???




             ref: What Goes Around Comes Around por Michael Stonebraker e Joey Hellerstein
next big thing?
definição...
abaixo ao
 banco de
  dados
relacional!
abaixo ao banco de
  dados relacional!

como bala
 de prata!
momento
histórico...
resolver
problemas
específicos
estrutura
de dados
chave-valor
modelo
família de colunas
modelo
                              Keyspace

                          Família de Colunas

                                linha
  chave
          coluna     coluna     coluna coluna          coluna . . .   coluna

                                   .
                                   .
                                   .

                                linha
  chave
          coluna          coluna               coluna        ...      coluna




                                Coluna

                   nome   timestamp            valor
documento
modelo
grafo
visão geral
arquitetura
Architectural Anti Patterns
Notes on Data Distribution and Handling Failures
Fail
Anti Patterns

•   Evolution from SQL Anti Patterns (NoSQL:br May 2010)
•   More than just RDBMS
•   Large volumes of data
•   Distribution
•   Architecture
•   Research on other tools
•   Message Queues, DHT, Job Schedulers, NoSQL
•   Indexing, Map/Reduce
•   New revision since QConSP 2010: included Hierarchical
    Sharding, Embedded lists and Distributed Global Locking
RDBMS Anti Patterns
Not all things fit on a relational database, single ou distributed

•   The eternal table-as-a-tree
•   Dynamic table creation
•   Table as cache, queue, log file
•   Stoned Procedures
•   Row Alignment
•   Extreme JOINs
•   Your scheme must be printed in an A3 sheet.
•   Your ORM issue full queries for Dataset iterations
•   Hierarchical Sharding
•   Embedded lists
•   Distributed global locking
•   Throttle Control
The eternal tree
Problem: Most threaded discussion example uses something
like a table which contains all threads and answers, relating to
each other by an id. Usually the developer will come up with
his own binary-tree version to manage this mess.

id - parent_id -author - text
1 - 0 - gleicon - hello world
2 - 1 - elvis - shout !

Alternative: Document storage:
{ thread_id:1, title: 'the meeting', author: 'gleicon', replies:[
     {
       'author': elvis, text:'shout', replies:[{...}]
     }
   ]
}
Dynamic table creation
Problem: To avoid huge tables, one must come with a
"dynamic schema". For example, lets think about a document
management company, which is adding new facilities over the
country. For each storage facility, a new table is created:

item_id - row - column - stuff
1 - 10 - 20 - cat food
2 - 12 - 32 - trout

Now you have to come up with "dynamic queries", which will
probably query a "central storage" table and issue a huge join
to check if you have enough cat food over the country.

Alternatives:
- Document storage, modeling a facility as a document
- Key/Value, modeling each facility as a SET
Table as cache
Problem: Complex queries demand that a result be stored in a
separated table, so it can be queried quickly. Worst than views


Alternatives:

- Really ?

- Memcached

- Redis + AOF + EXPIRE

- De-normalization
Table as queue
Problem: A table which holds messages to be completed.
Worse, they must be ordered by
time of creation.

Corolary: Job Scheduler table

Alternatives:
- RestMQ, Resque

- Any other message broker

- Redis (LISTS - LPUSH + RPOP)

- Use the right tool
Table as log file
Problem: A table in which data gets written as a log file. From
time to time it needs to be purged. Truncating this table once
a day usually is the first task assigned to new DBAs.

Alternative:

- MongoDB capped collection

- Redis, and RRD pattern

- RIAK
Stoned procedures
Problem: Stored procedures hold most of your applications
logic. Also, some triggers are used to - well - trigger important
data events.

SP and triggers has the magic property of vanishing of our
memories and being impossible to keep versioned.

Alternative:
- Now be careful so you dont use map/reduce as modern
stoned procedures. Unfit for real time search/processing

- Use your preferred language for business stuff, and let event
handling to pub/sub or message queues.
Row Alignment
Problem: Extra rows are created but not used, just in case.
Usually they are named as a1, a2, a3, a4 and called padding.

There's good will behind that, specially when version 1 of the
software needed an extra column in a 150M lines database
and it took 2 days to run an ALTER TABLE. But that's no
excuse.

Alternative:

- Quit being cheap. Quit feeling 'hacker' about padding

- Document based databases as MongoDB and CouchDB, has
no schema. New atributes are local to the document and can
be added easily.
Extreme JOINs
Problem: Business stuff modeled as tables. Table inheritance
(Product -> SubProduct_A). To find the complete data for a
user plan, one must issue gigantic queries with lots of JOINs.

Alternative:

- Document storage, as MongoDB
  might help having important
  information together.

- De-normalization

- Serialized objects
Your scheme fits in an A3 sheet
Problem: Huge data schemes are difficult to manage. Extreme
specialization creates tables which converges to key/value
model. The normal form get priority over common sense.

Product_A       Product_B
id - desc       id - desc

Alternatives:

- De-normalization
- Another scheme ?
- Document store for flattening model
- Key/Value
- See 'Extreme JOINs'
Your ORM ...
Problem: Your ORM issue full queries for dataset iterations,
your ORM maps and creates tables which mimics your
classes, even the inheritance, and the performance is bad
because the queries are huge, etc, etc

Alternative:

- Apart from denormalization and good old common sense,
ORMs are trying to bridge two things with distinct impedance.

- There is nothing to relational models which maps cleanly to
classes and objects. Not even the basic unit which is the
domain(set) of each column. Black Magic ?
Hierarchical Sharding
Problem: Genius at work. Distinct databases inside a RDBMS
ranging from A to Z, each database has tables for users
starting with the proper letter. Each table has user data.
Fictional example: e-mail accounts management

> show databases;
abcdefghijklmnopqrstuwxz
> use a
> show tables;
...alberto alice alma ... (and a lot more)

There is no way to query anything in common for all users with
out application side processing. In this particular case this
sharding was uncalled for as relational databases have all
tools to deal with this particular case of 'different clients and
data'
Embedded Lists
Problem: As data complexity grows, one thinks that it's proper
for the application to handle different data structures
embedded in a single cell or row. The popular 'Let's use
commas to separe it'. You may find distinct separators as |, -, []
and so on.

> select group_field from that_email_sharded_database.user
"a@email1.net, b@email1.net,c@email2.net"

> select flags from stupid_email_admin where id = 2
1|0|1|1|0|

Either learn to model your data, or resort to model keys on K/V
stores. Or any other way to show up flags, as you are not
programming in C over a RDBMS hopefully.
Distributed Global Locking
Problem: Trying to emulate JAVA's synchronize in a distributed
manner. As there is no primitive architectura block to do that,
sounds like the proper place to do that would be a RDBMS.

May starts with a reference counter in a table and end up with
this:

> select COALESCE(GET_LOCK('my_lock',0 ),0 )

Plain and simple, you might find it embedded in a magic class
called DistributedSynchronize or ClusterSemaphore. Locks,
transactions and reference counters (which may act as soft
locks) doesn't belongs to the database. While its they use is
questionable even in code, the matter of fact is that you are
doing it wrong, if you are doing like that.
Throttle Control
Problem: To control and track access to a given resource, a
sequence of statements is issued, varying from a
update...select to a transaction block using a stored procedure:

> select count(access) from throttle_ctl where ip_addr like ...
> update .... or begin ... commit

Apart from having IP addresses stored as string, each request
would have to check on this block. It gets worse if throttle
control is mixed with a table-based access.

Using memcached (or any other k/v) as the data is ephemeral
would work as (after creating the entry and setting expire
time):

if (add 'IPADDR:YYYY:MM:DD:HH', 1) < your_limit: do_stuff()
ferramentas
noSQL
column
key-value             document   graph
             family
column
key-value             document   graph
             family
newSQL
noSQL   newSQL
cada escolha
    uma
  renúncia
padrões
how-to
acid
(
existe nosql
    acid
)
A word about ops
Meaningful data
• Email traffic accounting ~4500 msgs/sec in, ~2000 msgs
  out

• 10 Gb SPAM/NOT SPAM tokenized data per day

• 300 Gb logfile data/day

• 80 Billion DNS queries per day

• ~1k req/sec tcptable queries.

• 0.8 Pb of data migrated over mixed quality network.
  Planned 3mo, executed 6mo, online, on production.

• traffic from 400mb/s to 3.1Gb/s
Stuff to think about
Think if the data you use aren't de-normalized somewhere
(cached)

Most of the anti-patterns signals that there are architectural
issues instead of only database issues.

Call it NoSQL, non relational, or any other name, but assemble
a toolbox to deal with different data types.

Are you dependent on cache ? Does your application fails
when there is no warm cache ? Does it just slows down ?

Think about the way to put and to get back your data from the
database (be it SQL or NoSQL). Migrations are painful.
Stuff to think about
- Without operational requirements, it's a personal choice
about data organization
- Without time pressure, any migration is easy, regardless data
size.
- Out of production environment, events flow in a different time
flow
- 'Normal Accidents' (Charles Perrow) - how resilient is your
operation ? Are you ready to tackle your incidents ?
- Everything breaks under scale (Benjamin Black)
- "Picking up pennies in front of a steamroller" (Nassin N.
Taleb)
Perguntas?
Obrigado




github.com/porcelli                 github.com/gleicon

linkedin.com/in/alexandreporcelli   linkedin.com/in/gleicon

@porcelli                           @gleicon

porcelli.com.br                     zenmachine.wordpress.com

More Related Content

Viewers also liked

대전공의협의회 국회 발표 2015년 3월 12일
대전공의협의회 국회 발표 2015년 3월 12일대전공의협의회 국회 발표 2015년 3월 12일
대전공의협의회 국회 발표 2015년 3월 12일Kira Kira
 
Diary tanim
Diary tanimDiary tanim
Diary tanimlaolu720
 
Water conservation need of the hour
Water conservation   need of the hourWater conservation   need of the hour
Water conservation need of the hourAdane Nega
 
What's Trending Tuesday
What's Trending TuesdayWhat's Trending Tuesday
What's Trending TuesdayJosh Petersen
 
Lunch session p lakatos
Lunch session p lakatosLunch session p lakatos
Lunch session p lakatosAsszisztencia
 
Krysten\'s Poetry Anthology
Krysten\'s Poetry AnthologyKrysten\'s Poetry Anthology
Krysten\'s Poetry AnthologyRichard Lloyd
 

Viewers also liked (8)

대전공의협의회 국회 발표 2015년 3월 12일
대전공의협의회 국회 발표 2015년 3월 12일대전공의협의회 국회 발표 2015년 3월 12일
대전공의협의회 국회 발표 2015년 3월 12일
 
Diary tanim
Diary tanimDiary tanim
Diary tanim
 
Poster FINAL
Poster FINALPoster FINAL
Poster FINAL
 
Water conservation need of the hour
Water conservation   need of the hourWater conservation   need of the hour
Water conservation need of the hour
 
What's Trending Tuesday
What's Trending TuesdayWhat's Trending Tuesday
What's Trending Tuesday
 
Lunch session p lakatos
Lunch session p lakatosLunch session p lakatos
Lunch session p lakatos
 
Letmath
LetmathLetmath
Letmath
 
Krysten\'s Poetry Anthology
Krysten\'s Poetry AnthologyKrysten\'s Poetry Anthology
Krysten\'s Poetry Anthology
 

More from DNAD

LT 03 - Juan Lopes - Complexidade algoritmos
LT 03 - Juan Lopes - Complexidade algoritmosLT 03 - Juan Lopes - Complexidade algoritmos
LT 03 - Juan Lopes - Complexidade algoritmosDNAD
 
LT 05 - Ismael Apolinário - Importancia participacao cliente
LT 05 - Ismael Apolinário - Importancia participacao clienteLT 05 - Ismael Apolinário - Importancia participacao cliente
LT 05 - Ismael Apolinário - Importancia participacao clienteDNAD
 
LT 09 - Victor Cavalcante - Arquitetura não é só server side
LT 09 - Victor Cavalcante - Arquitetura não é só server sideLT 09 - Victor Cavalcante - Arquitetura não é só server side
LT 09 - Victor Cavalcante - Arquitetura não é só server sideDNAD
 
LT 08 - Guilherme Silveira - Cache hipermidia
LT 08 - Guilherme Silveira - Cache hipermidiaLT 08 - Guilherme Silveira - Cache hipermidia
LT 08 - Guilherme Silveira - Cache hipermidiaDNAD
 
LT 07 - Glauber de Almeida - DRY
LT 07 - Glauber de Almeida - DRYLT 07 - Glauber de Almeida - DRY
LT 07 - Glauber de Almeida - DRYDNAD
 
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbica
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbicaLT 06 - Douglas Aguiar - Quem nao se comunica se trumbica
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbicaDNAD
 
LT 02 - Rodrigo Kumpera - Rodando c sharp
LT 02 - Rodrigo Kumpera - Rodando c sharpLT 02 - Rodrigo Kumpera - Rodando c sharp
LT 02 - Rodrigo Kumpera - Rodando c sharpDNAD
 
LT 01 - Rodrigo Yoshima - Business vsarchitecture
LT 01 - Rodrigo Yoshima - Business vsarchitectureLT 01 - Rodrigo Yoshima - Business vsarchitecture
LT 01 - Rodrigo Yoshima - Business vsarchitectureDNAD
 
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnad
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnadLT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnad
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnadDNAD
 
02a - Leandro Daniel - Examinando a arquitetura evolucionária
02a -  Leandro Daniel - Examinando a arquitetura evolucionária02a -  Leandro Daniel - Examinando a arquitetura evolucionária
02a - Leandro Daniel - Examinando a arquitetura evolucionáriaDNAD
 
09 - Fábio Akita - Além do rails
09 - Fábio Akita - Além do rails09 - Fábio Akita - Além do rails
09 - Fábio Akita - Além do railsDNAD
 
07 - Osvaldo Daibert - Cenários para cache de dados distribuidos
07  - Osvaldo Daibert - Cenários para cache de dados distribuidos07  - Osvaldo Daibert - Cenários para cache de dados distribuidos
07 - Osvaldo Daibert - Cenários para cache de dados distribuidosDNAD
 
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?08 - Otavio Pecego - Arquitetura e nuvem: o que muda?
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?DNAD
 
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDDDNAD
 
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TIDNAD
 
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISSDNAD
 
02b - Elemar Jr. - Examinando a Arquitetura Evolucionária
02b  - Elemar Jr. - Examinando a Arquitetura Evolucionária02b  - Elemar Jr. - Examinando a Arquitetura Evolucionária
02b - Elemar Jr. - Examinando a Arquitetura EvolucionáriaDNAD
 
04 - Felipe Oliveira - Think Decoupled! (SOA)
04 - Felipe Oliveira - Think Decoupled! (SOA)04 - Felipe Oliveira - Think Decoupled! (SOA)
04 - Felipe Oliveira - Think Decoupled! (SOA)DNAD
 
01 - Giovanni Bassi - Keynote
01 - Giovanni Bassi - Keynote01 - Giovanni Bassi - Keynote
01 - Giovanni Bassi - KeynoteDNAD
 

More from DNAD (19)

LT 03 - Juan Lopes - Complexidade algoritmos
LT 03 - Juan Lopes - Complexidade algoritmosLT 03 - Juan Lopes - Complexidade algoritmos
LT 03 - Juan Lopes - Complexidade algoritmos
 
LT 05 - Ismael Apolinário - Importancia participacao cliente
LT 05 - Ismael Apolinário - Importancia participacao clienteLT 05 - Ismael Apolinário - Importancia participacao cliente
LT 05 - Ismael Apolinário - Importancia participacao cliente
 
LT 09 - Victor Cavalcante - Arquitetura não é só server side
LT 09 - Victor Cavalcante - Arquitetura não é só server sideLT 09 - Victor Cavalcante - Arquitetura não é só server side
LT 09 - Victor Cavalcante - Arquitetura não é só server side
 
LT 08 - Guilherme Silveira - Cache hipermidia
LT 08 - Guilherme Silveira - Cache hipermidiaLT 08 - Guilherme Silveira - Cache hipermidia
LT 08 - Guilherme Silveira - Cache hipermidia
 
LT 07 - Glauber de Almeida - DRY
LT 07 - Glauber de Almeida - DRYLT 07 - Glauber de Almeida - DRY
LT 07 - Glauber de Almeida - DRY
 
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbica
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbicaLT 06 - Douglas Aguiar - Quem nao se comunica se trumbica
LT 06 - Douglas Aguiar - Quem nao se comunica se trumbica
 
LT 02 - Rodrigo Kumpera - Rodando c sharp
LT 02 - Rodrigo Kumpera - Rodando c sharpLT 02 - Rodrigo Kumpera - Rodando c sharp
LT 02 - Rodrigo Kumpera - Rodando c sharp
 
LT 01 - Rodrigo Yoshima - Business vsarchitecture
LT 01 - Rodrigo Yoshima - Business vsarchitectureLT 01 - Rodrigo Yoshima - Business vsarchitecture
LT 01 - Rodrigo Yoshima - Business vsarchitecture
 
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnad
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnadLT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnad
LT 04 - Denis Ferrari - Como lidar com as dificuldades da primeira sprint - dnad
 
02a - Leandro Daniel - Examinando a arquitetura evolucionária
02a -  Leandro Daniel - Examinando a arquitetura evolucionária02a -  Leandro Daniel - Examinando a arquitetura evolucionária
02a - Leandro Daniel - Examinando a arquitetura evolucionária
 
09 - Fábio Akita - Além do rails
09 - Fábio Akita - Além do rails09 - Fábio Akita - Além do rails
09 - Fábio Akita - Além do rails
 
07 - Osvaldo Daibert - Cenários para cache de dados distribuidos
07  - Osvaldo Daibert - Cenários para cache de dados distribuidos07  - Osvaldo Daibert - Cenários para cache de dados distribuidos
07 - Osvaldo Daibert - Cenários para cache de dados distribuidos
 
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?08 - Otavio Pecego - Arquitetura e nuvem: o que muda?
08 - Otavio Pecego - Arquitetura e nuvem: o que muda?
 
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD
06 - Giovanni Bassi - CQS, CQRS, DDD, DbC, DDDD
 
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI
05 - Waldemir Cambiucci - Matriz de habilidades de um arquiteto TI
 
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS
02c - Vinicius Quaiato - Over Patternization, YAGNI, KISS
 
02b - Elemar Jr. - Examinando a Arquitetura Evolucionária
02b  - Elemar Jr. - Examinando a Arquitetura Evolucionária02b  - Elemar Jr. - Examinando a Arquitetura Evolucionária
02b - Elemar Jr. - Examinando a Arquitetura Evolucionária
 
04 - Felipe Oliveira - Think Decoupled! (SOA)
04 - Felipe Oliveira - Think Decoupled! (SOA)04 - Felipe Oliveira - Think Decoupled! (SOA)
04 - Felipe Oliveira - Think Decoupled! (SOA)
 
01 - Giovanni Bassi - Keynote
01 - Giovanni Bassi - Keynote01 - Giovanni Bassi - Keynote
01 - Giovanni Bassi - Keynote
 

03 - Porcelli e Gleicon - A importância dos dados em sua arquitetura… uma visão muito além do SQL Server

  • 1. A importância dos dados em sua arquitetura… uma visão muito além do SQL Server Alexandre Porcelli - @porcelli Gleicon Moraes - @gleicon
  • 2. Alexandre Porcelli Creator & Dictator Alexandre Porcelli Writer Alexandre Porcelli Organizer Alexandre Porcelli Commiter / Parser Developer Alexandre Porcelli Core Developer / API Designer
  • 3. Gleicon Moraes http://zenmachine.wordpress.com http://github.com/gleicon @gleicon
  • 4.
  • 10. nosql
  • 12.
  • 14.
  • 18. modelos • Hierarchical (IMS): late 1960’s and 1970’s • Directed graph (CODASYL): 1970’s • Relational: 1970’s and early 1980’s • Entity-Relationship: 1970’s • Extended Relational: 1980’s • Semantic: late 1970’s and 1980’s • Object-oriented: late 1980’s and early 1990’s • Object-relational: late 1980’s and early 1990’s • Semi-structured (XML): late 1990’s to late 2000’s • The next big thing: ??? ref: What Goes Around Comes Around por Michael Stonebraker e Joey Hellerstein
  • 21. abaixo ao banco de dados relacional!
  • 22. abaixo ao banco de dados relacional! como bala de prata!
  • 24.
  • 30. modelo Keyspace Família de Colunas linha chave coluna coluna coluna coluna coluna . . . coluna . . . linha chave coluna coluna coluna ... coluna Coluna nome timestamp valor
  • 33. grafo
  • 35.
  • 36.
  • 38. Architectural Anti Patterns Notes on Data Distribution and Handling Failures
  • 39. Fail
  • 40. Anti Patterns • Evolution from SQL Anti Patterns (NoSQL:br May 2010) • More than just RDBMS • Large volumes of data • Distribution • Architecture • Research on other tools • Message Queues, DHT, Job Schedulers, NoSQL • Indexing, Map/Reduce • New revision since QConSP 2010: included Hierarchical Sharding, Embedded lists and Distributed Global Locking
  • 41. RDBMS Anti Patterns Not all things fit on a relational database, single ou distributed • The eternal table-as-a-tree • Dynamic table creation • Table as cache, queue, log file • Stoned Procedures • Row Alignment • Extreme JOINs • Your scheme must be printed in an A3 sheet. • Your ORM issue full queries for Dataset iterations • Hierarchical Sharding • Embedded lists • Distributed global locking • Throttle Control
  • 42. The eternal tree Problem: Most threaded discussion example uses something like a table which contains all threads and answers, relating to each other by an id. Usually the developer will come up with his own binary-tree version to manage this mess. id - parent_id -author - text 1 - 0 - gleicon - hello world 2 - 1 - elvis - shout ! Alternative: Document storage: { thread_id:1, title: 'the meeting', author: 'gleicon', replies:[ { 'author': elvis, text:'shout', replies:[{...}] } ] }
  • 43. Dynamic table creation Problem: To avoid huge tables, one must come with a "dynamic schema". For example, lets think about a document management company, which is adding new facilities over the country. For each storage facility, a new table is created: item_id - row - column - stuff 1 - 10 - 20 - cat food 2 - 12 - 32 - trout Now you have to come up with "dynamic queries", which will probably query a "central storage" table and issue a huge join to check if you have enough cat food over the country. Alternatives: - Document storage, modeling a facility as a document - Key/Value, modeling each facility as a SET
  • 44. Table as cache Problem: Complex queries demand that a result be stored in a separated table, so it can be queried quickly. Worst than views Alternatives: - Really ? - Memcached - Redis + AOF + EXPIRE - De-normalization
  • 45. Table as queue Problem: A table which holds messages to be completed. Worse, they must be ordered by time of creation. Corolary: Job Scheduler table Alternatives: - RestMQ, Resque - Any other message broker - Redis (LISTS - LPUSH + RPOP) - Use the right tool
  • 46. Table as log file Problem: A table in which data gets written as a log file. From time to time it needs to be purged. Truncating this table once a day usually is the first task assigned to new DBAs. Alternative: - MongoDB capped collection - Redis, and RRD pattern - RIAK
  • 47. Stoned procedures Problem: Stored procedures hold most of your applications logic. Also, some triggers are used to - well - trigger important data events. SP and triggers has the magic property of vanishing of our memories and being impossible to keep versioned. Alternative: - Now be careful so you dont use map/reduce as modern stoned procedures. Unfit for real time search/processing - Use your preferred language for business stuff, and let event handling to pub/sub or message queues.
  • 48. Row Alignment Problem: Extra rows are created but not used, just in case. Usually they are named as a1, a2, a3, a4 and called padding. There's good will behind that, specially when version 1 of the software needed an extra column in a 150M lines database and it took 2 days to run an ALTER TABLE. But that's no excuse. Alternative: - Quit being cheap. Quit feeling 'hacker' about padding - Document based databases as MongoDB and CouchDB, has no schema. New atributes are local to the document and can be added easily.
  • 49. Extreme JOINs Problem: Business stuff modeled as tables. Table inheritance (Product -> SubProduct_A). To find the complete data for a user plan, one must issue gigantic queries with lots of JOINs. Alternative: - Document storage, as MongoDB might help having important information together. - De-normalization - Serialized objects
  • 50. Your scheme fits in an A3 sheet Problem: Huge data schemes are difficult to manage. Extreme specialization creates tables which converges to key/value model. The normal form get priority over common sense. Product_A Product_B id - desc id - desc Alternatives: - De-normalization - Another scheme ? - Document store for flattening model - Key/Value - See 'Extreme JOINs'
  • 51. Your ORM ... Problem: Your ORM issue full queries for dataset iterations, your ORM maps and creates tables which mimics your classes, even the inheritance, and the performance is bad because the queries are huge, etc, etc Alternative: - Apart from denormalization and good old common sense, ORMs are trying to bridge two things with distinct impedance. - There is nothing to relational models which maps cleanly to classes and objects. Not even the basic unit which is the domain(set) of each column. Black Magic ?
  • 52. Hierarchical Sharding Problem: Genius at work. Distinct databases inside a RDBMS ranging from A to Z, each database has tables for users starting with the proper letter. Each table has user data. Fictional example: e-mail accounts management > show databases; abcdefghijklmnopqrstuwxz > use a > show tables; ...alberto alice alma ... (and a lot more) There is no way to query anything in common for all users with out application side processing. In this particular case this sharding was uncalled for as relational databases have all tools to deal with this particular case of 'different clients and data'
  • 53. Embedded Lists Problem: As data complexity grows, one thinks that it's proper for the application to handle different data structures embedded in a single cell or row. The popular 'Let's use commas to separe it'. You may find distinct separators as |, -, [] and so on. > select group_field from that_email_sharded_database.user "a@email1.net, b@email1.net,c@email2.net" > select flags from stupid_email_admin where id = 2 1|0|1|1|0| Either learn to model your data, or resort to model keys on K/V stores. Or any other way to show up flags, as you are not programming in C over a RDBMS hopefully.
  • 54. Distributed Global Locking Problem: Trying to emulate JAVA's synchronize in a distributed manner. As there is no primitive architectura block to do that, sounds like the proper place to do that would be a RDBMS. May starts with a reference counter in a table and end up with this: > select COALESCE(GET_LOCK('my_lock',0 ),0 ) Plain and simple, you might find it embedded in a magic class called DistributedSynchronize or ClusterSemaphore. Locks, transactions and reference counters (which may act as soft locks) doesn't belongs to the database. While its they use is questionable even in code, the matter of fact is that you are doing it wrong, if you are doing like that.
  • 55. Throttle Control Problem: To control and track access to a given resource, a sequence of statements is issued, varying from a update...select to a transaction block using a stored procedure: > select count(access) from throttle_ctl where ip_addr like ... > update .... or begin ... commit Apart from having IP addresses stored as string, each request would have to check on this block. It gets worse if throttle control is mixed with a table-based access. Using memcached (or any other k/v) as the data is ephemeral would work as (after creating the entry and setting expire time): if (add 'IPADDR:YYYY:MM:DD:HH', 1) < your_limit: do_stuff()
  • 57. noSQL
  • 58. column key-value document graph family
  • 59. column key-value document graph family
  • 60.
  • 62.
  • 63.
  • 64.
  • 65. noSQL newSQL
  • 66.
  • 67.
  • 68. cada escolha uma renúncia
  • 71.
  • 72. acid
  • 73. (
  • 74. existe nosql acid
  • 75. )
  • 77. Meaningful data • Email traffic accounting ~4500 msgs/sec in, ~2000 msgs out • 10 Gb SPAM/NOT SPAM tokenized data per day • 300 Gb logfile data/day • 80 Billion DNS queries per day • ~1k req/sec tcptable queries. • 0.8 Pb of data migrated over mixed quality network. Planned 3mo, executed 6mo, online, on production. • traffic from 400mb/s to 3.1Gb/s
  • 78. Stuff to think about Think if the data you use aren't de-normalized somewhere (cached) Most of the anti-patterns signals that there are architectural issues instead of only database issues. Call it NoSQL, non relational, or any other name, but assemble a toolbox to deal with different data types. Are you dependent on cache ? Does your application fails when there is no warm cache ? Does it just slows down ? Think about the way to put and to get back your data from the database (be it SQL or NoSQL). Migrations are painful.
  • 79. Stuff to think about - Without operational requirements, it's a personal choice about data organization - Without time pressure, any migration is easy, regardless data size. - Out of production environment, events flow in a different time flow - 'Normal Accidents' (Charles Perrow) - how resilient is your operation ? Are you ready to tackle your incidents ? - Everything breaks under scale (Benjamin Black) - "Picking up pennies in front of a steamroller" (Nassin N. Taleb)
  • 81. Obrigado github.com/porcelli github.com/gleicon linkedin.com/in/alexandreporcelli linkedin.com/in/gleicon @porcelli @gleicon porcelli.com.br zenmachine.wordpress.com