BBS Crawler for Taiwan

bsdconv + pyte + telnetlib

by Buganini @ PyHUG
Sep. 2012

Obstacles
●   Big5/UAO
●   Segmented Big5
●   Control Sequence
●   Ambiguous Width

Big5/UAO

Big5 variants in use:
    Gov.tw:   BIG5-2003
    Windows:  CP950
    Libiconv: BIG5(?), CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001,
              BIG5-HKSCS:1999, BIG5-2003 (experimental)
    Mozilla:  UAO 2.41
    BBS:      UAO 2.50(?)
    etc.      ref: http://moztw.org/docs/big5/

UAO
    == Unicode At Once
    == Unicode 補完計畫
    != Unicode

UAO
    is an extended Big5 (by using the PUA),
    including Chinese (trad/sim/hk), Japanese, Cyrillic

    Ex: 喆 (95ED), 轮 (8879), Я (C854), か (C6F1)
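
To see how the variants can disagree, the same bytes can be decoded with several Big5-family codecs. The sketch below is only an illustration: it uses the codecs bundled with Python 3 (big5, cp950, big5hkscs), none of which is UAO, and the helper name try_codecs is hypothetical. No particular result is claimed for UAO-only codes such as 95ED; depending on the codec they may fail or decode to a different character.

```python
# Sketch: decode the same bytes with several Big5-family codecs.
# Assumption: Python 3 standard codecs only; UAO itself is not among them.

CODECS = ("big5", "cp950", "big5hkscs")


def try_codecs(raw, codecs=CODECS):
    """Return {codec name: decoded text, or None if the bytes are invalid there}."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results
```

A code point shared by all variants, such as AE E1 (桑), decodes identically everywhere; calling try_codecs(b"\x95\xed") shows what each codec does with a code that only UAO defines as 喆.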

Segmented Big5

One two-byte character:
    \xAE\xE1

The same character, split by a control sequence:
    \xAE
    \x1B[1;33m
    \xE1

[Figure: PCMAN vs. Standard Tool]
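
The reordering a client has to perform on segmented Big5 can be sketched in plain Python. This is not bsdconv's big5-defrag implementation, only an illustration under simplifying assumptions: every byte in 0x81..0xFE starts a two-byte character, and only CSI sequences (ESC [ ... final byte) can sit between its two bytes. The name defrag is hypothetical.

```python
import re

# CSI sequence: ESC [ parameter bytes, intermediate bytes, one final byte.
CSI = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def defrag(data):
    """Move control sequences that split a two-byte character to just after it."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if 0x81 <= byte <= 0xFE and i + 1 < n:
            # Lead byte: collect any control sequences before the trail byte.
            j = i + 1
            held = bytearray()
            match = CSI.match(data, j)
            while match:
                held += match.group(0)
                j = match.end()
                match = CSI.match(data, j)
            out.append(byte)
            if j < n:
                out.append(data[j])  # trail byte rejoins its lead byte
                j += 1
            out += held              # control sequences follow the character
            i = j
            continue
        match = CSI.match(data, i)
        if match:
            out += match.group(0)
            i = match.end()
        else:
            out.append(byte)
            i += 1
    return bytes(out)
```

With the example bytes, defrag(b"\xae\xe1\xae\x1b[1;33m\xe1") returns b"\xae\xe1\xae\xe1\x1b[1;33m", which a plain Big5 decoder can then handle.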

Control Sequence / Ambiguous Width

    08 08 20 20   ← ← SP SP
    08 08 0a      ← ← ↓
    e2 97 8f      ●
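
Why cursor control and ambiguous width interact can be shown with a toy one-line screen. One plausible reading of the bytes above is that the server draws ● and later erases it with two backspaces and two spaces, which only works if ● occupies two columns. The sketch below is an assumption-laden illustration, not pyte: it handles printable characters and backspace only, and the names cell_width and render_line are hypothetical.

```python
import unicodedata


def cell_width(ch, ambiguous=2):
    """Columns occupied by ch; East Asian Width 'A' (ambiguous) is configurable."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        return ambiguous
    return 1


def render_line(text, ambiguous=2, columns=10):
    """Toy single-line screen: printable characters and backspace only."""
    cells = [" "] * columns
    col = 0
    for ch in text:
        if ch == "\x08":                 # BS moves the cursor one column left
            col = max(0, col - 1)
            continue
        width = cell_width(ch, ambiguous)
        cells[col] = ch
        for extra in range(1, width):    # remaining columns of a wide character
            cells[col + extra] = ""
        col += width
    return "".join(cells).rstrip()
```

With ambiguous=2, render_line(">●\x08\x08  ") leaves ">" on the line; with ambiguous=1 the same bytes also wipe out the ">" in front of the circle.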

Obstacles: not anymore…

●   Big5/UAO, Segmented Big5:            solved in bug5, using bsdconv
●   Ambiguous Width, Control Sequence:   solved, using pyte

https://github.com/buganini/bug5
https://github.com/buganini/bsdconv
https://github.com/selectel/pyte

bsdconv (1/4)

import bsdconv

bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")

Bsdconv Internal Prefix:
    03: Byte
    1B: ANSI Control Sequence

Input:
    \xAE\xE1\xAE\x1B[1;33m\xE1
    AE E1 AE 1B 5B 31 3B 33 33 6D E1

After "ansi-control,byte":
    03AE 03E1 03AE 1B5B313B33336D 03E1

After "big5-defrag":
    03AE 03E1 03AE 03E1 1B5B313B33336D

After "byte,ansi-control":
    AE E1 AE E1 1B 5B 31 3B 33 33 6D

After "ansi-control|skip,big5":
    016851 016851 1B5B313B33336D          # U+6851 == 桑

bsdconv (2/4)

>>> c = bsdconv.Bsdconv("ansi-control,byte:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
1B5B313B33336D ( FREE )
03E1

>>> c = bsdconv.Bsdconv("ansi-control,byte:big5-defrag:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
03AE
03E1
03AE
03E1
1B5B313B33336D ( FREE )

Bsdconv Internal Prefix:
    03: Byte
    1B: ANSI Control Sequence

bsdconv (3/4)

>>> c = bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|pass:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
AE
E1
AE
E1
1B5B313B33336D ( FREE SKIP )

>>> c = bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:bsdconv_stdout")
>>> c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")
016851
016851
1B5B313B33336D ( FREE )

Bsdconv Internal Prefix:
    01: Unicode
    1B: ANSI Control Sequence

# U+6851 == 桑

bsdconv (4/4)

>>> c = bsdconv.Bsdconv("ansi-control,byte:big5-defrag:byte,ansi-control|skip,big5:utf-8,bsdconv_raw")
>>> s = c.conv("\xAE\xE1\xAE\x1B[1;33m\xE1")

>>> s
'\xe6\xa1\x91\xe6\xa1\x91\x1b[1;33m'

>>> s.decode("utf-8")
u'\u6851\u6851\x1b[1;33m'

# U+6851 == 桑

pyte (1/2): Python Terminal Emulator

import pyte

stream = pyte.Stream()
screen = pyte.Screen(80, 24)
# Turn off LNM (line feed / new line mode); see the note below.
screen.mode.discard(pyte.modes.LNM)
stream.attach(screen)

seq = SEQUENCE_FROM_SERVER
# c is the bsdconv converter from the previous slides (Big5 in, UTF-8 out).
useq = c.conv(seq)
stream.feed(useq.decode("utf-8"))

RESULT_SCREEN = "\n".join(screen.display).encode("utf-8")

With pyte.modes.LNM:
    \r → CR+LF (Carriage Return + Line Feed)
Without pyte.modes.LNM:
    \r → CR
pyte           (2/2)
                                   #Ambiguous Width
screens.py

width_counter=bsdconv.Bsdconv("utf-8:width:null")
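
What a width counter has to report can be approximated with the standard library. The sketch below is not bsdconv's width codec: it classifies characters by Unicode East Asian Width via unicodedata, the labels FULL, HALF and AMBI and the function names are chosen for this sketch, and combining or control characters are simply counted as HALF, which is a simplification.

```python
import unicodedata
from collections import Counter


def count_widths(text):
    """Tally characters as FULL, HALF or AMBI by East Asian Width."""
    tally = Counter()
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            tally["FULL"] += 1
        elif eaw == "A":
            tally["AMBI"] += 1
        else:
            tally["HALF"] += 1
    return tally


def display_width(text, ambiguous=2):
    """Columns needed for text when ambiguous characters take `ambiguous` columns."""
    tally = count_widths(text)
    return 2 * tally["FULL"] + ambiguous * tally["AMBI"] + tally["HALF"]
```

For "桑●a" this gives one FULL, one AMBI and one HALF character, so the string needs 5 columns when ambiguous characters are treated as wide and 4 when they are treated as narrow.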

telnetlib (1/3)

What's wrong with read_until/expect?
  What telnetlib does:
    Server → telnetlib connection → telnetlib.read_until
  What I need:
    Server → telnetlib connection → bsdconv → telnetlib.read_until / regular expression

Solutions:
  a) Implement bsdconv → telnetlib.read_until (current)
  b) Hack telnetlib (maybe cleaner)
  c) Another telnetlib implementation?
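
Option (a) amounts to matching against the converted text rather than the raw bytes. A minimal sketch of that idea follows, under stated assumptions: Python's incremental big5 decoder stands in for bsdconv, read_chunk is a hypothetical callable standing in for something like conn.read_very_eager, and the name read_until_converted is made up for this example.

```python
import codecs
import re
import time


def read_until_converted(read_chunk, pattern, timeout=5.0, encoding="big5"):
    """Read raw chunks, decode them incrementally, stop once pattern matches.

    read_chunk: callable returning the next bytes chunk (b"" if nothing is ready).
    Returns (match or None, all text decoded so far).
    """
    # The incremental decoder keeps a lead byte that arrives at the end of one
    # chunk until its trail byte arrives in the next chunk.
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    regex = re.compile(pattern)
    text = ""
    deadline = time.monotonic() + timeout
    while True:
        chunk = read_chunk()
        text += decoder.decode(chunk)
        match = regex.search(text)
        if match or time.monotonic() >= deadline:
            return match, text
        if not chunk:
            time.sleep(0.01)  # nothing arrived yet; avoid a busy loop
```

Because decoding is incremental, a character whose two bytes arrive in different chunks is still decoded correctly before the regular expression is applied.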
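
telnetlib (2/3)
# Deal with lagging/noop

import time

def term_comm(feed=None, wait=None):
    # conn (telnetlib connection), conv (bsdconv converter), stream and
    # screen (pyte) are set up elsewhere.
    if feed is not None:
        conn.write(feed)
        if wait:
            # Blocking read: wait until the server sends something.
            s = conn.read_some()
            s = conv.conv_chunk(s)
            stream.feed(s.decode("utf-8"))
    if wait is not False:
        # Non-blocking read: pause briefly, then take whatever has arrived.
        time.sleep(0.1)
        s = conn.read_very_eager()
        s = conv.conv_chunk(s)
        stream.feed(s.decode("utf-8"))
    # Return the current screen contents as UTF-8 text.
    ret = "\n".join(screen.display).encode("utf-8")
    return ret

Reading        Feed            No Feed
wait=None      Non-blocking    Non-blocking
wait=True      Blocking        Non-blocking (unused)
wait=False     No              No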

telnetlib (3/3)
# Deal with lagging/noop

Action with or without screen refresh
    term_comm('Action A', False)
    term_comm('Action B', True)
    # Action A+B cause screen refresh

Action with screen refresh (important content)
    term_comm('Action', True)

Action with screen refresh
    term_comm('Action')

Wait+Retry
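
The slide names "Wait+Retry" without showing code. One possible reading is a loop that refreshes the screen until the expected text shows up. The helper below is a hypothetical sketch, not the author's code: wait_for and its parameters are invented for illustration, and refresh stands in for a call such as term_comm() with no arguments.

```python
import time


def wait_for(refresh, expected, retries=10, delay=0.1):
    """Call refresh() until its result contains expected, or retries run out.

    refresh: callable returning the current screen text.
    Returns the screen text on success, None otherwise.
    """
    for _ in range(retries):
        screen_text = refresh()
        if expected in screen_text:
            return screen_text
        time.sleep(delay)  # give a lagging server time to redraw
    return None
```

A caller would send the action first and then poll, treating a None result as a signal to resend or give up.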
- Demo -
- End -
