# HTML5-PHP

This is a **highly experimental** HTML5 Parser.

The need for an HTML5 parser in PHP is clear. This project began as a
continuation of the seemingly abandoned `html5lib` project
([original source](https://code.google.com/p/html5lib/source/checkout)),
but after some initial refactoring work we started a new parser with the
following goals:

- An HTML5 serializer [in progress; early alpha]
- Support for PHP namespaces [done]
- Composer support [in progress]
- Event-based (SAX-like) parser [in progress]
- DOM tree builder [in progress]
- Interoperability with QueryPath [not started]

## Basic Usage

HTML5-PHP has a high-level API and a low-level API. 

Here is how you use the high-level `HTML5` library API:

```php
<?php
// Assuming you installed from Composer:
require "vendor/autoload.php";


// An example HTML document:
$html = <<< 'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$dom = HTML5::parse($html);

// Render it as HTML5:
print HTML5::saveHTML($dom);

// Or save it to a file:
HTML5::save($dom, 'out.html');

?>
```

The `$dom` created by the parser is a full `DOMDocument` object, and the
`save()` and `saveHTML()` methods accept any `DOMDocument`.
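
Because the result is an ordinary `DOMDocument`, PHP's built-in DOM API should work on it unchanged. The sketch below builds a small document by hand as a stand-in for the parser's output, so it runs without this library installed; in real use `$dom` would presumably come from `HTML5::parse()`.

```php
<?php
// Stand-in for the parser's output: a hand-built DOMDocument.
// (Assumption: in real use, $dom would come from HTML5::parse().)
$dom = new DOMDocument('1.0', 'UTF-8');
$html = $dom->appendChild($dom->createElement('html'));
$body = $html->appendChild($dom->createElement('body'));
$body->setAttribute('id', 'foo');
$body->appendChild($dom->createElement('h1', 'Hello World'));
$body->appendChild($dom->createElement('p', 'This is a test of the HTML5 parser.'));

// Standard DOM lookups work as usual.
$heading = $dom->getElementsByTagName('h1')->item(0)->textContent;

// So does XPath, via PHP's built-in DOMXPath class.
$xpath = new DOMXPath($dom);
$bodyId = $xpath->evaluate('string(//body/@id)');

print $heading . "\n"; // Hello World
print $bodyId . "\n";  // foo
```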


### The Low-Level API

This library provides the following low-level APIs that you can use to
create more customized HTML5 tools:

- An `InputStream` abstraction that can work with different kinds of
input sources (not just files and strings).
- A SAX-like event-based parser that you can hook into for special kinds
of parsing.
- A flexible error-reporting mechanism that can be tuned to document
syntax checking.
- A DOM implementation that uses PHP's built-in DOM library.

The unit tests exercise each piece of the API, and every public function
is documented.

### Parser Design

The parser is designed as follows:

- The `InputStream` portion handles direct I/O.
- The `Scanner` handles scanning on behalf of the parser.
- The `Tokenizer` requests data from the scanner, parses it, classifies
it, and sends it to an `EventHandler`. It is a *recursive descent parser.*
- The `EventHandler` receives notifications and data for each specific
semantic event that occurs during tokenization.
- The `DOMBuilder` is an `EventHandler` that listens for tokenizing
events and builds a document tree (`DOMDocument`) based on the events.
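
The event flow between a tokenizer and a tree builder can be illustrated with a minimal sketch. Every name here (`SketchEventHandler`, `SketchDomBuilder`, `sketchTokenize`) is hypothetical and is not this library's API. The toy tokenizer is a simple regex splitter rather than a recursive descent parser, and it only handles well-formed, attribute-free tags.

```php
<?php
// Hypothetical interface; the library's real EventHandler may differ.
interface SketchEventHandler
{
    public function startTag($name);
    public function endTag($name);
    public function text($data);
}

// Builds a DOMDocument from the events it receives.
class SketchDomBuilder implements SketchEventHandler
{
    private $doc;
    private $current;

    public function __construct()
    {
        $this->doc = new DOMDocument('1.0', 'UTF-8');
        $this->current = $this->doc;
    }

    public function startTag($name)
    {
        // Append the new element and descend into it.
        $this->current = $this->current->appendChild($this->doc->createElement($name));
    }

    public function endTag($name)
    {
        // Climb back to the parent (no error recovery in this sketch).
        if ($this->current->parentNode !== null) {
            $this->current = $this->current->parentNode;
        }
    }

    public function text($data)
    {
        // Assumes text only occurs inside an element.
        $this->current->appendChild($this->doc->createTextNode($data));
    }

    public function document()
    {
        return $this->doc;
    }
}

// Toy tokenizer: splits the input on simple tags and emits one event per piece.
function sketchTokenize($input, SketchEventHandler $events)
{
    $parts = preg_split(
        '/(<\/?[a-zA-Z][a-zA-Z0-9]*>)/',
        $input,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    foreach ($parts as $part) {
        if (preg_match('/^<\/([a-zA-Z][a-zA-Z0-9]*)>$/', $part, $m)) {
            $events->endTag($m[1]);
        } elseif (preg_match('/^<([a-zA-Z][a-zA-Z0-9]*)>$/', $part, $m)) {
            $events->startTag($m[1]);
        } else {
            $events->text($part);
        }
    }
}

$builder = new SketchDomBuilder();
sketchTokenize('<p>Hello <b>World</b></p>', $builder);
$doc = $builder->document();
$bold = $doc->getElementsByTagName('b')->item(0)->textContent;
$para = $doc->getElementsByTagName('p')->item(0)->textContent;
print $bold . "\n"; // World
```

Keeping the tokenizer ignorant of what the handler does with each event is what allows the same token stream to feed a DOM builder, a validator, or any other consumer.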

### Serializer Design

The serializer takes a data structure (the `DOMDocument`) and transforms
it into a character representation -- an HTML5 document.

The serializer is broken into two parts:

- The `Traverser`, which is a special-purpose tree walker. It visits
each node and transforms it into a string.
- The `Serializer` manages the `Traverser` and stores the resultant data
in the correct place.

The serializer (`save()`, `saveHTML()`) follows
[section 8.9 of the HTML 5.0 spec](http://www.w3.org/TR/2012/CR-html5-20121217/syntax.html#serializing-html-fragments),
so tags are serialized according to these rules:

- A tag with children: &lt;foo&gt;CHILDREN&lt;/foo&gt;
- A tag that cannot have content: &lt;foo&gt; (no closing tag)
- A tag that could have content, but doesn't: &lt;foo&gt;&lt;/foo&gt;
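
The three rules above can be sketched as a small recursive tree walker. This is only an illustration under stated assumptions: `sketchSerialize` is a hypothetical helper, not the library's `Traverser`; the list of void elements is deliberately partial; and comments, doctypes, and the spec's finer escaping rules are ignored.

```php
<?php
// Partial list of void elements (tags that cannot have content), for illustration.
$voidElements = array('br', 'hr', 'img', 'input', 'meta', 'link');

// Hypothetical helper, not the library's Traverser.
function sketchSerialize(DOMNode $node, array $void)
{
    if ($node instanceof DOMText) {
        // Escape &, < and > in text content.
        return htmlspecialchars($node->data, ENT_NOQUOTES, 'UTF-8');
    }
    if (!($node instanceof DOMElement)) {
        return ''; // comments, doctypes, etc. are skipped in this sketch
    }

    $name = $node->tagName;
    $out = '<' . $name;
    foreach ($node->attributes as $attr) {
        $out .= ' ' . $attr->name . '="'
            . htmlspecialchars($attr->value, ENT_COMPAT, 'UTF-8') . '"';
    }
    $out .= '>';

    if (in_array($name, $void, true)) {
        return $out; // cannot have content: no closing tag
    }

    // With or without children, a non-void element gets a closing tag.
    foreach ($node->childNodes as $child) {
        $out .= sketchSerialize($child, $void);
    }
    return $out . '</' . $name . '>';
}

$dom = new DOMDocument('1.0', 'UTF-8');
$div = $dom->appendChild($dom->createElement('div'));
$div->appendChild($dom->createElement('p', 'text')); // tag with children
$div->appendChild($dom->createElement('br'));        // tag that cannot have content
$div->appendChild($dom->createElement('span'));      // tag that could have content, but doesn't

$result = sketchSerialize($div, $voidElements);
print $result . "\n"; // <div><p>text</p><br><span></span></div>
```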

## Thanks to...

We owe a huge debt of gratitude to the original authors of html5lib.

While not much of the original parser remains, we learned a lot from
reading the html5lib library, and some pieces remain here. In
particular, much of the UTF-8 and Unicode handling is derived from the
html5lib project.

## License

This software is released under the MIT license. The original html5lib
library was also released under the MIT license.