author Andres Rey <[email protected]> 2016-11-25 23:59:40 +0000
committer Andres Rey <[email protected]> 2016-11-25 23:59:40 +0000
commit f0b558c72305c98047e1bf63c6f39af975b19caf
tree a17cb84b177b3173b6800ecbdfbab151466ef4f9 /README.md
parent 7a3716fcbf696bc92ba8914f32e1dfa85301b9ac
Updated readme for release. Added cleanheaders.
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 27
1 files changed, 24 insertions, 3 deletions
diff --git a/README.md b/README.md
index f7d910a..93956e7 100644
--- a/README.md
+++ b/README.md
@@ -48,16 +48,37 @@ Of course the main limitation is PHP. Websites that load the content through laz
## Known Issues
-None so far.
+DOMDocument has trouble parsing JavaScript that contains unescaped HTML inside string literals. Consider the following code:
+
+```html
+<div> <!-- Offending div without closing tag -->
+<script type="text/javascript">
+ var test = '</div>';
+ // I should not appear on the result
+</script>
+```
-## To-do
+If you want to remove the scripts from the HTML (as Readability does), you would expect to end up with just one div and one comment in the final HTML. The problem is that libxml treats the closing div tag inside the JavaScript string as an HTML tag, effectively closing the unclosed tag and leaving the rest of the JavaScript as a string within a P tag. If you save that node, the final HTML will end up like this:
-100% of the original readability code was ported, at least until the last commit when I started this project ([13 Aug 2016](https://github.com/mozilla/readability/commit/71aa562387fa507b0bac30ae7144e1df7ba8a356)). There are a lot of `TODO`s around the code, which are the part that need to be finished.
+```html
+<div> <!-- Offending div without closing tag -->
+<p>';
+ // I should not appear on the result
+</p></div>
+```
+
+This is a libxml issue and not a Readability.php bug.
## Dependencies
Readability uses the Element interface and class from *The PHP League's* **[html-to-markdown](https://github.com/thephpleague/html-to-markdown/)**. The Readability object is an extension of the Element class. It overrides some methods but relies on it for basic DOMElement parsing.
+## To-do
+
+100% of the original readability code was ported, at least until the last commit when I started this project ([13 Aug 2016](https://github.com/mozilla/readability/commit/71aa562387fa507b0bac30ae7144e1df7ba8a356)). There are a lot of `TODO`s around the code, which mark the parts that still need to be finished.
+
+- Right now the Readability object extends the Element object from html-to-markdown. This is a problem because you lose the scoring when creating a new Readability object. A DOMDocument is consistent across the whole document: changing a value on one node updates that node in every variable that references it. Going through the Element interface loses that reference, so the score must be restored manually. Ideally, the Readability object would extend DOMDocument or DOMElement, the score would be stored within that object, and no restoration or recalculation would be needed.
+
## How it works
Readability parses all the text with DOMDocument, scans the text nodes and gives them a score based on the number of words, links and type of element. Then it selects the highest-scoring element and creates a new DOMDocument with all its siblings. Each sibling is scored to discard useless elements, like nav bars, empty nodes, etc.
\ No newline at end of file
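
The scoring idea described under "How it works" can be illustrated with a small sketch. This is not the Readability.php implementation: the function names `scoreElement` and `pickBestCandidate`, the per-tag bonus values and the link-density discount are all assumptions made up for illustration. It only uses standard `DOMDocument` calls and scores `div` elements by word count, element type and how much of their text sits inside links.

```php
<?php
// Hypothetical sketch of word/link/element-type scoring.
// Names and weights are illustrative assumptions, not Readability.php's API.

function scoreElement(DOMElement $el): float
{
    $text = trim($el->textContent);
    if ($text === '') {
        return 0.0; // Empty nodes are useless candidates.
    }

    $words = str_word_count($text);

    // Measure how much of the text lives inside <a> descendants.
    $linkChars = 0;
    foreach ($el->getElementsByTagName('a') as $a) {
        $linkChars += strlen(trim($a->textContent));
    }
    $linkDensity = $linkChars / strlen($text);

    // Assumed bonus per element type; unknown tags get no bonus.
    $bonuses = ['div' => 5, 'p' => 3];
    $bonus = $bonuses[$el->tagName] ?? 0;

    // Link-heavy elements (nav bars, menus) are discounted towards zero.
    return ($words + $bonus) * (1 - $linkDensity);
}

function pickBestCandidate(string $html): ?DOMElement
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // Keep libxml parse warnings quiet.
    $doc->loadHTML($html);
    libxml_clear_errors();

    $best = null;
    $bestScore = 0.0;
    foreach ($doc->getElementsByTagName('div') as $div) {
        $score = scoreElement($div);
        if ($score > $bestScore) {
            $best = $div;
            $bestScore = $score;
        }
    }

    return $best;
}

// Usage: a link-only navigation div next to a text-only content div.
$html = '<div id="nav"><a href="#">Home</a> <a href="#">About</a></div>'
      . '<div id="content"><p>Readability scores each element by counting the words it contains.</p></div>';

$best = pickBestCandidate($html);
echo $best === null ? 'no candidate' : $best->getAttribute('id');
```

With the sample input in the block, the navigation div is almost entirely link text, so its score is discounted heavily and the content div is selected. The real library additionally rebuilds a document from the winner's siblings, which this sketch leaves out.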