author Andres Rey <[email protected]> 2017-07-09 11:32:02 +0100
committer Andres Rey <[email protected]> 2017-07-09 11:32:02 +0100
commit 8626e09f7bb3f282f47802326c53a05edf820f8a
tree adb18c6a22a7a4c1ba3f8bb9e8d2582bda26addd /README.md
parent 451cd8d7705c7f5fbf655764382320ddf2a29ea3
Initial correction of new test cases
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 12
1 files changed, 9 insertions, 3 deletions
diff --git a/README.md b/README.md
index 5c98178..8da8a17 100644
--- a/README.md
+++ b/README.md
@@ -88,11 +88,17 @@ Readability uses the Element interface and class from *The PHP League's* **[html
## To-do
-100% of the original readability code was ported, at least until the last commit when I started this project ([13 Aug 2016](https://github.com/mozilla/readability/commit/71aa562387fa507b0bac30ae7144e1df7ba8a356)). There are a lot of `TODO`s around the code, which are the part that need to be finished.
-
- Right now the Readability object is an extension of the Element object of html-to-markdown. This is a problem because you lose context. The scoring when creating a new Readability object must be reloaded manually. The DOMDocument object is consistent across the same document. You change one value here and that will update all other nodes in other variables. By using the element interface you lose that reference and the score must be restored manually. Ideally, the Readability object should be an extension of the DOMDocument or DOMElement objects, the score should be saved within that object and no restoration or recalculation would be needed.
- There are a lot of problems with responsabilities. Right now there are two classes: HTMLParser and Readability. HTMLParser does a lot of things that should be a responsibility of Readability. It also does a lot of things that should be part of another class, specially when building the final article DOMDocument.
## How it works
-Readability parses all the text with DOMDocument, scans the text nodes and gives the a score, based on the amount of words, links and type of element. Then it selects the highest scoring element and creates a new DOMDocument with all its siblings. Each sibling is scored to discard useless elements, like nav bars, empty nodes, etc.
\ No newline at end of file
+Readability parses all the text with DOMDocument, scans the text nodes and gives the a score, based on the amount of words, links and type of element. Then it selects the highest scoring element and creates a new DOMDocument with all its siblings. Each sibling is scored to discard useless elements, like nav bars, empty nodes, etc.
+
+## Code porting
+
+Current version follows the latest version of readability.js as of [05 May 2015](https://github.com/mozilla/readability/commit/f0edc77cb58ef52890e3065cf2b0e334d940feb2).
+
+### TO-DOs of the current port:
+
+ - Port `_cleanStyles` to avoid style attributes inside other tags (like `<p style="hello ">`)
\ No newline at end of file
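
Note on the "How it works" paragraph in this diff: the scoring pass it describes can be sketched roughly as below. This is an illustrative Python sketch, not the library's PHP code. The weights in `TAG_WEIGHTS`, the scoring formula, the zero threshold, and the names `score`, `link_density` and `extract_article` are all assumptions made up for the example, and the sample markup is deliberately well-formed so the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-tag weights; the real port uses its own values.
TAG_WEIGHTS = {"article": 10, "div": 5, "nav": -25, "footer": -25}


def text_of(el):
    """All text inside an element, including its descendants."""
    return "".join(el.itertext())


def link_density(el):
    """Share of the element's text that sits inside <a> tags."""
    total = len(text_of(el))
    if total == 0:
        return 0.0
    linked = sum(len(text_of(a)) for a in el.iter("a"))
    return linked / total


def score(el):
    """Word count, discounted by link density, plus a weight for the tag."""
    words = len(text_of(el).split())
    if words == 0:
        return 0.0  # empty nodes never pass the threshold below
    return words * (1.0 - link_density(el)) + TAG_WEIGHTS.get(el.tag, 0)


def extract_article(root, threshold=0.0):
    """Pick the top-scoring element, then keep it and its good siblings."""
    parents = {child: parent for parent in root.iter() for child in parent}
    top = max(parents, key=score)  # every element except the root
    article = ET.Element("div")    # stands in for the new document
    for sibling in parents[top]:
        if sibling is top or score(sibling) > threshold:
            article.append(sibling)
    return article


SAMPLE = """<body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div id="main"><p>Readability keeps the part of a page that reads like an article and drops the rest.</p></div>
<div id="empty"></div>
<div id="more"><p>A second paragraph with enough words to be kept as a sibling.</p></div>
<footer><a href="/legal">Legal</a></footer>
</body>"""

article = extract_article(ET.fromstring(SAMPLE))
print([el.get("id") for el in article])  # ['main', 'more']
```

In this sample the nav bar and footer are dropped because their text is almost entirely links and their tags carry a negative weight, and the empty `div` is dropped because it has no words to score.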