# Readability.php

[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php)

PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses HTML text (usually news and other articles) and tries to return the title, byline and text content. It analyzes each text node, gives it a score and orders the nodes based on this calculation.

**Requires**: PHP 5.4+ & DOMDocument (libxml)

**Lead Developer**: Andres Rey

## Status

Current status is *ultra-mega-alpha*. It is broken right now and it will change dramatically until the first 1.0 release. Expect wild changes. Submit pull requests. Argue with me.

## How to use it

First you have to require the library using composer:

`composer require andreskrey/readability.php`

Then, create an HTMLParser object with your preferences, feed the `parse()` function with your HTML and check the resulting array:

```php
use andreskrey\Readability\HTMLParser;

$readability = new HTMLParser();

// Fetch the raw HTML of the article to parse
$html = file_get_contents('http://your.favorite.newspaper/article.html');

// Returns an array on success, or false if parsing failed
$result = $readability->parse($html);
```

The `$result` variable will now hold the following information:

```
$result = [
    'title' => 'Title of the article',
    'author' => 'Name of the author of the article',
    'image' => 'Main image of the article',
    'article' => 'DOMDocument with the full article text, scored and parsed'
];
```

If the parsing process was unsuccessful, the HTMLParser will return `false`.

## Options

- **maxTopCandidates**: default value `5`, max amount of top level candidates.
- **articleByLine**: default value `false`, search for the article byline.
- **stripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Useful for debugging or parsing complex or non-standard articles.
- **cleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
- **weightClasses**: default value `true`, weight classes during the rating phase.
- **removeReadabilityTags**: default value `true`, remove the data-readability tags inside the nodes that are added during the rating phase.
- **fixRelativeURLs**: default value `false`, convert relative URLs to absolute ones, like `/test` to `http://host/test`.
- **substituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities, like `&aacute;` to á.
- **originalURL**: default value `http://fakehost`, original URL of the article, used to fix relative URLs.

## Limitations

Of course the main limitation is PHP. Content that websites load through lazy loading, AJAX, or any other type of JavaScript-fueled call will be ignored (actually, *not run*), so the resulting text will be incorrect compared to the Readability.js results. All the articles you want to parse with readability.php need to be complete, with all the content already in the HTML.

## Known Issues

DOMDocument has some issues while parsing JavaScript with unescaped HTML in strings. Consider the following code:

```html
'; // I should not appear on the result