# Readability.php

[![Latest Stable Version](https://poser.pugx.org/andreskrey/readability.php/v/stable)](https://packagist.org/packages/andreskrey/readability.php) [![StyleCI](https://styleci.io/repos/71042668/shield?branch=master)](https://styleci.io/repos/71042668) [![Build Status](https://travis-ci.org/andreskrey/readability.php.svg?branch=master)](https://travis-ci.org/andreskrey/readability.php) [![Total Downloads](https://poser.pugx.org/andreskrey/readability.php/downloads)](https://packagist.org/packages/andreskrey/readability.php) [![Monthly Downloads](https://poser.pugx.org/andreskrey/readability.php/d/monthly)](https://packagist.org/packages/andreskrey/readability.php)

PHP port of *Mozilla's* **[Readability.js](https://github.com/mozilla/readability)**. Parses HTML text (usually news and other articles) and returns the title, byline and text content without nav bars, ads, footers, or anything that isn't the main body of the text. Analyzes each text node, gives it a score and orders the nodes based on this calculation.

**Requires**: PHP 5.4+ & DOMDocument (libxml)

**Lead Developer**: Andres Rey

## Status

Current status is stable. Version 1.0 is around the corner.

## How to use it

First you have to require the library using composer:

`composer require andreskrey/readability.php`

Then, create an `HTMLParser` object with your preferences, feed the `parse()` function with your HTML and check the resulting array:

```php
use andreskrey\Readability\HTMLParser;

$readability = new HTMLParser();

$html = file_get_contents('http://your.favorite.newspaper/article.html');

$result = $readability->parse($html);
```

The `$result` variable will now hold the following information:

```
$result = [
    'title' => 'Title of the article',
    'author' => 'Name of the author of the article',
    'image' => 'Main image of the article',
    'images' => 'All images of the article',
    'article' => 'DOMDocument with the full article text, scored and parsed'
];
```

If the parsing process was unsuccessful, the HTMLParser will return `false`.

## Options

- **maxTopCandidates**: default value `5`, max amount of top level candidates.
- **wordThreshold**: default value `500`, minimum amount of characters to consider that the article was parsed successfully.
- **articleByLine**: default value `false`, search for the article byline and remove it from the text. It will be moved to the article metadata.
- **stripUnlikelyCandidates**: default value `true`, remove nodes that are unlikely to have relevant information. Useful for debugging or parsing complex or non-standard articles.
- **cleanConditionally**: default value `true`, remove certain nodes after parsing to return a cleaner result.
- **weightClasses**: default value `true`, weight classes during the rating phase.
- **removeReadabilityTags**: default value `true`, remove the data-readability tags inside the nodes that are added during the rating phase.
- **fixRelativeURLs**: default value `false`, convert relative URLs to absolute. Like `/test` to `http://host/test`.
- **substituteEntities**: default value `false`, disables the `substituteEntities` flag of libxml. Will avoid substituting HTML entities. Like `&aacute;` to á.
- **normalizeEntities**: default value `false`, converts UTF-8 characters to their HTML entity equivalents. Useful to parse HTML with mixed encoding.
- **originalURL**: default value `http://fakehost`, original URL of the article, used to fix relative URLs.
- **summonCthulhu**: default value `false`, remove all `<script>` tags via regex.

If you want to remove the scripts from the HTML (like Readability does), you would expect to end up with just one div and one comment in the final HTML. The problem is that libxml takes a closing div tag inside a JavaScript string as an HTML tag, effectively closing the unclosed tag and leaving the rest of the JavaScript as a string within a P tag. If you save that node, the final HTML will end up like this:

```html
<p>'; // I should not appear on the result</p>
```

This is a libxml issue and not a Readability.php bug.

There's a workaround for this: the `summonCthulhu` option. It removes all script tags **via regex**, which is not ideal because you may end up summoning [the lord of darkness](https://stackoverflow.com/a/1732454).

### `&nbsp;` entities disappearing

`&nbsp;` entities are converted to spaces automatically by libxml and there's no way to disable it.

### Self closing tags rendering as fully expanded tags

Self-closing tags like `<br />` get automatically expanded to `<br></br>`.

## License

Copyright (c) 2010 Arc90 Inc

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
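
The `summonCthulhu` option is described above as removing script tags via regex before libxml gets a chance to misread them. The following is a minimal sketch of that idea and not the library's actual implementation: the `stripScriptTags()` helper, its pattern and the sample markup are all assumptions made for illustration.

```php
<?php

/**
 * Hypothetical helper (not part of Readability.php): removes <script>
 * elements from raw HTML with a regular expression, so that libxml never
 * sees the JavaScript source.
 */
function stripScriptTags($html)
{
    // Match an opening <script ...> tag, then lazily consume everything up
    // to the first closing </script>. "i" makes the match case-insensitive,
    // "s" lets "." span newlines.
    $stripped = preg_replace('#<script\b[^>]*>.*?</script\s*>#is', '', $html);

    // preg_replace() returns null on error; fall back to the original input.
    return $stripped === null ? $html : $stripped;
}

// Made-up sample input: a closing div tag hidden inside a JavaScript string.
$html = "<div><p>This is the text</p><script>var test = '</div>';</script></div>";

$clean = stripScriptTags($html);

echo $clean . PHP_EOL;
```

For this sample the script element is dropped and the surrounding markup is left untouched. Like any regex-based HTML handling, the pattern is easy to fool (for example, by a literal `</script>` inside a JavaScript string), which is the trade-off the option name hints at.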