As a general rule of thumb, you should avoid parsing HTML/XML content with regular expressions. Here's why:
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
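To make the quoted point concrete, here is a minimal sketch of two naive patterns going wrong on nested and sibling tags. The patterns and sample markup are made up for illustration; they are not taken from the question.

```php
<?php
// Lazy pattern: tries to pair <div> with the nearest closing tag.
$lazy = '/<div>(.*?)<\/div>/s';
// Greedy pattern: tries to pair <div> with the last closing tag.
$greedy = '/<div>(.*)<\/div>/s';

// Nested markup: the lazy pattern stops at the inner </div>.
$nested = '<div>outer <div>inner</div> tail</div>';
preg_match($lazy, $nested, $m);
$lazyCapture = $m[1];

// Sibling markup: the greedy pattern swallows both elements.
$siblings = '<div>a</div><div>b</div>';
preg_match($greedy, $siblings, $m);
$greedyCapture = $m[1];

echo $lazyCapture, PHP_EOL;   // outer <div>inner
echo $greedyCapture, PHP_EOL; // a</div><div>b
```

Neither capture corresponds to the content of a real element, which is the kind of mismatch a DOM parser avoids.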
    # Install Symfony's DOM parser using Composer
    composer require symfony/dom-crawler symfony/css-selector
Using such parsers, you have the ability to extract elements using XPath as well.
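As a rough illustration of XPath extraction, the sketch below uses PHP's built-in DOM extension (`DOMDocument` and `DOMXPath`) rather than Symfony's crawler, so it runs without Composer. It assumes the DOM extension is enabled, and the inline fragment is only a stand-in for a fetched page.

```php
<?php
// Stand-in markup containing a price element.
$html = '<div><span id="ProductPrice" class="product-header-title -price" '
      . 'itemprop="price" content="220">$220.00</span></div>';

// Collect libxml warnings internally instead of printing them;
// real-world pages are rarely valid markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the span by its itemprop attribute.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@itemprop="price"]');

$price  = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
$amount = $nodes->length > 0 ? $nodes->item(0)->getAttribute('content') : null;

echo "Price: $price (content attribute: $amount)", PHP_EOL;
```

Selecting by attribute sidesteps class names entirely, which helps when a class is awkward to express as a CSS selector.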
Getting the price
So you were getting this error running the above code:
    PHP Fatal error: Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.
That's because the price element has a class name that starts with a hyphen (-price), and Symfony's CSS selector component cannot parse the resulting .-price selector, hence the exception. Here's the element:
<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>
To work around it, let's use the itemprop attribute instead. Here's the selector that can match it:

    .product-header-title[itemprop="price"]

I updated the above code accordingly, tested it, and it works for the price part.
Getting the stock status
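The stock status is different: it is set by JavaScript after the page loads, so it is not in the HTML returned by file_get_contents(). You can see it for yourself: refresh the page, and the button first appears as Buy Now, then a second later its label changes.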
But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.
Feel free to skip these approaches as they are not specific to your problem. Jump straight to number 6, if you just want a solution to your question.
Also, see Behat's Mink that comes with various drivers for both headless browsers as well as full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.
$script = $crawler->filterXPath('//head/following-sibling::script')->text();
If the application makes ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the responses for your own application's sake. An example is the ajax call the page makes to cart.js to retrieve the data related to the cart items. But that's not the case for reading the product variant quantity here.
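Re-dispatching such a request could be sketched roughly as follows. `decodeJsonBody()` and `fetchJson()` are hypothetical helpers, and the sample payload is invented so that the decoding step can run offline; it does not reflect the real cart.js response.

```php
<?php
/**
 * Decode a JSON response body into an array, or return null when the
 * body is not a JSON object/array. (Hypothetical helper.)
 */
function decodeJsonBody(string $body): ?array
{
    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

/** Fetch a URL and decode its JSON body; null on any failure. */
function fetchJson(string $url): ?array
{
    $body = @file_get_contents($url);
    return $body === false ? null : decodeJsonBody($body);
}

// Invented payload, used only to exercise the decoding step.
$sample = '{"item_count": 2, "items": [{"id": 1}, {"id": 2}]}';
$cart = decodeJsonBody($sample);
$itemCount = count($cart['items']);

echo $itemCount, PHP_EOL;
```

In real use you would pass the endpoint's URL to `fetchJson()` and inspect the decoded array to see which fields the site actually returns.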
You may recall that I told you that it's a bad idea to use regular expressions to parse entire HTML/XML documents. But it's OK to use them partially to extract strings from an HTML/XML document when other approaches are even harder. Read the SO answer I quoted at the top of this post if you're unsure when to use them.
This approach is about matching the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, against only the script tag):
    <?php
    require 'vendor/autoload.php';

    use Symfony\Component\DomCrawler\Crawler;

    $html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
    $crawler = new Crawler($html);

    // Match the price by its itemprop attribute rather than the -price class.
    $price = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());

    // Capture the quantity that follows the variant ID in the page source.
    preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
    $in_stock = $in_stock[1]; // first capture group

    echo "Price: $price - Availability: $in_stock";
    // Outputs:
    // Price: $220.00 - Availability: 0
This regex needs a variant ID (35276776455 in this case) to work, as the quantity of each product comes with a variant. You can extract it from the URL's query string:
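One way to do that is with PHP's `parse_url()` and `parse_str()`. The sketch below is only an illustration: `variantIdFromUrl()` is a hypothetical helper, and it assumes the parameter is named `variant`, as in the URL used in this answer.

```php
<?php
/** Return the numeric "variant" query parameter of a URL, or null if absent. */
function variantIdFromUrl(string $url): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }
    parse_str($query, $params);
    $variant = $params['variant'] ?? null;

    // Keep the ID as a string: it may not fit in a 32-bit integer.
    return is_string($variant) && preg_match('/^\d+$/', $variant) === 1
        ? $variant
        : null;
}

$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$variantId = variantIdFromUrl($url);

// Build the quantity pattern from the extracted ID.
$pattern = '/' . preg_quote($variantId, '/') . ',.+?inventory_quantity":(\d+)/';

echo $variantId, PHP_EOL; // 35276776455
```

The resulting `$pattern` can then replace the hard-coded variant ID in the `preg_match()` call.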
Now that we're done with the stock status and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:
    <?php
    $html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

    // You need to check if it's matched before assigning
    // $price. Anyway, this is just an example.
    // The dollar sign is captured too, so the price prints as $220.00.
    preg_match('/itemprop="price".+?>\s*(\$.+?)\s*<\/span>/s', $html, $price);
    $price = $price[1]; // first capture group

    preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
    $in_stock = $in_stock[1]; // first capture group

    echo "Price: $price - Availability: $in_stock";
    // Outputs:
    // Price: $220.00 - Availability: 0
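The comment in the snippet above notes that a match should be checked before its result is used. A minimal sketch of such a check follows; `firstGroup()` is a hypothetical helper, and the HTML is the price element quoted earlier rather than a live page.

```php
<?php
/**
 * Return the first capture group of $pattern in $subject,
 * or null when the pattern does not match. (Hypothetical helper.)
 */
function firstGroup(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$html = '<span id="ProductPrice" class="product-header-title -price" '
      . 'itemprop="price" content="220">$220.00</span>';

$price = firstGroup('/itemprop="price".+?>\s*(\$.+?)\s*<\/span>/s', $html);

// This sample has no inventory data, so the lookup yields null
// instead of an "undefined offset" notice.
$quantity = firstGroup('/inventory_quantity":(\d+)/', $html);

echo 'Price: ', $price ?? 'n/a', ' - Availability: ', $quantity ?? 'n/a', PHP_EOL;
```

Returning null on a failed match lets the caller decide how to report missing data instead of failing on an absent array index.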
- Use DOM parsers to parse/scrape the HTML code that initially exists in the page.
- Intercept ajax calls that may include information you want. Re-call them in a separate http request to get the data.
- Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and such.
- Partially use regex to extract what is not extractable using DOM parsers.
If you just want these two fields, you're fine to go with regex. Otherwise, consider other approaches.