Extracting specific data from a website using PHP

I am new to PHP and want to extract data such as inventory quantity and sizes from different websites. I'm a bit confused about how to go about this. Would DOMDocument be the way to go?

I'm not sure whether that is the best method for this.

I was attempting from lines 164-174 on here.

Any help is greatly appreciated!

EDIT - this is my updated code. I don't really think it's the most efficient way to do things, though.

<html>
<body>
<?php



$url = 'https://kithnyc.com/collections/adidas/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$html = file_get_contents($url);

//preg_match('~itemprop="image"\scontent="(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $image);
//$image = $image[1];

preg_match('~,"title":"(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $title);
$title = $title[1];


preg_match_all('~{"id":(\d+)~', $html, $id);
$id = $id[1];

preg_match_all('~","public_title":"(\d+..)~', $html, $size);
$size = $size[1];

preg_match_all('~inventory_quantity":(\d+)~', $html, $quantity);
$quantity = $quantity[1];


// Wrap every plain URL in the given text in an anchor tag.
function plain_url_to_link($url) {
    return preg_replace(
        '%(https?|ftp)://([-A-Z0-9./_*?&;=#]+)%i',
        '<a rel="nofollow" href="$0" target="_blank">$0</a>',
        $url
    );
}



$i = 0;
$j = 2;

echo "$title<br />";
echo "<br />";

//echo $image;

echo plain_url_to_link($url);
echo "<br />";
echo "<br />";

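// Loop over the sizes actually matched instead of a hard-coded count.
// $j (set to 2 above) walks through $id in step with $i.
for ($i = 0; $i < count($size); $i++) {
    print "Size: $size[$i] --- Quantity: $quantity[$i] --- ID: $id[$j]";
    $j++;
    echo "<br />";
}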


echo "<br />";
//print_r($quantity);




?>
</body>
</html>

dpiuqwwyu187975836 2016-12-24 09:14
As a general rule of thumb, you must avoid parsing HTML/XML content with regular expressions. Here's why:

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.

— https://stackoverflow.com/a/590789/65732
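As a small illustration of the point in the quote, the sketch below runs a naive regex and PHP's built-in DOMDocument over the same nested markup. The markup and the variable names are made up for this example and are not taken from the target page.

```php
<?php
// Hypothetical markup with a nested <div>, invented for this illustration.
$html = '<div id="outer"><div id="inner">size 9</div> in stock</div>';

// The lazy regex stops at the first closing tag, which belongs to the inner div,
// so it returns a truncated fragment that still contains markup.
preg_match('~<div id="outer">(.*?)</div>~s', $html, $m);
$viaRegex = $m[1];

// A DOM parser pairs opening and closing tags correctly.
libxml_use_internal_errors(true); // keep libxml warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$viaDom = $xpath->query('//div[@id="outer"]')->item(0)->textContent;

echo "Regex: $viaRegex\n"; // Regex: <div id="inner">size 9
echo "DOM:   $viaDom\n";   // DOM:   size 9 in stock
```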

Use a DOM parser instead which is specifically designed for the purpose of parsing HTML/XML documents. Here's an example:

# Installing Symfony's DOM parser using Composer
composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

$crawler = new Crawler($html);

$price = $crawler->filter('.product-header-title[itemprop="price"]')->text();

// UPDATE: Does not work! as the page updates the button text
// later with javascript. Read more for another solution.
$in_stock = $crawler->filter('#AddToCartText')->text();

if ($in_stock == 'Sold Out') {
    $in_stock = 0; // or `false`, if you will
}

echo "Price: $price - Availability: $in_stock";

// Outputs:
// Price: $220.00 - Availability: Buy Now
// We'll fix "Availability" later...

Using such parsers, you have the ability to extract elements using XPath as well.

But if you want to parse the javascript code included in that page, you'd better use a browser emulator like Selenium. Then you have programmatic access to all the globally available javascript vars/functions in that page.
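For the XPath option mentioned above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath, which need no Composer packages. It runs against an inline copy of the price element from the target page rather than a live request, so the surrounding `<html>`/`<body>` wrapper is an assumption made only to keep the example self-contained.

```php
<?php
// Inline sample: the price element from the target page, wrapped in a
// minimal document (the wrapper is an assumption for this sketch).
$html = '<html><body>'
      . '<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>'
      . '</body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; silence warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@itemprop="price"]');

// Guard against a missing element before reading from it.
$price  = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
$amount = $nodes->length > 0 ? $nodes->item(0)->getAttribute('content') : null;

echo "Price: $price (content attribute: $amount)\n";
// Price: $220.00 (content attribute: 220)
```

Selecting on the `itemprop` attribute sidesteps the class name entirely, and the `content` attribute gives the number without the currency sign.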

Update

Getting the price

So you were getting this error running the above code:

PHP Fatal error:
Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.

That's because the target page uses a class name with a leading hyphen for the price element (.-price), which Symfony's CSS selector component fails to parse, hence the exception. Here's the element:

<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>

To work around it, let's use the itemprop attribute instead. Here's the selector that can match it:

.product-header-title[itemprop="price"]

I updated the above code accordingly to reflect it. I tested it and it's working for the price part.

Getting the stock status

Now that I have actually tested the code, I see that the stock status of products is set later using JavaScript. It's not there when you fetch the page using file_get_contents(). You can see it for yourself: refresh the page and the button appears as Buy Now, then a second later it changes to Sold Out.

But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.

So now the problem is parsing javascript code with PHP. There are a few general approaches to tackle the problem:

Feel free to skip these approaches as they are not specific to your problem. Jump straight to number 6, if you just want a solution to your question.

1. The most reliable and common approach to scraping data from sites that rely heavily on JavaScript is to use a browser emulator like Selenium, which is able to execute JavaScript code. Have a look at Facebook's PHP WebDriver package, which is the most sophisticated PHP binding for Selenium WebDriver available. It provides you with an API to remotely control web browsers and execute JavaScript against them.

   Also, see Behat's Mink, which comes with various drivers for both headless browsers and full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.

2. See V8js, the PHP extension that embeds the V8 JavaScript engine into PHP. It allows you to evaluate JavaScript code right from your PHP script, though installing a PHP extension is a bit of overkill if you're not using the feature heavily. If you do go this way, you can extract the relevant script using the DOM parser:

   $script = $crawler->filterXPath('//head/following-sibling::script[2]')->text();

3. Use HtmlUnit to parse the page and then feed the final HTML to PHP. You're going to need a small Java wrapper. Right, overkill for your case.

4. Extract the JavaScript code and parse it using a JS parser/tokenizer library like hiltonjanfield/js4php5 or squizlabs/PHP_CodeSniffer, which has a JS tokenizer.

5. If the application is making ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the responses for your own application's sake. An example is the ajax call the page makes to cart.js to retrieve the data related to the cart items. But that's not the case for reading the product variant quantity here.

6. You may recall that I told you it's a bad idea to use regular expressions to parse entire HTML/XML documents. But it's OK to use them partially to extract strings from an HTML/XML document when other approaches are even harder. Read the SO answer I quoted at the top of this post if you are unsure about when to use them.

   This approach is about matching the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, only against the relevant script tag):

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

$crawler = new Crawler($html);

$price = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());

// \d+ rather than \d, so quantities of 10 or more are captured in full.
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];

echo "Price: $price - Availability: $in_stock";

// Outputs:
// Price: $220.00 - Availability: 0

This regex needs a variant ID (35276776455 in this case) to work, as the quantity of each product comes with a variant. You can extract it from the URL's query string: ?variant=35276776455.
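A minimal sketch of that step: it pulls the variant ID out of the query string with `parse_url()` and `parse_str()`, then builds the regex from it. The `$source` string is a hypothetical excerpt standing in for the page source (the real layout may differ), and the first ID in it is invented.

```php
<?php
$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';

// Extract the "variant" parameter from the URL's query string.
parse_str((string) parse_url($url, PHP_URL_QUERY), $query);
$variantId = $query['variant'] ?? null;

// Hypothetical excerpt of the embedded product data, assumed for this sketch.
$source = '{"id":111,"public_title":"8","inventory_quantity":3},'
        . '{"id":35276776455,"public_title":"9","inventory_quantity":0}';

$quantity = null;
if ($variantId !== null && ctype_digit($variantId)) {
    // preg_quote() keeps the ID literal; the lazy .+? stops at the first
    // inventory_quantity that follows this variant's ID.
    $pattern = '/' . preg_quote($variantId, '/') . ',.+?inventory_quantity":(\d+)/s';
    if (preg_match($pattern, $source, $m)) {
        $quantity = (int) $m[1];
    }
}

echo "Variant $variantId has quantity " . var_export($quantity, true) . "\n";
// Variant 35276776455 has quantity 0
```

Checking the return value of `preg_match()` before reading `$m[1]` avoids an undefined-index notice when the page layout changes.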

Now that we're done with the stock status and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:

<?php

$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');

// You need to check if it's matched before assigning
// $price[1]. Anyway, this is just an example.
preg_match('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html, $price);
$price = $price[1];

// \d+ rather than \d, so quantities of 10 or more are captured in full.
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];

echo "Price: $price - Availability: $in_stock";

// Outputs (the price regex captures the amount without the "$" sign):
// Price: 220.00 - Availability: 0
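The question also asked for the size and quantity of every variant, not just one. The sketch below does that with a single `preg_match_all()` call that captures the ID, size and quantity of each variant together, so the three values cannot drift out of step. The key names `id`, `public_title` and `inventory_quantity` come from the regexes in the question; the `$source` excerpt, the order of the keys and the second variant ID are assumptions.

```php
<?php
// Hypothetical excerpt of the embedded product object; the real page may differ.
$source = '"variants":[{"id":35276776455,"title":"8","public_title":"8","inventory_quantity":0},'
        . '{"id":35276776519,"title":"8.5","public_title":"8.5","inventory_quantity":4}]';

// One pattern per variant object: [^{}]*? keeps each match inside a single {...}.
$pattern = '/\{"id":(\d+),[^{}]*?"public_title":"([^"]*)"[^{}]*?"inventory_quantity":(-?\d+)/';
preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);

$variants = [];
foreach ($matches as $m) {
    $variants[] = [
        'id'       => $m[1],
        'size'     => $m[2],
        'quantity' => (int) $m[3],
    ];
}

foreach ($variants as $v) {
    echo "Size: {$v['size']} --- Quantity: {$v['quantity']} --- ID: {$v['id']}\n";
}
// Size: 8 --- Quantity: 0 --- ID: 35276776455
// Size: 8.5 --- Quantity: 4 --- ID: 35276776519
```

`PREG_SET_ORDER` groups the captures per match, which removes the need for a hard-coded loop count or a separate offset into the ID list.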

Conclusion

Even though I still believe that it's a bad idea to parse HTML/XML documents with regex, I must admit that the available DOM parsers are not able to parse embedded JavaScript code (and probably never will be), which is your case. We can partially use regular expressions to extract strings from HTML/XML: the parts that are not parsable using DOM parsers. So, all in all:

Use DOM parsers to parse/scrape the HTML code that initially exists in the page.

Intercept ajax calls that may include information you want. Re-call them in a separate http request to get the data.

Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and such.

Partially use regex to extract what is not extractable using DOM parsers.

If you just want these two fields, you're fine to go with regex. Otherwise, consider other approaches.
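For the second point in the list above (re-calling an ajax request yourself), a rough sketch could look like the following. `fetch_json()` is a hypothetical helper written for this example, and the `item_count`/`items` keys in the sample are assumed rather than taken from the real cart.js response. To keep the sketch runnable without network access, a temporary local file stands in for the endpoint.

```php
<?php
/**
 * Hypothetical helper: fetch a URL (or local path) and decode its body as JSON.
 * Returns the decoded array, or null if the request failed or the body was not JSON.
 */
function fetch_json(string $url): ?array
{
    // The "http" options only apply when $url is an http(s) URL.
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => "Accept: application/json\r\n",
            'timeout' => 10,
        ],
    ]);

    $body = @file_get_contents($url, false, $context);
    if ($body === false) {
        return null;
    }

    $data = json_decode($body, true);
    return is_array($data) ? $data : null;
}

// A local file stands in for the endpoint here; in practice you would pass
// the URL of the ajax call seen in the browser's network tab.
$sample = tempnam(sys_get_temp_dir(), 'cart');
file_put_contents($sample, '{"item_count":0,"items":[]}');

$cart = fetch_json($sample);
echo "Items in cart: " . count($cart['items']) . "\n";

unlink($sample);
```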
This answer was accepted by the asker as the best answer.
