使用JSOUP和COLDFUSION从网页中提取结构化数据

发布于 2025-01-26 19:24:33 字数 5242 浏览 2 评论 0 原文

我需要使用ColdFusion使用JSOUP(或任何其他有效方法)从网站上提取结构化数据。

数据的结构如下:

我需要从页面中获取JSON并将其解析为可用的变量。

我尝试了许多不同的选项而没有成功。我不知道JSOUP,并感谢您的帮助。

数据看起来像这样:

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Recipe",
  "name": "Party Coffee Cake",
  "image": [
    "https://example.com/photos/1x1/photo.jpg",
    "https://example.com/photos/4x3/photo.jpg",
    "https://example.com/photos/16x9/photo.jpg"
  ],
  "author": {
    "@type": "Person",
    "name": "Mary Stone"
  },
  "datePublished": "2018-03-10",
  "description": "This coffee cake is awesome and perfect for parties.",
  "prepTime": "PT20M",
  "cookTime": "PT30M",
  "totalTime": "PT50M",
  "keywords": "cake for a party, coffee",
  "recipeYield": "10",
  "recipeCategory": "Dessert",
  "recipeCuisine": "American",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "270 calories"
  },
  "recipeIngredient": [
    "2 cups of flour",
    "3/4 cup white sugar",
    "2 teaspoons baking powder",
    "1/2 teaspoon salt",
    "1/2 cup butter",
    "2 eggs",
    "3/4 cup milk"
    ],
  "recipeInstructions": [
    {
      "@type": "HowToStep",
      "name": "Preheat",
      "text": "Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.",
      "url": "https://example.com/party-coffee-cake#step1",
      "image": "https://example.com/photos/party-coffee-cake/step1.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Mix dry ingredients",
      "text": "In a large bowl, combine flour, sugar, baking powder, and salt.",
      "url": "https://example.com/party-coffee-cake#step2",
      "image": "https://example.com/photos/party-coffee-cake/step2.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Add wet ingredients",
      "text": "Mix in the butter, eggs, and milk.",
      "url": "https://example.com/party-coffee-cake#step3",
      "image": "https://example.com/photos/party-coffee-cake/step3.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Spread into pan",
      "text": "Spread into the prepared pan.",
      "url": "https://example.com/party-coffee-cake#step4",
      "image": "https://example.com/photos/party-coffee-cake/step4.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Bake",
      "text": "Bake for 30 to 35 minutes, or until firm.",
      "url": "https://example.com/party-coffee-cake#step5",
      "image": "https://example.com/photos/party-coffee-cake/step5.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Enjoy",
      "text": "Allow to cool and enjoy.",
      "url": "https://example.com/party-coffee-cake#step6",
      "image": "https://example.com/photos/party-coffee-cake/step6.jpg"
    }
  ],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "5",
    "ratingCount": "18"
  },
  "video": {
    "@type": "VideoObject",
    "name": "How to make a Party Coffee Cake",
    "description": "This is how you make a Party Coffee Cake.",
    "thumbnailUrl": [
      "https://example.com/photos/1x1/photo.jpg",
      "https://example.com/photos/4x3/photo.jpg",
      "https://example.com/photos/16x9/photo.jpg"
     ],
    "contentUrl": "http://www.example.com/video123.mp4",
    "embedUrl": "http://www.example.com/videoplayer?video=123",
    "uploadDate": "2018-02-05T08:00:00+08:00",
    "duration": "PT1M33S",
    "interactionStatistic": {
      "@type": "InteractionCounter",
      "interactionType": { "@type": "WatchAction" },
      "userInteractionCount": 2347
    },
    "expires": "2019-02-05T08:00:00+08:00"
  }
}
</script>

我尝试了以下内容:

<cfset source = "https://www.allrecipes.com/recipe/216319/homemade-sweet-italian-sausage-mild-or-hot/">

<cfhttp method="get" url="#source#" result="theresult" useragent="Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/533.7 (KHTML, like Gecko) Chrome/5.0.391.0 Safari/533.7"> 
<cfhttpparam type="header" name="Accept-Encoding" value="gzip,deflate,sdch" >
<cfhttpparam type="header" name="Proxy-Connection" value="keep-alive" >
<cfhttpparam type="header" name="Accept" value="application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5">
<cfhttpparam type="header" name="Accept-Language" value="en-US,en;q=0.8">
<cfhttpparam type="header" name="Accept-Charset" value="ISO-8859-1,utf-8;q=0.7,*;q=0.3">
<cfhttpparam type="cookie" name="some-cookie" value="1">
</cfhttp>

在上面的他中,我获得了网页。

然后,我尝试提取JSON:

<cfscript>
// Create the jsoup object
Jsoup = createObject("java", "org.jsoup.Jsoup");
// HTML string
html = "#theresult.filecontent#";
// Parse the string
document = Jsoup.parse(html);
// Extract content
title = document.title();
tags = document.select("script[type=application/ld+json]"); 
</cfscript>
<cfdump var="#tags#">
<cfloop index="e" array="#tags#">
<cfoutput>
    #e.attr("content")#<br>
</cfoutput>
</cfloop>

但是我什么也没返回。

I need to extract structured data for recipes from a website using JSOUP (or any other effective method) using Coldfusion.

The data is structure as follows: https://developers.google.com/search/docs/advanced/structured-data/recipe

I need to get the JSON from the page and parse it into useable variables.

I have tried a number of different options without success. I do not know JSOUP and will appreciate your help.

The data looks like this:

<script type="application/ld+json">
{
  "@context": "https://schema.org/",
  "@type": "Recipe",
  "name": "Party Coffee Cake",
  "image": [
    "https://example.com/photos/1x1/photo.jpg",
    "https://example.com/photos/4x3/photo.jpg",
    "https://example.com/photos/16x9/photo.jpg"
  ],
  "author": {
    "@type": "Person",
    "name": "Mary Stone"
  },
  "datePublished": "2018-03-10",
  "description": "This coffee cake is awesome and perfect for parties.",
  "prepTime": "PT20M",
  "cookTime": "PT30M",
  "totalTime": "PT50M",
  "keywords": "cake for a party, coffee",
  "recipeYield": "10",
  "recipeCategory": "Dessert",
  "recipeCuisine": "American",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "270 calories"
  },
  "recipeIngredient": [
    "2 cups of flour",
    "3/4 cup white sugar",
    "2 teaspoons baking powder",
    "1/2 teaspoon salt",
    "1/2 cup butter",
    "2 eggs",
    "3/4 cup milk"
    ],
  "recipeInstructions": [
    {
      "@type": "HowToStep",
      "name": "Preheat",
      "text": "Preheat the oven to 350 degrees F. Grease and flour a 9x9 inch pan.",
      "url": "https://example.com/party-coffee-cake#step1",
      "image": "https://example.com/photos/party-coffee-cake/step1.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Mix dry ingredients",
      "text": "In a large bowl, combine flour, sugar, baking powder, and salt.",
      "url": "https://example.com/party-coffee-cake#step2",
      "image": "https://example.com/photos/party-coffee-cake/step2.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Add wet ingredients",
      "text": "Mix in the butter, eggs, and milk.",
      "url": "https://example.com/party-coffee-cake#step3",
      "image": "https://example.com/photos/party-coffee-cake/step3.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Spread into pan",
      "text": "Spread into the prepared pan.",
      "url": "https://example.com/party-coffee-cake#step4",
      "image": "https://example.com/photos/party-coffee-cake/step4.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Bake",
      "text": "Bake for 30 to 35 minutes, or until firm.",
      "url": "https://example.com/party-coffee-cake#step5",
      "image": "https://example.com/photos/party-coffee-cake/step5.jpg"
    },
    {
      "@type": "HowToStep",
      "name": "Enjoy",
      "text": "Allow to cool and enjoy.",
      "url": "https://example.com/party-coffee-cake#step6",
      "image": "https://example.com/photos/party-coffee-cake/step6.jpg"
    }
  ],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "5",
    "ratingCount": "18"
  },
  "video": {
    "@type": "VideoObject",
    "name": "How to make a Party Coffee Cake",
    "description": "This is how you make a Party Coffee Cake.",
    "thumbnailUrl": [
      "https://example.com/photos/1x1/photo.jpg",
      "https://example.com/photos/4x3/photo.jpg",
      "https://example.com/photos/16x9/photo.jpg"
     ],
    "contentUrl": "http://www.example.com/video123.mp4",
    "embedUrl": "http://www.example.com/videoplayer?video=123",
    "uploadDate": "2018-02-05T08:00:00+08:00",
    "duration": "PT1M33S",
    "interactionStatistic": {
      "@type": "InteractionCounter",
      "interactionType": { "@type": "WatchAction" },
      "userInteractionCount": 2347
    },
    "expires": "2019-02-05T08:00:00+08:00"
  }
}
</script>

I have tried the following:

<cfset source = "https://www.allrecipes.com/recipe/216319/homemade-sweet-italian-sausage-mild-or-hot/">

<cfhttp method="get" url="#source#" result="theresult" useragent="Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/533.7 (KHTML, like Gecko) Chrome/5.0.391.0 Safari/533.7"> 
<cfhttpparam type="header" name="Accept-Encoding" value="gzip,deflate,sdch" >
<cfhttpparam type="header" name="Proxy-Connection" value="keep-alive" >
<cfhttpparam type="header" name="Accept" value="application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5">
<cfhttpparam type="header" name="Accept-Language" value="en-US,en;q=0.8">
<cfhttpparam type="header" name="Accept-Charset" value="ISO-8859-1,utf-8;q=0.7,*;q=0.3">
<cfhttpparam type="cookie" name="some-cookie" value="1">
</cfhttp>

With he above I get the web page.

I then try to extract the JSON:

<cfscript>
// Create the jsoup object
Jsoup = createObject("java", "org.jsoup.Jsoup");
// HTML string
html = "#theresult.filecontent#";
// Parse the string
document = Jsoup.parse(html);
// Extract content
title = document.title();
tags = document.select("script[type=application/ld+json]"); 
</cfscript>
<cfdump var="#tags#">
<cfloop index="e" array="#tags#">
<cfoutput>
    #e.attr("content")#<br>
</cfoutput>
</cfloop>

But I get nothing returned.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

霊感 2025-02-02 19:24:33
 &lt; script type =“ application/ld+json”&gt; [
    {
      “ @context”:“ http://schema.org”,
      “ @Type”:“ Bread Crumblist”,
       ...
    }
&lt;/script&gt;
 

&lt; script&gt; 标签没有名为“ content” 的属性(只有一个名为“ type” )。要检索标签内容(或其内部html),请使用 element.html()方法。然后将返回的内容当作json:

<cfscript>
   Jsoup = createObject("java", "org.jsoup.Jsoup");
   document = Jsoup.parse( theResult.fileContent );
   tag = document.select("script[type=application/ld+json]").first(); 
   if (isJSON(tag.html())) {
       contents = deserializeJSON( tag.html() );
       writeDump(contents);
   }
</cfscript>
<script type="application/ld+json">[
    {
      "@context": "http://schema.org",
      "@type": "BreadcrumbList",
       ...
    }
</script>

The <script> tag doesn't have an attribute named "content" (only one named "type"). To retrieve the tag contents (or its inner html) use the Element.html() method. Then deserialize the returned contents as json:

<cfscript>
   Jsoup = createObject("java", "org.jsoup.Jsoup");
   document = Jsoup.parse( theResult.fileContent );
   tag = document.select("script[type=application/ld+json]").first(); 
   if (isJSON(tag.html())) {
       contents = deserializeJSON( tag.html() );
       writeDump(contents);
   }
</cfscript>
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文