I have HTML containing data that I am trying to extract. I am working in bash, and since bash cannot parse HTML on its own, I pipe the HTML through pup (as recommended here on Stack Overflow). pup extracts part of the structure, but I am left with a large JSON document full of data I don't need, so I then run sed commands to delete the lines I don't require. I am looking for a way to use jq to select only the data I need, so that I no longer have to run sed to delete the unwanted lines.
So I run the command:
cat test.html | pup 'div.scene json{}' > out.json
This generates the following:
 [
  {
   "children": [
    {
     "children": [
      {
       "class": "icon-new active",
       "tag": "div"
      },
      {
       "children": [
        {
         "children": [
          {
           "alt": "Album Title - Artist Name - 1",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 2",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 3",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 4",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "alt": "Album Title - Artist Name - 5",
           "class": "lazy image-under",
           "data-src": "",
           "tag": "img",
           "title": "Album Title"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "tag": "span"
          },
          {
           "class": "last",
           "tag": "span"
          }
         ],
         "class": "sample-picker clearfix",
         "data-trackid": "bhangra-tracking-id",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "title": "Album Title"
        }
       ],
       "class": "card-overlay",
       "tag": "div"
      },
      {
       "children": [
       {
         "alt": "Album Title",
         "class": "lazy card-main-img",
         "data-src": "",
         "tag": "img",
         "title": "Album Title"
        }
       ],
       "data-trackid": "bhangra-tracking-id  ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "title": "Album Title"
      }
     ],
     "class": "card-image",
     "tag": "div"
    },
    {
     "children": [
      {
       "children": [
        {
         "data-trackid": "scene-card-info-title Album Title ",
         "href": "/bhangra/album/view/2842847/title-of-album/",
         "tag": "a",
         "text": "Album Title",
         "title": "Album Title"
        }
       ],
       "class": "scene-card-title",
       "tag": "div"
      },
      {
       "children": [
        {
         "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
         "href": "/bhangra/profile/view/2842847/artist-name/",
         "tag": "a",
         "text": "Artist Name",
         "title": "Artist Name"
        }
       ],
       "class": "model-names",
       "tag": "div"
      },
      {
       "tag": "time",
       "text": "September 08, 2018"
      },
      {
       "children": [
        {
         "children": [
          {
           "class": "label-left-box",
           "tag": "span",
           "text": "Website Name"
          },
          {
           "class": "label-text",
           "tag": "span",
           "text": "Website URL"
          }
         ],
         "class": "collection label-small",
         "data-trackid": "scene-card-collection",
         "href": "/bhangra/main/id/url/",
         "tag": "a",
         "title": "Website URL"
        },
        {
         "class": "label-hd ",
         "tag": "span"
        },
        {
         "children": [
          {
           "children": [
            {
             "class": "icons like-icon",
             "tag": "span"
            },
            {
             "class": "like-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "likes",
           "tag": "span"
          },
          {
           "children": [
            {
             "class": "icons dislike-icon",
             "tag": "span"
            },
            {
             "class": "dislike-amount",
             "tag": "var",
             "text": "0"
            }
           ],
           "class": "dislikes",
           "tag": "span"
          }
         ],
         "class": "label-rating",
         "tag": "span"
        }
       ],
       "class": "bhangra-information",
       "tag": "div"
      }
     ],
     "class": "scene-card-info",
     "tag": "div"
    }
   ],
   "class": "bhangra-card scene ",
   "tag": "div"
  }
 ]
I then use jq to return the details I want:
 cat out.json | jq '.[] | {"1": .children[1].children[0].children, "2": .children[1].children[1].children, "date": .children[1].children[2].text}'
This returns the following:
 {
   "1": [
     {
       "data-trackid": "scene-card-info-title Album Title ",
       "href": "/bhangra/album/view/2842847/title-of-album/",
       "tag": "a",
       "text": "Album Title",
       "title": "Album Title"
     }
   ],
   "2": [
     {
       "data-trackid": "scene-card-model name Artist Name modelid=1111 ",
       "href": "/bhangra/profile/view/2842847/artist-name/",
       "tag": "a",
       "text": "Artist Name",
       "title": "Artist Name"
     }
   ],
   "date": "September 08, 2018"
 }
With the above, the next album (Album2) also has the keys "1" and "2" followed by "date". The combined output is then not valid as a single JSON document, and I cannot target the data I want because the keys are the same for every album.
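A possible explanation, assuming the "invalid syntax" refers to jq printing one object per album back to back: `.[]` emits a stream of separate JSON documents, whereas `map(...)` keeps the results inside one array. The sketch below illustrates the difference on a hypothetical two-card stand-in for out.json that is cut down to the date field only.

```shell
#!/usr/bin/env bash
# Hypothetical two-card stand-in for out.json, reduced to the date field.
sample='[
  {"children": [{}, {"children": [{}, {}, {"tag": "time", "text": "September 08, 2018"}]}]},
  {"children": [{}, {"children": [{}, {}, {"tag": "time", "text": "September 09, 2018"}]}]}
]'

# .[] emits one JSON document per card, back to back (a stream, not one value).
stream=$(printf '%s' "$sample" | jq -c '.[] | {date: .children[1].children[2].text}')

# map(...) keeps the results inside a single array, which is one valid document.
array=$(printf '%s' "$sample" | jq -c 'map({date: .children[1].children[2].text})')

printf 'stream:\n%s\narray:\n%s\n' "$stream" "$array"
```

Each line of the stream is valid JSON by itself, so a consumer that expects a single document would need the array form (or jq's `-s` option to slurp the stream back into an array).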
To work around this I then run a series of sed commands to remove the lines I don't need from the jq output.
Below is what I would like my initial jq query to return; I am just unsure how to get this specific data returned.
 { 
   "1" : {
            "album": "Album Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist Name",
            "date": "September 08, 2018"
   },
   "2" : {
            "album": "Album1 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist1 Name",
            "date": "September 08, 2018"
   },
   "3" : {
            "album": "Album2 Title",
            "href": "/bhangra/album/view/2842847/title-of-album/",
            "artist": "Artist2 Name",
            "date": "September 09, 2018"
   }
 }
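One way this shape might be produced, sketched under the assumption that every card has the same layout as the sample (the info block at `.children[1]`, with title, artist and date at positions 0, 1 and 2): number the cards with `to_entries`, build one flat object per card, and fold them back into a single object with `from_entries`. The `sample` variable is a hypothetical trimmed stand-in for out.json, and `$info` is just an illustrative name.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for out.json: two cards, trimmed to the "scene-card-info" part.
sample='[
  {"tag": "div", "children": [
    {"tag": "div", "class": "card-image"},
    {"tag": "div", "class": "scene-card-info", "children": [
      {"tag": "div", "children": [{"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album Title"}]},
      {"tag": "div", "children": [{"tag": "a", "text": "Artist Name"}]},
      {"tag": "time", "text": "September 08, 2018"}
    ]}
  ]},
  {"tag": "div", "children": [
    {"tag": "div", "class": "card-image"},
    {"tag": "div", "class": "scene-card-info", "children": [
      {"tag": "div", "children": [{"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album1 Title"}]},
      {"tag": "div", "children": [{"tag": "a", "text": "Artist1 Name"}]},
      {"tag": "time", "text": "September 08, 2018"}
    ]}
  ]}
]'

result=$(printf '%s' "$sample" | jq '
  to_entries                                # [{key: 0, value: card}, ...]
  | map({
      key: (.key + 1 | tostring),           # object keys must be strings: "1", "2", ...
      value: (.value.children[1] as $info   # the scene-card-info div
              | {
                  album:  $info.children[0].children[0].text,
                  href:   $info.children[0].children[0].href,
                  artist: $info.children[1].children[0].text,
                  date:   $info.children[2].text
                })
    })
  | from_entries                            # one object keyed by card number
')

printf '%s\n' "$result"
```

Against the real file the same filter would be passed out.json as an argument instead of the piped sample. Because it relies on fixed positions, a card with a different layout would yield null for the affected fields.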
UPDATE EDIT 11/09/2018
I have made some progress on this. Using the query below I have managed to pull back the data I require; however, the results still come back as separate objects.
 cat out.json | jq '.[] | .children[1].children[0].children[], .children[1].children[1].children[], .children[1].children[2] | {WTF: .title, href, text}'
This outputs the following, which gets me slightly closer to what I want (the last example above).
 {
   "WTF": "Album Title",
   "href": "/bhangra/album/view/2842847/title-of-album/",
   "text": "Album Title"
 }
 {
   "WTF": "Artist Name",
   "href": "/bhangra/profile/view/2842847/artist-name/",
   "text": "Artist Name"
 }
 {
   "WTF": null,
   "href": null,
   "text": "September 08, 2018"
 }
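The three separate objects above appear to come from the comma operator, which emits one result per branch, so the title link, the artist link and the time element are each formatted independently. A minimal sketch of one alternative is to build a single object per card and pull each field from the element that carries it. The variant below also looks elements up by class rather than by position, on the assumption that the class names `scene-card-title` and `model-names` are stable across cards; `by_class` is a hypothetical helper defined inside the filter, not a jq builtin, and `sample` is a trimmed stand-in for out.json.

```shell
#!/usr/bin/env bash
# Hypothetical one-card stand-in for out.json, keeping the class names from the real output.
sample='[
  {"tag": "div", "class": "bhangra-card scene ", "children": [
    {"tag": "div", "class": "card-image"},
    {"tag": "div", "class": "scene-card-info", "children": [
      {"tag": "div", "class": "scene-card-title", "children": [
        {"tag": "a", "href": "/bhangra/album/view/2842847/title-of-album/", "text": "Album Title"}]},
      {"tag": "div", "class": "model-names", "children": [
        {"tag": "a", "href": "/bhangra/profile/view/2842847/artist-name/", "text": "Artist Name"}]},
      {"tag": "time", "text": "September 08, 2018"}
    ]}
  ]}
]'

cards=$(printf '%s' "$sample" | jq '
  # Hypothetical helper: first descendant object whose class equals $c exactly.
  def by_class($c): first(.. | objects | select(.class == $c));

  map({
    album:  (by_class("scene-card-title") | .children[0].text),
    href:   (by_class("scene-card-title") | .children[0].href),
    artist: (by_class("model-names")      | .children[0].text),
    date:   (first(.. | objects | select(.tag == "time")) | .text)
  })
')

printf '%s\n' "$cards"
```

One caveat with this approach: if a card has no element with the expected class, `first` produces no result and that card is dropped from the output rather than reported with null fields.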