AppleScript Screen Scraping to Excel Example

NEW! AppleScript Maker Beta: Hey Everyone, I am working on a new AppleScript tool that is going to BLOW YOUR MIND! After reading this tutorial use this tool to easily build your AppleScripts. Click Here!

In previous tutorials we learned how to extract information from a web page. In this tutorial I am going to show you another real world way to extract information and then output it to an excel document.

If have not read my previous  tutorials on clicking, extracting, and inputting data from a web page, this will not make much sense. Please view these first!

In this example we are going to be grabbing the previous episode number of a few top television shows and then outputting them to an Excel sheet.

Scraping the Text

First we are going to go to next-episode.net and inspect the element for the first top show Game of Thrones.

Screen Shot 2015-06-09 at 2.10.26 PM

 

It looks like there is no ID, class, or name element… but we can grab it by the “a” attribute tag.

Here is a nifty solution I came up with to find the element and then click on it.

 

First place this at the top of your AppleScript doc.

to getInputByTag(theTag, num) -- defines a function with two inputs, theTag and num

 tell application "Safari" --tells AS that we are going to use Safari

 set input to do JavaScript "

document.getElementsByTagName('" & theTag & "')[" & num & "].innerHTML;" in document 1

end tell

return input

end getInputByTag

 

Also place this at the top of your doc, it will let you click on an element by tag name:

to clicktagName(thetagName, elementnum)

tell application "Safari"

do JavaScript "document.getElementsByTagName('" & thetagName & "')[" & elementnum & "].click();" in document 1

end tell

end clicktagName

Next place this in your code to search for and click on an element:

set x to 0 -- this is our test variable

set tvshow to "Game of Thrones"

repeat 1000 times -- this is going to repeat with every element number until what we are searching for is found

 set link to getInputByTag("a", x) -- this is going to retrieve the input of the element with tag a and the element number x

 if tvshow is in link then -- checks to see if Game of Thrones is in the output

 exit repeat -- if its found then we stop looking

end if

 set x to x + 1 -- if its not found we continue to the next element

end repeat

clicktagName("a", x) -- clicks the tag that was found to have game of thrones in it

 

Now, when you run this script it will open the page for Game of Thrones.

Next we need to take a look at our desired information… the previous episode.

Screen Shot 2015-06-09 at 2.23.23 PM

It looks like, again, there is no way to grab the information directly. There is no name, class, or ID to reference for us to pull.

So… in this case we need to back up and grab the id from the div that contains what we are looking for. Here we are going to use the id=”previous_episode”

To do this, input the following code into the top of your AppleScript:

to getInputById(theId)

tell application "Safari"

 set input to do JavaScript "

document.getElementById('" & theId & "').innerHTML;" in document 1

end tell

return input

end getInputById

 

Next we add the following code to our Applescript under what we already have. I would add a delay of a second or two to make time for the page to load.

 

delay 2

getInputById("previous_episode")

When we run this we get this whole mess of HTML it should look like this:

<div class=”headline” style=”padding-left:0px;”><h2>Previous Episode</h2></div>

<div></div>

<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”><h3>Name:</h3></div><div style=”display:inline-block;width:260px;margin-left:90px;”>The Dance of Dragons</div></div>

<div></div>

<div class=”subheadline”><h3>Date:</h3></div>Sun Jun 07, 2015<div></div>

<div class=”subheadline”><h3>Season:</h3></div>5<div></div>

<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”>Episode:</div><div style=”display:inline-block;width:260px;margin-left:90px;”>9</div></div>

<div></div>

<div style=”margin-top:5px;”><div class=”subheadline” style=”margin-top:0px;top:1;position:absolute”>Summary:</div><div id=”lastEpSummary” style=”display:inline-block;width:260px;margin-left:90px;”><a href=”javascript: void(0)” onclick=”getSummary2(‘lastEpSummary’, ‘http://www.tvrage.com/Game_of_Thrones/episodes/1065797704’)”>Episode Summary</a></div></div>

<div></div>

 

If we take a look at this we find the following with our desired information right in the middle.

Episode:</div><div style=”display:inline-block;width:260px;margin-left:90px;”>9</div></div>

Next we are going to run our extraction script to pull out the episode number, place this at the top of your page:

to extractBetween(SearchText, startText1, startText2, endText)

 set tid to AppleScript's text item delimiters

 set AppleScript's text item delimiters to startText1

 set endItems to text item -1 of SearchText

 set AppleScript's text item delimiters to endText

 set beginningToEnd to text item 1 of endItems

 set AppleScript's text item delimiters to startText2

set finalText to (text items 2 thru -1 of beginningToEnd) as text

 set AppleScript's text item delimiters to tid

 return finalText

end extractBetween

Next place this code in your Script:

 

set prevEpisodeDiv to getInputById("previous_episode") -- grabs mass of HTML

set startText1 to "Episode:</div>" -- looks for something that says Episode

set startText2 to "<div style="display:inline-block;width:260px;margin-left:90px;">" -- looks for something that matches this line of text after episode

set endText to "</div>" -- looks for </div> after start text 1 and 2

set prevEpisodeNumber to extractBetween(prevEpisodeDiv, startText1, startText2, endText) --grabs what is inbetween the start text and the end text which is our episode number!

 

Using AppleScript to Write to Excel

Next we want to output this to an Excel doc. So we will tell applescript to open Excel, make a new doc, add headers, and paste the show and previous episode number into cells A2 and B2:

Use this script to accomplish this:

tell application "Microsoft Excel"

activate

make new document

set value of cell ("a1") to "TV Show"

set value of cell ("b1") to "Previous Episode Number"

set value of cell ("a2") to tvshow

set value of cell ("b2") to prevEpisodeNumber

end tell

 

Screen Shot 2015-06-09 at 2.46.46 PM

 

Awesome right??

 

Automating the Scrape

 

Ok this is cool… but this isn’t really automation if we have to do this for every page…

Now here is where the fun begins with AppleScript’s repeat function. In order to do a list of shows we need to reconfigure the order of the applescript a bit and add in a few things.

 

First we are going to change the “Make the new Excel document” part of our script and place first instead of last…Keep the whole writing to the excel part at the bottom of the script. (below is a full example of what the finished script should look like)

QUICK EXCEL TUTORIAL

After that we are going to make a list of all the shows we want to grab. I am going to do this by copying the sidebar of the site to an Excel doc. Paste the shows into cells A2 and below, we need a blank cell in order to do the formula.

Screen Shot 2015-06-09 at 3.06.13 PM

Next copy this formula into cell B2 and copy it down. (Click on cell B2 and double click on the box at the bottom right corner)

=CONCATENATE(B1,””””,A2,””””,”,”)

Now copy the last cell in column B that you are using.

Screen Shot 2015-06-09 at 3.08.43 PM

And in the top of your AppleScript doc write this:

set tvShows to {}

Then paste your list into the middle of the two brackets. Make sure to delete the comma for the last item in our list.

It should now look like this:

set tvShows to {"Game of Thrones", "The Big Bang Theory", "The Walking Dead", "Homeland", "Suits", "Arrow", "True Blood", "Modern Family", "Sherlock", "Supernatural", "Under the Dome", "New Girl", "Marvel's Agents of S.H.I.E.L.D.", "The Blacklist", "The Vampire Diaries", "Person of Interest", "Grey's Anatomy", "Vikings", "Once Upon a Time"}

Now above the line that says:

set x to 0

You are going to write:

repeat with tvshow in tvShows

delete the line below that that says:

set tvshow to "Game of Thrones"

And for the very last line in your script write:

end repeat

 

Ok sorry that was a lot really fast…

What we are doing here is setting the script to go through our list of shows one by one.

The last thing we need to do is set up Excel to move to the next line every time a new show’s information is pulled.

To do this write this code above repeat with tvshow in tvShows :

set y to 2

Then change the part where you are writing your info to excel to to this:

 

tell application "Microsoft Excel"

set value of cell ("a" & y) to tvshow

set value of cell ("b" & y) to prevEpisodeNumber

end tell

And finally right above end repeat write:

set y to y + 1

And you are done!

I know that last part kind of got hairy… so here is a recap of what the final script should look like. I hope you found this example useful!

to getInputById(theId)

tell application "Safari"

 set input to do JavaScript "

document.getElementById('" & theId & "').innerHTML;" in document 1

end tell

return input

end getInputById

to clicktagName(thetagName, elementnum)

tell application "Safari"

do JavaScript "document.getElementsByTagName('" & thetagName & "')[" & elementnum & "].click();" in document 1

end tell

end clicktagName

to getInputByTag(theTag, num) -- defines a function with two inputs, theTag and num

 tell application "Safari" --tells AS that we are going to use Safari

 set input to do JavaScript "

document.getElementsByTagName('" & theTag & "')[" & num & "].innerHTML;" in document 1

end tell

return input

end getInputByTag

to extractBetween(SearchText, startText1, startText2, endText)

 set tid to AppleScript's text item delimiters

 set AppleScript's text item delimiters to startText1

 set endItems to text item -1 of SearchText

 set AppleScript's text item delimiters to endText

 set beginningToEnd to text item 1 of endItems

 set AppleScript's text item delimiters to startText2

set finalText to (text items 2 thru -1 of beginningToEnd) as text

 set AppleScript's text item delimiters to tid

 return finalText

end extractBetween

tell application "Microsoft Excel"

activate

make new document

set value of cell ("a1") to "TV Show"

set value of cell ("b1") to "Previous Episode Number"

end tell

set tvShows to {"Game of Thrones", "The Big Bang Theory", "The Walking Dead", "Homeland", "Suits", "Arrow", "True Blood", "Modern Family", "Sherlock", "Supernatural", "Under the Dome", "New Girl", "Marvel's Agents of S.H.I.E.L.D.", "The Blacklist", "The Vampire Diaries", "Person of Interest", "Grey's Anatomy", "Vikings", "Once Upon a Time"}

set y to 2

repeat with tvshow in tvShows

 set x to 0 -- this is our test variable

 repeat 1000 times -- this is going to repeat with every element number until what we are searching for is found

 set link to getInputByTag("a", x) -- this is going to retrieve the input of the element with tag a and the element number x

 if tvshow is in link then -- checks to see if Game of Thrones is in the output

 exit repeat -- if its found then we stop looking

end if

 set x to x + 1 -- if its not found we continue to the next element

end repeat

clicktagName("a", x) -- clicks the tag that was found to have game of thrones in it

delay 2

set prevEpisodeDiv to getInputById("previous_episode") -- grabs mass of HTML

 set startText1 to "Episode:</div>" -- looks for something that says Episode

set startText2 to "<div style="display:inline-block;width:260px;margin-left:90px;">" -- looks for something that matches this line of text after episode

 set endText to "</div>" -- looks for </div> after start text 1 and 2

 set prevEpisodeNumber to extractBetween(prevEpisodeDiv, startText1, startText2, endText) --grabs what is inbetween the start text and the end text which is our episode number!

tell application "Microsoft Excel"

set value of cell ("a" & y) to tvshow

set value of cell ("b" & y) to prevEpisodeNumber

end tell

set y to y + 1

end repeat

Leave a Reply

Your email address will not be published. Required fields are marked *