From time to time, I’ve needed to grab specific pieces of information from log files, HTML pages, and data emailed to me after being rendered horribly by Outlook/Word/Excel. Here’s an example of how I managed to sanitize my data in one such incident: I needed the URLs of the images in an HTML file, so I ran the following (on Windows, using Git for Windows!):
cat index.html | tr '\t\r\n' ' ' | tr -s ' ' | sed 's/\"http/\nhttp/gi' | cut -d '"' -f1 | grep "http"
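To see the whole pipeline in action, here’s a self-contained sketch using a made-up sample.html standing in for the real index.html (the URLs in it are hypothetical). Note that the \n in the sed replacement relies on GNU sed, which is what Git for Windows ships:

```shell
# Hypothetical stand-in for index.html, just to make the example runnable
printf '<html>\n<img src="http://example.com/a.png">\n<img\tsrc="HTTP://example.com/b.jpg" alt="x">\n</html>\n' > sample.html

# Same pipeline as above; \n in the sed replacement needs GNU sed
cat sample.html \
  | tr '\t\r\n' ' ' \
  | tr -s ' ' \
  | sed 's/\"http/\nhttp/gi' \
  | cut -d '"' -f1 \
  | grep "http"
```

On this input it prints http://example.com/a.png and http://example.com/b.jpg, one per line. Let’s break down what each stage does.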
tr '\t\r\n' ' '
replaces all tabs, carriage returns, and newlines with spaces. This turns the file into one continuous line.
tr -s ' '
squeezes each run of repeated spaces down to a single space.
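A minimal sketch of what these first two steps do, on a made-up two-line input:

```shell
# Two lines containing a tab and repeated spaces, flattened to one clean line
printf 'a\tb\nc    d\n' | tr '\t\r\n' ' ' | tr -s ' '
```

This prints `a b c d ` as a single line, with a trailing space where the final newline used to be.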
sed 's/\"http/\nhttp/gi'
inserts a newline before every occurrence of “http that follows a double quote (case-insensitively), splitting the content into lines starting with http. The resulting output at this point still contains excess content after each link, which we deal with in the next step.
cut -d '"' -f1
splitting each line on double quotes, we keep only the first field — the URL itself.
grep "http"
just gives us the filtered view of only the http URLs, minus the leading line of leftover content that the sed command spat out.
This all assumes the URLs all start with http. You’ll likely still get some data you don’t want from the above, but it should give you the gist of what you need to do to adapt it specifically to your task. Also consider looking into awk, sort, and uniq.
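As a rough sketch of where sort, uniq, and awk could take this, here’s a hypothetical follow-up that dedupes the extracted URLs and counts how often each one appeared (urls.txt is an assumed file holding the pipeline’s output):

```shell
# Hypothetical output of the URL-extraction pipeline, saved to a file
printf 'http://example.com/a.png\nhttp://example.com/b.jpg\nhttp://example.com/a.png\n' > urls.txt

# sort groups identical lines together, uniq -c prefixes each unique line
# with its count, and awk swaps the columns so the URL comes first
sort urls.txt | uniq -c | awk '{print $2, $1}'
```

On this input it prints http://example.com/a.png 2 and http://example.com/b.jpg 1, each on its own line.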