Module: Gimme
Defined in:
- lib/gimme_poc.rb
- lib/gimme_poc/poc.rb
- lib/gimme_poc/web.rb
- lib/gimme_poc/save.rb
- lib/gimme_poc/version.rb
- lib/gimme_poc/questions.rb
- lib/gimme_poc/contactpage.rb
Overview
Finds contact information for a list of urls.
Defined Under Namespace
Classes: Search
Constant Summary
- PHONE_REGEX =
Simple regex that looks for ###.#### or ###-####
/(\d{3}[-]\d{4}|\d{3}[.]\d{4})/
- HTTP_REGEX =
Captures http:// and https://
%r{(\A\bhttps:\/\/|\bhttp:\/\/)}
- VERSION =
'0.0.5'
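To make the two patterns concrete, here is a minimal sketch that copies them into standalone helpers. The helper names `phone_number_in` and `strip_scheme` and the sample strings are made up for illustration and are not part of Gimme.

```ruby
# Hypothetical helpers wrapping the two patterns from the constant summary.
def phone_number_in(text)
  phone_regex = /(\d{3}[-]\d{4}|\d{3}[.]\d{4})/ # same pattern as PHONE_REGEX
  text[phone_regex] # first match, or nil when nothing matches
end

def strip_scheme(url)
  http_regex = %r{(\A\bhttps:\/\/|\bhttp:\/\/)} # same pattern as HTTP_REGEX
  url.gsub(http_regex, '')
end

# Seven digits split 3/4 by a dash or a period match; a space separator does not.
p phone_number_in('Call 555-1234 today')
p phone_number_in('Call 555.1234 today')
p phone_number_in('Call 555 1234 today')

# Removing the scheme leaves the bare host and path.
puts strip_scheme('https://example.com/contact')
puts strip_scheme('http://example.com')
```

Note that the phone pattern is not anchored, so it can also match seven digits embedded in a longer number.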
Class Attribute Summary
- .contact ⇒ Object
  Returns the value of attribute contact.
- .contact_links ⇒ Object
  Returns the value of attribute contact_links.
- .page ⇒ Object
  Returns the value of attribute page.
- .url ⇒ Object
  Returns the value of attribute url.
Class Method Summary
- .blind_test(url) ⇒ Object
  Blindly tests whether a url goes through.
- .contact_page(url) ⇒ Object
  Looks for contact page.
- .contactform_available? ⇒ Boolean
  Boolean, returns true if a form is present on the page.
- .delete_failures(hsh) ⇒ Object
  Remove negatives from the contacts hash.
- .email_available? ⇒ Boolean
  Boolean, returns true if email is present.
- .english_contact_page(url) ⇒ Object
  Looks for English page.
- .format_url(str) ⇒ Object
  Mechanize needs absolute urls to work.
- .get(str) ⇒ Object
  Go to a page using Mechanize.
- .go_to_contact_page(url) ⇒ Object
  Scans for contact page.
- .link_with_href(str) ⇒ Object
  Expects relative paths and merges everything.
- .memory ⇒ Object
  Convenience method.
- .merged_link(url_str) ⇒ Object
  Used in case of relative paths.
- .orig_domain(str) ⇒ Object
  Outputs domain of a url.
- .phone_available? ⇒ Boolean
  Boolean, returns true if phone number is present.
- .poc(arr) ⇒ Object
  The main method! Takes array of urls and gets contact info for each if possible.
- .reset! ⇒ Object
  Clears entire collection.
- .save_available_contacts(url, hsh = scan_for_contacts) ⇒ Object
  Saves any available contact info to @contact_links.
- .save_link(key, url) ⇒ Object
  Used in save_available_contacts to save each valid link.
- .scan_for_contacts ⇒ Object
  Returns anything that is possible to save, otherwise returns nil.
- .something_to_save?(hsh) ⇒ Boolean
  Boolean, returns true if anything is present after running scan_for_contacts and deleting failures.
- .start_contact_links ⇒ Object
  Starts/Restarts the @contact_links hash.
- .subdomain?(str) ⇒ Boolean
  Boolean, returns true if url is not identical to original domain.
- .unformat_url(str) ⇒ Object
  Used for subdomain check.
Class Attribute Details
.contact ⇒ Object
Returns the value of attribute contact.
# File 'lib/gimme_poc.rb', line 14
def contact
  @contact
end
.contact_links ⇒ Object
Returns the value of attribute contact_links.
# File 'lib/gimme_poc.rb', line 14
def contact_links
  @contact_links
end
.page ⇒ Object
Returns the value of attribute page.
# File 'lib/gimme_poc.rb', line 14
def page
  @page
end
.url ⇒ Object
Returns the value of attribute url.
# File 'lib/gimme_poc.rb', line 14
def url
  @url
end
Class Method Details
.blind_test(url) ⇒ Object
Blindly tests whether a url goes through. If there is a 404 error, this will return nil.
TODO: Sometimes DNS will do a redirect and not give a 404, so redirects need to be prevented.
# File 'lib/gimme_poc/web.rb', line 86
def blind_test(url)
  puts "\n(blind testing: #{url})"
  get(url)
end
.contact_page(url) ⇒ Object
Looks for a contact page and gets it if a contact link is available. If no contact link is available, it will blind test ‘../contact’ and, if that fails, get the original domain. Returns nil if nothing can be found.
# File 'lib/gimme_poc/contactpage.rb', line 17
def contact_page(url)
  puts 'now looking for contact pages'
  contact_link = link_with_href(/contact|Contact/)
  contact_test_page = merged_link('../contact')

  case
  when !contact_link.nil?
    puts "#{'Success:'.green} Found contact link!\n"
    get(merged_link(contact_link))
  else
    puts "#{'Warning:'.yellow} couldn't find contact link"
    blind_test(contact_test_page) || get(orig_domain(url))
  end
end
.contactform_available? ⇒ Boolean
Boolean, returns true if a form is present on the page.
Only forms with more than one field count. Forms with one field are typically search boxes.
TODO: build a better conditional to prevent false positives. There could be other forms like newsletter signup, etc.
# File 'lib/gimme_poc/questions.rb', line 29
def contactform_available?
  !(page.forms.select { |x| x.fields.length > 1 }.empty?)
end
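The field-count rule can be tried without Mechanize. The sketch below is only an approximation: the helper name `multi_field_form?` and the `Struct` stand-in for a form object are assumptions, and real Mechanize forms carry much more state.

```ruby
# Hypothetical helper applying the same "more than one field" rule to any
# list of objects that respond to #fields.
def multi_field_form?(forms)
  forms.any? { |form| form.fields.length > 1 }
end

form = Struct.new(:fields) # stand-in for a Mechanize form

search_box   = form.new(%w[q])
contact_form = form.new(%w[name email message])

p multi_field_form?([search_box])
p multi_field_form?([search_box, contact_form])
```

As the TODO points out, a two-field newsletter signup would also pass this check.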
.delete_failures(hsh) ⇒ Object
Removes negatives from the contacts hash. Deletes any key-value pair whose value is either nil or 'false'. Remember that 'false' is a string here, because scan_for_contacts interpolates its booleans into strings.
# File 'lib/gimme_poc/save.rb', line 38
def delete_failures(hsh)
  hsh.delete_if { |_k, v| v.nil? || v == 'false' }
end
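A short standalone run shows which entries survive. The method body is copied from the listing above; the sample hash is invented for the example.

```ruby
# Standalone copy of the filtering step; the sample data is hypothetical.
def delete_failures(hsh)
  hsh.delete_if { |_k, v| v.nil? || v == 'false' }
end

scanned = {
  contactpage: 'http://example.com/contact',
  email_present: 'true',
  phone_present: 'false', # a string, not the boolean false
  twitter: nil
}

p delete_failures(scanned) # only contactpage and email_present remain
p scanned.keys             # delete_if mutates the hash it is given
```

Because the comparison is against the string 'false', a value of boolean `false` would be kept.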
.email_available? ⇒ Boolean
Boolean, returns true if email is present.
# File 'lib/gimme_poc/questions.rb', line 12
def email_available?
  !link_with_href('mailto').nil?
end
.english_contact_page(url) ⇒ Object
Looks for an English page. Gets the page if available, then looks for an English contact page.
If no English link is available, it will blind test ‘../en’ and ‘../english’. Returns nil if nothing can be found.
# File 'lib/gimme_poc/contactpage.rb', line 39
def english_contact_page(url)
  puts "\nLooking for english page..."
  english_link = page.link_with(href: %r{en\/|english|English})
  test_en_page = merged_link('../en')
  test_english_page = merged_link('../english')

  case
  when !english_link.nil?
    puts "#{'Success:'.green} found english link!"
    get(merged_link(english_link.uri))
  else
    blind_test(test_en_page) || blind_test(test_english_page)
    puts "\n(restarting)\n"
    contact_page(url)
  end
end
.format_url(str) ⇒ Object
Mechanize needs absolute urls to work. If http:// or https:// isn’t present, http:// is prepended.
# File 'lib/gimme_poc/web.rb', line 36
def format_url(str)
  LazyDomain.autohttp(str)
end
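The described behavior can be sketched without the LazyDomain dependency. The helper `with_scheme` below is a hypothetical stand-in written only from the description above; it is not LazyDomain's implementation and may differ from it in edge cases.

```ruby
# Hypothetical stand-in for LazyDomain.autohttp, based only on the description:
# leave urls that already have a scheme alone, otherwise prepend http://.
def with_scheme(str)
  str.start_with?('http://', 'https://') ? str : "http://#{str}"
end

puts with_scheme('example.com')
puts with_scheme('https://example.com')
```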
.get(str) ⇒ Object
Goes to a page using Mechanize. Sleeps for 0.1 seconds before each request so as not to overload any servers.
Returns nil if a bad url is given.
# File 'lib/gimme_poc/web.rb', line 9
def get(str)
  url = format_url(str)
  puts "sending GET request to: #{url}"
  sleep(0.1)
  @page = Mechanize.new do |a|
    a.user_agent_alias = 'Mac Safari'
    a.open_timeout = 7
    a.read_timeout = 7
    a.idle_timeout = 7
    a.redirect_ok = true
  end.get(url)

rescue Mechanize::ResponseCodeError => e
  puts "#{'Response Error:'.red} #{e}"
rescue SocketError => e
  puts "#{'Socket Error:'.red} #{e}"
rescue Net::OpenTimeout => e
  puts "#{'Connection Timeout:'.red} #{e}"
rescue Errno::ETIMEDOUT => e
  puts "#{'Connection Timeout:'.red} #{e}"
rescue Net::HTTP::Persistent::Error
  puts "#{'Connection Timeout:'.red} read timeout, too many resets."
end
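The nil return on failure follows from the rescue clauses: each one ends with `puts`, and `puts` returns nil. The sketch below isolates that pattern without Mechanize; `fetch_or_nil` and the simulated error are assumptions made for the example.

```ruby
require 'socket' # defines SocketError

# Hypothetical helper mirroring the rescue structure of `get`, without Mechanize.
def fetch_or_nil
  yield
rescue SocketError => e
  puts "Socket Error: #{e}" # puts returns nil, so the method returns nil here
end

ok     = fetch_or_nil { 'page body' }
failed = fetch_or_nil { raise SocketError, 'simulated lookup failure' }

p ok
p failed
```

Callers such as `poc` can then test the result with `.nil?` and skip the url.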
.go_to_contact_page(url) ⇒ Object
Scans for a contact page. If it doesn’t work on the first try, it will look for English versions and try again. Processes left to right.
Returns nil if no contact page can be found.
# File 'lib/gimme_poc/contactpage.rb', line 9
def go_to_contact_page(url)
  contact_page(url) || english_contact_page(url)
end
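"Left to right" here is the short-circuit behavior of `||`: the right-hand side only runs when the left-hand side returns nil. A minimal sketch, with lambdas standing in for `contact_page` and `english_contact_page` (the helper name `first_page` is hypothetical):

```ruby
# Hypothetical helper showing the same `a || b` shape as go_to_contact_page.
def first_page(primary, fallback)
  primary.call || fallback.call # fallback only runs when primary returns nil
end

calls = []
contact_hit  = -> { calls << :contact; 'contact page' }
contact_miss = -> { calls << :contact; nil }
english      = -> { calls << :english; 'english contact page' }

p first_page(contact_hit, english)
p calls # the english lookup was never called
calls.clear
p first_page(contact_miss, english)
p calls # both lookups ran, in order
```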
.link_with_href(str) ⇒ Object
Expects relative paths and merges everything. Returns a string, or nil if there is no matching link.
A \b word boundary is added in front of the pattern so that the match must begin at the start of a word.
# File 'lib/gimme_poc/web.rb', line 70
def link_with_href(str)
  merged_link(page.link_with(href: /\b#{str}/).uri.to_s)
rescue
  nil
end
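The effect of the leading `\b` can be checked on plain strings. This sketch builds the same `/\b#{str}/` pattern and applies it to a few made-up hrefs; `starts_word_match?` is a hypothetical helper, not part of Gimme.

```ruby
# Hypothetical helper: builds the pattern link_with_href uses and tests an href.
def starts_word_match?(href, str)
  !(href =~ /\b#{str}/).nil?
end

p starts_word_match?('/contact', 'contact')                 # true
p starts_word_match?('/contact-us', 'contact')              # true
p starts_word_match?('/mycontact', 'contact')               # false
p starts_word_match?('mailto:info@example.com', 'mailto')   # true
```

The boundary only constrains where the match starts, so `/contact-us` still matches.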
.memory ⇒ Object
Convenience method.
# File 'lib/gimme_poc.rb', line 54
def memory
  Search.all_sites
end
.merged_link(url_str) ⇒ Object
Used in case of relative paths. Merging guarantees correct url. This needs a url string as argument to work. Produces a merged uri string.
# File 'lib/gimme_poc/web.rb', line 61
def merged_link(url_str)
  page.uri.merge(url_str).to_s
end
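Merging can be illustrated with Ruby's standard `URI` library. In Gimme the base comes from `page.uri`; in this sketch it is passed in explicitly, and the helper name `merge_against` and the example urls are assumptions.

```ruby
require 'uri'

# Hypothetical stand-in for merged_link that takes the base url as an argument
# instead of reading it from the current Mechanize page.
def merge_against(base_url, url_str)
  URI(base_url).merge(url_str).to_s
end

base = 'http://example.com/about/team'

puts merge_against(base, '../contact')              # http://example.com/contact
puts merge_against(base, 'contact.html')            # http://example.com/about/contact.html
puts merge_against(base, 'https://other.example/x') # https://other.example/x
```

An already absolute url passes through unchanged, which is why `merged_link` can safely be applied to links of either kind.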
.orig_domain(str) ⇒ Object
Outputs domain of a url. Useful if subdomains are given to GimmePOC and they don’t work.
For example: Given maps.google.com, returns ‘google.com’.
# File 'lib/gimme_poc/web.rb', line 51
def orig_domain(str)
  LazyDomain.parse(str).domain
rescue PublicSuffix::DomainInvalid => e
  puts "#{'Invalid Domain:'.red} #{e}"
end
.phone_available? ⇒ Boolean
Boolean, returns true if phone number is present.
# File 'lib/gimme_poc/questions.rb', line 17
def phone_available?
  !(page.body =~ PHONE_REGEX).nil?
end
.poc(arr) ⇒ Object
The main method! Takes an array of urls and gets contact info for each if possible. If a url is bad, the ‘get’ method returns nil and the url is skipped over.
# File 'lib/gimme_poc.rb', line 26
def poc(arr)
  arr = arr.split unless arr.is_a?(Array)
  arr.each do |url|
    puts '-' * 50
    puts "starting: #{url}"
    unless LazyDomain.valid?(url)
      puts "#{'Invalid Domain:'.red} `#{url}' is not a valid domain"
      next
    end
    case
    when subdomain?(url)
      puts '(This url is a subdomain. Will try both sub and root domain.)'
      next if get(url).nil? && get(orig_domain(url)).nil?
    else
      next if get(url).nil?
    end
    start_contact_links
    mechpage = go_to_contact_page(url)
    if mechpage.nil?
      puts '(empty page, exiting.)'
    else
      save_available_contacts(mechpage.uri.to_s)
    end
  end
  Search.all_sites # Return results from all sites.
end
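The first line of `poc` means a whitespace-separated string is accepted as well as an array. The sketch below isolates just that normalization step; `normalize_urls` is a hypothetical name and the urls are placeholders.

```ruby
# Hypothetical helper isolating the input handling at the top of `poc`.
def normalize_urls(arg)
  arg.is_a?(Array) ? arg : arg.split # String#split with no argument splits on whitespace
end

p normalize_urls(%w[example.com example.org])
p normalize_urls("example.com  example.org\nexample.net")
```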
.reset! ⇒ Object
Clears entire collection.
# File 'lib/gimme_poc.rb', line 59
def reset!
  Search.all_sites = []
end
.save_available_contacts(url, hsh = scan_for_contacts) ⇒ Object
Saves any available contact info to @contact_links.
# File 'lib/gimme_poc/save.rb', line 43
def save_available_contacts(url, hsh = scan_for_contacts)
  if something_to_save?(hsh)
    puts "\nsaving available contact information from #{url}"
    if hsh.is_a?(Hash)
      hsh.each do |k, v|
        save_link(k, v) # saves to @contact_links
      end
      delete_failures(@contact_links)
      puts "#{@contact_links}".cyan # same as @contact_links
    else
      fail ArgumentError, "expected hash but got #{hsh.class}"
    end
    Search::POC.new(url, @contact_links)
  else
    puts '(nothing to save)'
    return
  end
end
.save_link(key, url) ⇒ Object
Used in save_available_contacts to save each valid link.
# File 'lib/gimme_poc/save.rb', line 29
def save_link(key, url)
  return if key.nil? || url.nil?
  @contact_links[key] = url
end
.scan_for_contacts ⇒ Object
Returns a hash of anything that is possible to save; entries with no matching link are nil. Booleans for phone, email, and contact form are stored as the strings 'true' or 'false'.
Periods are added to the link hrefs to prevent false positives. They must be escaped with a backslash, or else each one acts as a regex wildcard.
# File 'lib/gimme_poc/save.rb', line 9
def scan_for_contacts
  {
    contactpage: link_with_href('contact'),
    email_present: "#{email_available?}",
    phone_present: "#{phone_available?}",
    contact_form: "#{contactform_available?}",
    facebook: link_with_href('facebook\.'),
    twitter: link_with_href('twitter\.'),
    youtube: link_with_href('youtube\.'),
    googleplus: link_with_href('plus\.google\.'),
    linkedin: link_with_href('linkedin\.')
  }
end
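Two details of this hash are easy to miss: the escaped period, and the booleans being interpolated into strings. The sketch below demonstrates both on made-up inputs; `href_hit?` and `flag` are hypothetical helpers.

```ruby
# Hypothetical helper: reports whether an href matches the pattern built from `str`,
# using the same /\b#{str}/ construction as link_with_href.
def href_hit?(href, str)
  href.match?(/\b#{str}/)
end

# Hypothetical helper: the same interpolation scan_for_contacts applies to its booleans.
def flag(bool)
  "#{bool}"
end

p href_hit?('https://www.facebook.com/acme', 'facebook\.') # true
p href_hit?('/facebook-login', 'facebook\.')               # false: a literal period is required
p href_hit?('/facebook-login', 'facebook.')                # true: unescaped period matches '-'

p flag(false) # "false", a non-empty string and therefore truthy in Ruby
```

Since the string 'false' is truthy, a plain truthiness check would not filter it out, which is why `delete_failures` compares against 'false' explicitly.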
.something_to_save?(hsh) ⇒ Boolean
Boolean, returns true if anything is present after running scan_for_contacts and deleting failures.
# File 'lib/gimme_poc/questions.rb', line 7
def something_to_save?(hsh)
  delete_failures(hsh).any?
end
.start_contact_links ⇒ Object
Starts/Restarts the @contact_links hash.
# File 'lib/gimme_poc/save.rb', line 24
def start_contact_links
  @contact_links = {}
end
.subdomain?(str) ⇒ Boolean
Boolean, returns true if url is not identical to original domain.
# File 'lib/gimme_poc/web.rb', line 77
def subdomain?(str)
  (unformat_url(str) != orig_domain(str))
end
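The comparison can be approximated without LazyDomain. In the sketch below, `naive_root_domain` is a deliberately simplified, hypothetical replacement for `orig_domain` that keeps only the last two labels; it is wrong for suffixes such as co.uk and is here only to make the comparison runnable.

```ruby
# Stand-in for LazyDomain.parse(str).domain: keeps the last two labels only.
# This is an assumption for illustration and not how LazyDomain works.
def naive_root_domain(host)
  host.split('.').last(2).join('.')
end

# Same substitution as unformat_url, with the HTTP_REGEX pattern inlined.
def strip_http(str)
  str.gsub(%r{(\A\bhttps:\/\/|\bhttp:\/\/)}, '')
end

# Hypothetical version of subdomain? built on the two helpers above.
def looks_like_subdomain?(str)
  host = strip_http(str)
  host != naive_root_domain(host)
end

p looks_like_subdomain?('http://maps.google.com') # true
p looks_like_subdomain?('https://google.com')     # false
p looks_like_subdomain?('www.google.com')         # true
```

Under this comparison any prefix, including www, counts as a subdomain.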
.unformat_url(str) ⇒ Object
Used for subdomain check. Not a permanent change to url variable.
# File 'lib/gimme_poc/web.rb', line 41
def unformat_url(str)
  str.gsub(HTTP_REGEX, '')
end