Class: RightScraper::Retrievers::CheckoutBase

Inherits:
Base
  • Object
Defined in:
lib/right_scraper/retrievers/checkout_base.rb

Overview

Base class for retrievers that perform version control operations (CVS, SVN, etc.). For basic operation a subclass needs to implement only Retrievers::Base#available? and #do_checkout. To support incremental operation it must also implement #exists? and #do_update, in addition to Retrievers::Base#ignorable_paths.
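As an illustration of the split between basic and incremental operation, here is a minimal sketch. It assumes a hypothetical stand-in base class (StubCheckoutBase) that copies only the documented defaults of #exists?, #do_checkout and #do_update, so that the example runs without right_scraper installed. MarkerRetriever and its marker file are also invented for the example and are not part of the library.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical stand-in mirroring only the documented defaults of
# CheckoutBase; this is NOT the right_scraper implementation.
class StubCheckoutBase
  def initialize(repo_dir)
    @repo_dir = repo_dir
  end

  # Documented default: no checkout is assumed to exist.
  def exists?
    false
  end

  def do_checkout
    raise NotImplementedError
  end

  def do_update
    raise NotImplementedError
  end
end

# Hypothetical retriever that "checks out" by writing a marker file.
class MarkerRetriever < StubCheckoutBase
  MARKER = 'CHECKED_OUT'

  # Incremental operation becomes possible once the marker is on disk.
  def exists?
    ::File.file?(marker_path)
  end

  # Full checkout: create the directory and write revision 1.
  def do_checkout
    ::FileUtils.mkdir_p(@repo_dir)
    ::File.write(marker_path, "1\n")
    true
  end

  # Incremental update: bump the revision stored in the marker.
  def do_update
    ::File.write(marker_path, "#{::File.read(marker_path).to_i + 1}\n")
    true
  end

  private

  def marker_path
    ::File.join(@repo_dir, MARKER)
  end
end
```

A subclass that implemented only #do_checkout would keep the default #exists? of false, so it would never be asked to update.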

Direct Known Subclasses

Git, Svn

Instance Attribute Summary

Attributes inherited from Base

#logger, #max_bytes, #max_seconds, #repo_dir, #repository

Instance Method Summary

Methods inherited from Base

#available?, #ignorable_paths, #initialize, repo_dir

Constructor Details

This class inherits a constructor from RightScraper::Retrievers::Base

Instance Method Details

#do_checkout ⇒ TrueClass

Perform a de novo full checkout of the repository. Subclasses must override this to do anything useful.

Returns:

  • (TrueClass)

    always true

Raises:

  • (NotImplementedError)


# File 'lib/right_scraper/retrievers/checkout_base.rb', line 157

def do_checkout
  raise NotImplementedError
end

#do_update ⇒ TrueClass

Perform an incremental update of the checkout. Subclasses that want to handle incremental updating need to override this.

Returns:

  • (TrueClass)

    always true

Raises:

  • (NotImplementedError)


# File 'lib/right_scraper/retrievers/checkout_base.rb', line 165

def do_update
  raise NotImplementedError
end

#do_update_tag ⇒ TrueClass

Updates the tag of the repository associated with this retriever to refer to the HEAD commit (SHA) on disk after retrieval.

Returns:

  • (TrueClass)

    always true

Raises:

  • (NotImplementedError)


# File 'lib/right_scraper/retrievers/checkout_base.rb', line 173

def do_update_tag
  raise NotImplementedError
end

#exists? ⇒ Boolean

Returns true if a checkout exists.

Returns:

  • (Boolean)

    true if the checkout already exists (and thus incremental updating can occur)


# File 'lib/right_scraper/retrievers/checkout_base.rb', line 117

def exists?
  false
end

#remote_differs? ⇒ TrueClass|FalseClass

Determines, where possible, whether the remote SHA/tag/branch referenced by the repository differs from what is on disk. Not all retrievers have this capability; those that do not should return true to indicate that the remote has changed.

Returns:

  • (TrueClass|FalseClass)

    true if changed



# File 'lib/right_scraper/retrievers/checkout_base.rb', line 127

def remote_differs?
  true
end
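A subclass that can compare references might override this along the following lines. This is only a sketch: the real method takes no arguments and would read the reference and the on-disk commit from the retriever itself, whereas the hypothetical functions below take both as parameters so the comparison can be run in isolation. The 7-character abbreviation is an assumption borrowed from the comparison that #retrieve performs.

```ruby
# Hypothetical helper: true when the requested ref names the same commit as
# the full SHA on disk, either exactly or as a 7-character abbreviation.
def ref_matches_head?(requested_ref, head_sha)
  requested_ref == head_sha || requested_ref == head_sha[0..6]
end

# Hypothetical override body (parameters are for illustration only).
# Anything that cannot be proven equal is reported as changed, which is
# consistent with the documented default of returning true.
def remote_differs?(requested_ref, head_sha)
  return true if requested_ref.nil? || head_sha.nil?
  !ref_matches_head?(requested_ref, head_sha)
end
```

Under these assumptions a branch name such as 'master' always counts as differing, because it cannot be matched against a SHA without asking the remote.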

#retrieve ⇒ Object

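Attempts an incremental update of the repository and falls back to a clean checkout when no checkout exists, the existing directory exceeds the size limit, or the update fails. Returns true after an update or checkout, and false when retrieval is skipped because the local HEAD already matches the remote reference.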

Raises:

  • (RetrieverError)

    if the retriever is unavailable


# File 'lib/right_scraper/retrievers/checkout_base.rb', line 39

def retrieve
  raise RetrieverError.new("retriever is unavailable") unless available?
  updated = false
  explanation = ''
  if exists?
    @logger.operation(:updating) do
      # a retriever may be able to determine that the repo directory is
      # already pointing to the same commit as the revision. in that case
      # we can return quickly.
      if remote_differs?
        # there is no point in updating and failing the size check when the
        # directory on disk already exceeds size limit; fall back to a clean
        # checkout in hopes that the latest revision corrects the issue.
        if size_limit_exceeded?
          explanation = 'switching to checkout due to existing directory exceeding size limit'
        else
          # attempt update.
          begin
            do_update
            updated = true
          rescue ::RightScraper::Processes::Shell::LimitError
            # update exceeded a limitation; requires user intervention
            raise
          rescue Exception
            # retry with clean checkout after discarding repo dir.
            explanation = 'switching to checkout after unsuccessful update'
          end
        end
      else
        # no retrieval needed but warn exactly why we didn't do full
        # checkout to avoid being challenged about it.
        repo_ref = @repository.tag
        do_update_tag
        full_head_ref = @repository.tag
        abbreviated_head_ref = full_head_ref[0..6]
        if repo_ref == full_head_ref || repo_ref == abbreviated_head_ref
          detail = abbreviated_head_ref
        else
          detail = "#{repo_ref} = #{abbreviated_head_ref}"
        end
        message =
          "Skipped updating local directory due to the HEAD commit SHA " +
          "on local matching the remote repository reference (#{detail})."
        @logger.note_warning(message)
        return false
      end
    end
  end

  # Clean checkout only if not updated.
  unless updated
    @logger.operation(:checkout, explanation) do
      # remove any full or partial directory before attempting a clean
      # checkout in case repo_dir is in a bad state.
      if exists?
        ::FileUtils.remove_entry_secure(@repo_dir)
      end
      ::FileUtils.mkdir_p(@repo_dir)
      begin
        do_checkout
      rescue Exception
        # clean checkout failed; repo directory is in an undetermined
        # state and must be deleted to prevent a future update attempt.
        if exists?
          ::FileUtils.remove_entry_secure(@repo_dir) rescue nil
        end
        raise
      end
    end
  end
  true
end
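The first decision made by the method above can be summarized as a small pure function. This is a sketch for reading purposes only: initial_action and its keyword arguments are hypothetical names, and the function does not model the later fallback from a failed update to a clean checkout.

```ruby
# Hypothetical summary of the first branch taken by #retrieve.
# Returns :skip, :update or :checkout.
def initial_action(exists:, remote_differs:, size_limit_exceeded:)
  # No checkout on disk: a clean checkout is the only option.
  return :checkout unless exists
  # Local HEAD already matches the remote reference: nothing to do.
  return :skip unless remote_differs
  # An oversized directory is discarded rather than updated.
  size_limit_exceeded ? :checkout : :update
end
```

Of the three outcomes, only :skip corresponds to a false return value; the other two end with true if they complete without raising.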

#size_limit_exceeded? ⇒ TrueClass|FalseClass

Determines whether the total size of the non-hidden files in repo_dir exceeds the size limit (max_bytes). Always returns false when no limit is set.

Returns:

  • (TrueClass|FalseClass)

    true if size limit exceeded



# File 'lib/right_scraper/retrievers/checkout_base.rb', line 135

def size_limit_exceeded?
  if @max_bytes
    # note that Dir.glob ignores hidden directories (e.g. ".git") so the
    # size total correctly excludes those hidden contents that are not to
    # be uploaded after scrape. this may cause the on-disk directory size
    # to far exceed the upload size.
    globbie = ::File.join(@repo_dir, '**/*')
    size = 0
    ::Dir.glob(globbie) do |f|
      size += ::File.stat(f).size rescue 0 if ::File.file?(f)
      break if size > @max_bytes
    end
    size > @max_bytes
  else
    false
  end
end
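The comment about hidden directories can be checked with a short standalone script. This is a sketch under stated assumptions: visible_bytes is a hypothetical helper that sums file sizes with the same '**/*' pattern but without the early break or the max_bytes comparison, and the directory layout is invented for the demonstration.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: total size of regular files matched by '**/*'.
# Without File::FNM_DOTMATCH, Dir.glob does not match dotfiles or descend
# into hidden directories, so entries such as ".git" are not counted.
def visible_bytes(dir)
  ::Dir.glob(::File.join(dir, '**/*')).sum do |f|
    ::File.file?(f) ? ::File.size(f) : 0
  end
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, '.git'))
  FileUtils.mkdir_p(File.join(dir, 'lib'))
  File.write(File.join(dir, '.git', 'pack'), 'x' * 1000) # hidden, not counted
  File.write(File.join(dir, 'lib', 'a.rb'), 'y' * 10)    # visible, counted
  puts visible_bytes(dir) # prints 10
end
```

The directory in this demonstration occupies more than 1000 bytes on disk while the measured total is 10, which is the gap between on-disk size and upload size that the source comment warns about.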